

Big Data and Analytics -- Apache Pig

Big Data Analytics: Introduction to Pig

1    Apache Pig

In a previous post, we started talking about big data and analytics, beginning with the Hadoop installation. In this post we will cover Apache Pig: how to install and configure it, followed by two use cases (the Airline and Movies data sets).

1.1    Overview

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject).
In this section, we will walk through the Pig framework, its installation and configuration, and apply it to some use cases.

1.2    Tools and Versions

-          Apache Pig 0.13.0
-          Ubuntu 14.04 LTS
-          Java 1.7.0_65 (java-7-openjdk-amd64)
-          Hadoop 2.5.1

1.3    Installation and Configurations

1-      Download Pig tar file using:
              wget http://apache.mirrors.pair.com/pig/latest/pig-0.13.0.tar.gz

2-      Untar the file and move its contents to /usr/local/pig
               tar -xzvf pig-0.13.0.tar.gz
               mv pig-0.13.0/ /usr/local/pig
3-      Edit the bashrc file and add pig to the system path
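The lines added to the bashrc file might look like the following sketch (the paths assume the install location used above; the JAVA_HOME path matches the version listed in section 1.2 and may differ on your machine):

```shell
# Append to ~/.bashrc, then reload it with: source ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin
```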


4-      Now, we can run the Pig help command to get a list of the available commands and options:
               pig -help




5-      And to start the Pig Grunt shell, we run the pig command:
             pig





1.4    Pig Running Modes:

Pig has two run modes or exectypes:
-          Local Mode: runs Pig in a single JVM against the local file system. It can be started using the following command:

              $ pig -x local
              e.g. $ pig -x local test.pig

-          Mapreduce Mode: to run Pig in mapreduce mode, you need access to a Hadoop cluster and an HDFS installation. This is the default mode and can be started using any of the following:
$ pig
Or
$ pig -x mapreduce
e.g. $ pig test.pig
or
$ pig -x mapreduce test.pig

Using either mode, we can run the Grunt shell, Pig scripts, or embedded programs.
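For reference, a script like the test.pig mentioned above could be as simple as the following sketch (the file and field names are illustrative, not from the original post):

```pig
-- test.pig: load a comma-separated file and print its contents
data = LOAD 'input.csv' USING PigStorage(',') AS (id:int, name:chararray);
DUMP data;
```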

1.5    Use Case 1 (Airline DataSet)

In this use case, we used the airline dataset available at this location: http://stat-computing.org/dataexpo/2009/the-data.html
In particular, we used the flight data for the year 1987.
Here is the Pig script used to load the dataset file and compute the total miles traveled:

records = LOAD '1987.csv' USING PigStorage(',') AS
                (Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance:int,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay);

milage_recs = GROUP records ALL;

tot_miles = FOREACH milage_recs GENERATE SUM(records.Distance);

STORE tot_miles INTO '/home/mohamed/totalmiles';
Running this script in local mode against the 1987.csv dataset writes the total miles to the file part-r-00000 inside the output directory named in the STORE statement.
The result: 775009272
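As a quick local sanity check of what the Pig script computes, here is a small Python sketch (not part of the original post) that sums the same column, assuming the column layout shown in the LOAD statement, with Distance at index 18 and "NA" marking missing values:

```python
import csv

def total_miles(path):
    """Sum the Distance column (index 18) of the airline CSV, skipping the header row."""
    total = 0
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row (Year,Month,...,Distance,...)
        for row in reader:
            dist = row[18]  # Distance column, per the schema in the Pig script
            if dist.isdigit():  # "NA" marks a missing distance; skip it
                total += int(dist)
    return total
```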


The same example can also be run using MapReduce mode.

But first, we need to copy the airline dataset into HDFS for Pig to process:

hdfs dfs -copyFromLocal /home/mohamed/1987.csv /user/mohamed/1987.csv

1.6    Use Case 2 (Movies DataSet)

In this use case, I used the movies dataset downloaded from this location:
In the Grunt shell we enter the following commands:
        grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration);
        grunt> DUMP movies;
The output lists every tuple in the movies relation.



       grunt> movies_greater_than_four = FILTER movies BY (float)rating>4.0;

       DUMPing movies_greater_than_four will show only the movies with a rating greater than 4.0.



And we can also try the following:
             grunt> sample_10_percent = sample movies 0.1;
             grunt> dump sample_10_percent;
The SAMPLE operator is used to get a random sample of the data (in this case, roughly 10% of the records).



We can also run other operators over this dataset, such as ORDER BY, DISTINCT, and GROUP BY.
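A few sketches of those operators, assuming the same movies relation loaded above (these examples are illustrative, not from the original post):

```pig
-- order the movies alphabetically by name, descending
ordered = ORDER movies BY name DESC;

-- the distinct release years in the dataset
years = FOREACH movies GENERATE year;
uniq_years = DISTINCT years;

-- the number of movies released per year
by_year = GROUP movies BY year;
per_year = FOREACH by_year GENERATE group AS year, COUNT(movies) AS n;
```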

One last command that I tried is ILLUSTRATE. This command is used to view the step-by-step execution of a sequence of statements on a small sample of the data; here is an example:

          grunt> ILLUSTRATE movies_greater_than_four;





1.7    Issues and problems:


-          The path to the Pig script output file after the STORE operator should be quoted as a string, unlike the unquoted form shown in the book; otherwise the Pig Latin compiler complains.

-          The Pig Grunt shell did not start even though I ran the export path command in the terminal, but it worked well once I added the path to the bashrc file and sourced it.

      We have reached the end of our second post on big data and analytics. We hope you enjoyed reading it and experimenting with the Pig installation and configuration. The next post will be about Apache HBase.





More Stories By Mohamed El-Refaey

Mohamed El-Refaey works as head of research and development at EDC (Egypt Development Center), a member of NTG. He previously worked for Qlayer (acquired by Sun Microsystems), where his passion for the cloud computing domain started. He has more than 10 years of experience in software design and development in e-commerce, BPM, EAI, Web 2.0, banking applications, financial markets, Java and J2EE, HIPAA, SOX, BPEL, and SOA, and has spent the last two years focusing on virtualization technology and cloud computing in studies, technical papers, research, and international group participation and events. He was recognized for innovation and thought leadership while working as an IT Specialist at EDS (an HP company). He is also a member of the Cloud Computing Interoperability Forum (CCIF) and of the UCI (Unified Cloud Interface) open source project, to which he contributed the project architecture.
