
Big Data and Analytics -- Apache Pig

Big Data Analytics: Introduction to Pig

1    Apache Pig

In a previous post, we started talking about big data and analytics, and we began with the Hadoop installation. In this post we will cover Apache Pig: how to install and configure it, and then illustrate two use cases (the Airline and Movies datasets).

1.1    Overview

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject).
In this section, we will walk through the Pig framework, its installation and configuration, and then apply some use cases.

1.2    Tools and Versions

-          Apache Pig 0.13.0
-          Ubuntu 14.04 LTS
-          Java 1.7.0_65 (java-7-openjdk-amd64)
-          Hadoop 2.5.1

1.3    Installation and Configurations

1-      Download Pig tar file using:
              wget http://apache.mirrors.pair.com/pig/latest/pig-0.13.0.tar.gz

2-      Untar the file and move its contents to /usr/local/pig
               tar -xzvf pig-0.13.0.tar.gz
               mv pig-0.13.0/ /usr/local/pig
3-      Edit the bashrc file and add Pig to the system path, as sketched below.
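Here is a minimal sketch of the lines to append to the bashrc file, assuming Pig was moved to /usr/local/pig as in step 2:

               export PIG_HOME=/usr/local/pig
               export PATH=$PATH:$PIG_HOME/bin

After saving the file, run source ~/.bashrc so the change takes effect in the current shell.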


4-      Now, we can run the Pig help command to get the list of available commands:
               pig -help
The usage information and the available options will be displayed.




5-      To start the Pig Grunt shell, we run the pig command:
             pig





1.4    Pig Running Modes

Pig has two run modes or exectypes:
-          Local Mode: runs Pig on a single machine using the local host and file system. It can be started using the following command:

              $ pig -x local
              e.g. $ pig -x local test.pig

-          MapReduce Mode: To run Pig in MapReduce mode, you need access to a Hadoop cluster and an HDFS installation. This mode can be started with either of the following:
$ pig
or
$ pig -x mapreduce
e.g. $ pig test.pig
or
$ pig -x mapreduce test.pig

Using either mode, we can run the Grunt shell, Pig scripts, or embedded programs.

1.5    Use Case 1 (Airline DataSet)

In this use case, we used the airline dataset available at http://stat-computing.org/dataexpo/2009/the-data.html, in particular the flight data for the year 1987.
Here is the Pig script used to load the dataset file and compute the total miles travelled:

records = LOAD '1987.csv' USING PigStorage(',') AS
                (Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance:int,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay);

milage_recs = GROUP records ALL;

tot_miles = FOREACH milage_recs GENERATE SUM(records.Distance);

STORE tot_miles INTO '/home/mohamed/totalmiles';
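Assuming the script above is saved under an illustrative name such as totalmiles.pig, it can be run in local mode with:

              $ pig -x local totalmiles.pig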
Running the script in local mode against the 1987 dataset (1987.csv) writes the total miles, after the Pig script execution finishes, to the file part-r-00000 inside the totalmiles output directory.
The result: 775009272
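The stored output can then be inspected from the shell, for example:

              $ cat /home/mohamed/totalmiles/part-r-00000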


The same example can also be run using MapReduce mode.

But first, we need to copy the airline dataset to the HDFS directory for Pig to process:

hdfs dfs -copyFromLocal /home/mohamed/1987.csv /user/mohamed/1987.csv
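With the input copied into HDFS, the script can be launched in MapReduce mode (again using the illustrative script name totalmiles.pig; note that in this mode the LOAD and STORE paths are resolved against HDFS, so the paths in the script may need adjusting):

              $ pig -x mapreduce totalmiles.pig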
1.6    Use Case 2 (Movies DataSet)

In this case I used the movies dataset downloaded from this location:
In the Grunt shell we enter the following commands:
        grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration);
        grunt> DUMP movies;
DUMP prints each record of the movies relation to the console as a tuple.
       grunt> movies_greater_than_four = FILTER movies BY (float)rating>4.0;

       A DUMP of movies_greater_than_four will then list only the movies with a rating above 4.0.



And we can also try the following:
             grunt> sample_10_percent = sample movies 0.1;
             grunt> dump sample_10_percent;
The SAMPLE keyword is used to draw a random sample from the data (in this case, roughly 10%).



We can also run various other operations over this dataset, such as ORDER BY, DISTINCT, and GROUP; a few sketches follow.
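As an illustrative sketch (the new relation names here are made up for this example), such operations can be tried in the Grunt shell as follows:

             grunt> movies_sorted = ORDER movies BY name;
             grunt> years = FOREACH movies GENERATE year;
             grunt> unique_years = DISTINCT years;
             grunt> movies_by_year = GROUP movies BY year;
             grunt> DUMP movies_by_year;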

One last command that I tried is ILLUSTRATE. This command is used to view the step-by-step execution of a sequence of statements on a small sample of the data; here is an example:

          grunt> ILLUSTRATE movies_greater_than_four;





1.7    Issues and Problems


-          The path to the Pig script output file after the STORE keyword should be quoted as a string, not left unquoted as shown in the book, since the Pig Latin compiler complains otherwise.

-          The Pig Grunt shell did not work when I only ran the export path command in the shell, but it worked well once I added the path to the bashrc file and sourced it.

      We have reached the end of our second post on big data and analytics. I hope you enjoyed reading it and experimenting with the Pig installation and configuration. The next post will be about Apache HBase.



