hadoop-pig-dev mailing list archives

From "Samuel Guo (JIRA)" <j...@apache.org>
Subject [jira] Created: (PIG-614) reduce io during sharing scans of the same input datasets
Date Mon, 12 Jan 2009 07:30:59 GMT
reduce io during sharing scans of the same input datasets 

                 Key: PIG-614
                 URL: https://issues.apache.org/jira/browse/PIG-614
             Project: Pig
          Issue Type: Improvement
          Components: impl
    Affects Versions: types_branch
            Reporter: Samuel Guo
            Priority: Minor
             Fix For: types_branch

If we want to store different results generated from the same input dataset, we currently need
to write two or more *STORE* clauses. These *STORE* clauses are translated into separate
MR jobs, even though those jobs could share a single scan of the same input dataset.

For example:
The dataset 'weather' contains weather records. Each record has three fields: wind/air/temperature.
We need to process each field of the records separately.
We might write a Pig script like this:

weather = load 'weather.txt' as (wind, air, temperature);
wind_results = ... wind ...;
air_results = ...air...;
temp_results = ...temperature...;
store wind_results into 'wind.results';
store air_results into 'air.results';
store temp_results into 'temp.results';

Pig currently translates this script into three different MR jobs which run sequentially: scan
'weather.txt', process the wind data, store the wind results; scan 'weather.txt' again, process
the air data, store the air results; ... 

If the input dataset is large, this is inefficient, because the same input is read once per *STORE* clause.
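
To make the cost difference concrete, here is a small Python sketch (not Pig code, and not how Pig implements anything). It compares one scan per result against a single shared scan. The in-memory WEATHER sample, the CountingSource class, and the function names are all hypothetical and exist only for this illustration; records read is used as a rough stand-in for scan I/O.

```python
import io

# Hypothetical in-memory stand-in for 'weather.txt':
# each line holds wind, air, temperature.
WEATHER = "3,1012,21\n5,1009,19\n2,1015,24\n"


class CountingSource:
    """Counts records read, as a rough proxy for scan I/O."""

    def __init__(self, text):
        self.text = text
        self.records_read = 0

    def scan(self):
        # Every call re-reads the whole input from the start.
        for line in io.StringIO(self.text):
            self.records_read += 1
            wind, air, temp = line.strip().split(",")
            yield int(wind), int(air), int(temp)


def separate_scans(source):
    """One full scan per result, mirroring one MR job per STORE."""
    wind = [record[0] for record in source.scan()]
    air = [record[1] for record in source.scan()]
    temp = [record[2] for record in source.scan()]
    return wind, air, temp


def shared_scan(source):
    """A single scan that feeds all three results."""
    wind, air, temp = [], [], []
    for w, a, t in source.scan():
        wind.append(w)
        air.append(a)
        temp.append(t)
    return wind, air, temp


separate_source = CountingSource(WEATHER)
shared_source = CountingSource(WEATHER)
separate_results = separate_scans(separate_source)
shared_results = shared_scan(shared_source)
print(separate_source.records_read, shared_source.records_read)
```

Both versions produce the same three result lists, but on this three-record sample the separate version reads 9 records and the shared version reads 3. The gap grows with the number of *STORE* clauses and with the size of the input.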
