hadoop-pig-dev mailing list archives

From "Olga Natkovich (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (PIG-614) reduce io during sharing scans of the same input datasets
Date Wed, 21 Jan 2009 00:49:59 GMT

     [ https://issues.apache.org/jira/browse/PIG-614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich resolved PIG-614.
--------------------------------

    Resolution: Duplicate

This issue will be addressed by https://issues.apache.org/jira/browse/PIG-627


> reduce io during sharing scans of the same input datasets 
> ----------------------------------------------------------
>
>                 Key: PIG-614
>                 URL: https://issues.apache.org/jira/browse/PIG-614
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Samuel Guo
>            Priority: Minor
>             Fix For: types_branch
>
>
> If we want to store different results generated from the same input dataset, we currently
> need to write two or more *STORE* clauses. These *STORE* clauses are translated into
> separate MR jobs, even though those jobs could share a scan of the same input dataset.
> For example:
> Dataset 'weather' contains weather records. Each record has three parts: wind, air, and
> temperature. We need to process each part of the records separately.
> We might write a Pig script like this:
> weather = load 'weather.txt' as (wind, air, temperature);
> wind_results = ... wind ...;
> air_results = ... air ...;
> temp_results = ... temperature ...;
> store wind_results into 'wind.results';
> store air_results into 'air.results';
> store temp_results into 'temp.results';
> Pig currently translates this script into three separate MR jobs that run sequentially:
> scan 'weather.txt', process the wind data, store the wind results; scan 'weather.txt' again,
> process the air data, store the air results; ...
> If the input dataset is large, this is inefficient.
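
The cost described in the issue can be illustrated with a rough sketch. This is plain Python, not Pig, and it does not reflect how Pig or PIG-627 actually implement anything. `CountingSource`, `run_separate_jobs`, `run_shared_scan`, and the sample records are all hypothetical names and data made up for the illustration. It only shows that running one pass per STORE reads the input three times, while a shared pass reads it once and produces the same results.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical record layout mirroring the script: (wind, air, temperature).
Record = Tuple[str, str, str]
Extractor = Callable[[Record], str]


class CountingSource:
    """Hypothetical stand-in for 'weather.txt' that counts full scans."""

    def __init__(self, records: List[Record]) -> None:
        self.records = list(records)
        self.scans = 0

    def scan(self) -> Iterator[Record]:
        # Each call models one complete read of the input dataset.
        self.scans += 1
        return iter(self.records)


def run_separate_jobs(source: CountingSource,
                      extractors: Dict[str, Extractor]) -> Dict[str, List[str]]:
    """One scan per output, like one MR job per STORE clause."""
    return {name: [extract(r) for r in source.scan()]
            for name, extract in extractors.items()}


def run_shared_scan(source: CountingSource,
                    extractors: Dict[str, Extractor]) -> Dict[str, List[str]]:
    """A single scan that feeds every output at once."""
    results: Dict[str, List[str]] = {name: [] for name in extractors}
    for record in source.scan():
        for name, extract in extractors.items():
            results[name].append(extract(record))
    return results


# Made-up sample data and per-field processing, for illustration only.
SAMPLE: List[Record] = [("5kt", "good", "12C"), ("9kt", "fair", "15C")]
EXTRACTORS: Dict[str, Extractor] = {
    "wind.results": lambda r: r[0],
    "air.results": lambda r: r[1],
    "temp.results": lambda r: r[2],
}

separate_source = CountingSource(SAMPLE)
separate_results = run_separate_jobs(separate_source, EXTRACTORS)

shared_source = CountingSource(SAMPLE)
shared_results = run_shared_scan(shared_source, EXTRACTORS)
```

After this runs, `separate_source.scans` is 3 and `shared_source.scans` is 1, while both result dictionaries hold the same values. With a large input, that difference in read passes is the I/O this issue asks to reduce.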

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

