pig-dev mailing list archives

From Dmitriy Ryaboy <dvrya...@gmail.com>
Subject Re: Early projection and lazy casting
Date Sun, 04 Dec 2011 16:17:48 GMT
Ah I see, PIG-1324..

On Sun, Dec 4, 2011 at 8:15 AM, Dmitriy Ryaboy <dvryaboy@gmail.com> wrote:
> flatten(lineitem) uses all the fields from lineitem, hence no pruning.
>
> On Fri, Dec 2, 2011 at 6:42 PM, Jie Li <jieli@cs.duke.edu> wrote:
>> Sure. The two lines in bold just drop the unnecessary fields.
>> Without them Pig would not project, especially for the lineitem table.
>>
>> lineitem = load '$input/lineitem' USING PigStorage('|') as
>> (l_orderkey:long, l_partkey:long, l_suppkey:long, l_linenumber:long,
>> l_quantity:double, l_extendedprice:double, l_discount:double, l_tax:double,
>> l_returnflag:chararray, l_linestatus:chararray, l_shipdate:chararray,
>> l_commitdate:chararray, l_receiptdate:chararray,l_shippingstruct:chararray,
>> l_shipmode:chararray, l_comment:chararray);
>>
>> part = load '$input/part' USING PigStorage('|') as (p_partkey:long,
>> p_name:chararray, p_mfgr:chararray, p_brand:chararray, p_type:chararray,
>> p_size:long, p_container:chararray, p_retailprice:double,
>> p_comment:chararray);
>>
>> *lineitem = foreach lineitem generate l_partkey, l_quantity,
>> l_extendedprice ;*
>> part = FILTER part BY p_brand == 'Brand#23' AND p_container == 'MED BOX';
>> *part = foreach part generate p_partkey;*
>>
>> COG1 = COGROUP part by p_partkey, lineitem by l_partkey;
>> COG1 = filter COG1 by COUNT(part) > 0;
>> COG2 = FOREACH COG1 GENERATE COUNT(part) as count_part, FLATTEN(lineitem),
>> 0.2 * AVG(lineitem.l_quantity) as l_avg;
>>
>> COG3 = filter COG2 by l_quantity < l_avg;
>> COG = foreach COG3 generate (l_extendedprice * count_part) as l_sum;
>>
>> G1 = group COG ALL;
>>
>> result = foreach G1 generate SUM(COG.l_sum)/7.0;
>>
>>
>>
>> On Fri, Dec 2, 2011 at 9:16 PM, Dmitriy Ryaboy <dvryaboy@gmail.com> wrote:
>>
>>> Can you provide a script that shows projection not happening? We've
>>> observed the opposite (and use that fact extensively)
>>>
>>> D
>>>
>>> On Fri, Dec 2, 2011 at 4:05 PM, Jie Li <jieli@cs.duke.edu> wrote:
>>> > Hi all,
>>> >
>>> > We just figured out that Pig 0.9.1 doesn't drop unnecessary fields as
>>> > soon as possible, which really hurts performance, even though
>>> > http://ofps.oreilly.com/titles/9781449302641/load_and_store_funcs.html#loadfunc_loaderpushdown
>>> > said that "As part of its optimizations Pig analyzes Pig Latin scripts and
>>> > determines what fields in an input it needs at each step in the script. It
>>> > uses this information to aggressively drop fields it no longer needs."
>>> >
>>> > We also found that Pig casts the data into the types defined in the
>>> > schema, which is usually unnecessary, as most of them will be soon dropped.
>>> >
>>> > To work around these, we have to manually drop those fields and remove
>>> > the types from the schema, neither of which is really appealing.
>>> >
>>> > Jie
>>>
>>>
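
For reference, the second workaround mentioned in the original message (removing the types from the schema) could look roughly like the sketch below. It is only an illustration built from the lineitem load statement quoted above, not the script that was actually run: the field list is copied from the thread, and the choice to cast right after the early projection is an assumption. Without declared types the fields stay bytearray, so only the three fields the rest of the script needs are cast.

```pig
-- Sketch only: load lineitem without declared types, so every field stays a bytearray.
lineitem = load '$input/lineitem' USING PigStorage('|') as
    (l_orderkey, l_partkey, l_suppkey, l_linenumber,
     l_quantity, l_extendedprice, l_discount, l_tax,
     l_returnflag, l_linestatus, l_shipdate,
     l_commitdate, l_receiptdate, l_shippingstruct,
     l_shipmode, l_comment);

-- Project early and cast only the three fields used later in the script.
lineitem = foreach lineitem generate
    (long)l_partkey as l_partkey,
    (double)l_quantity as l_quantity,
    (double)l_extendedprice as l_extendedprice;
```

Whether this actually saves time depends on the loader and the Pig version, so it is worth measuring against the typed schema before relying on it.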
