hadoop-pig-dev mailing list archives

From "Dmitriy V. Ryaboy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-984) PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data
Date Thu, 01 Oct 2009 01:15:23 GMT

    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761070#action_12761070 ]

Dmitriy V. Ryaboy commented on PIG-984:
---------------------------------------

Good idea.

It should be straightforward to look at the sort info associated with the ResourceSchema (see
the load/store proposal) to know whether the data is sorted; this frees us from relying on
loaders, lets us follow ORDER BYs and LIMITs, etc.

Still, this is not quite safe unless you know that the distribution key is a subset of your
group key.  A sorted input stream can still be split among mappers, with some rows for a given
key going to one mapper and the rest to another.  Do you have thoughts on how to handle such
cases?
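
To make the split hazard concrete, here is a minimal sketch in Python (not Pig code). The
data, the split point, and the helper name mapside_group are all made up for illustration;
the sketch only assumes that each mapper groups consecutive rows of its own split by key.

```python
from itertools import groupby

def mapside_group(rows):
    """Group consecutive rows by their first field, one split at a time."""
    return [(key, list(run)) for key, run in groupby(rows, key=lambda r: r[0])]

# Hypothetical sorted input, cut into two splits at an arbitrary row boundary.
sorted_rows = [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("c", 5)]
split_1, split_2 = sorted_rows[:3], sorted_rows[3:]

# Each mapper sees only its own split, so key "b" is emitted once per mapper.
groups = mapside_group(split_1) + mapside_group(split_2)
keys = [key for key, _ in groups]
# keys == ['a', 'b', 'b', 'c']
```

The input is sorted, yet "b" comes out as two partial groups, which is why sortedness alone
is not enough.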

Whether the map-side group applies can be inferred from the schema and distribution key. I
understand wanting a manual handle to turn on the behavior while developing, but the production
version of this can be done automatically (an "if distributed by and sorted on a subset of
group keys, apply map-side group" rule in the optimizer).

> PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data
> ----------------------------------------------------------------------------------------
>
>                 Key: PIG-984
>                 URL: https://issues.apache.org/jira/browse/PIG-984
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Richard Ding
>
> The general group by operation in Pig needs both mappers and reducers (the aggregation is
> done in reducers). This incurs disk writes/reads between mappers and reducers.
> However, in the cases where the input data has the following properties
>    1. The records with the same key are grouped together (for example, because the data is
>       sorted by the keys).
>    2. The records with the same key are in the same mapper input.
> the group by operation can be performed in the mappers only and thus remove the overhead
> of disk writes/reads.
> Alan proposed adding a hint to the group by clause like this one:
> {code}
> A = load 'input' using SomeLoader(...);
> B = group A by $0 using "mapside";
> C = foreach B generate ...
> {code}
> The proposed addition of using "mapside" to group will be a mapside group operator that
> collects all records for a given key into a buffer. When it sees a key change it will emit
> the key and bag for records it had buffered. It will assume that all records for a given key
> are collected together and thus there is no need to buffer across keys.
> It is expected that "SomeLoader" will be implemented by data systems such as Zebra to
> ensure the data emitted by the loader satisfies the above properties (1) and (2).
> It will be the responsibility of the user (or the loader) to guarantee these properties
> (1) & (2) before invoking the mapside hint for the group by clause. The Pig runtime can't
> check for the errors in the input data.
> For group by clauses with the mapside hint, Pig Latin will only support group by columns
> (including *), not group by expressions or group all.
>   
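
The buffering behavior described in the issue above can be sketched as a small Python
generator. This is an illustration under stated assumptions, not the Pig operator: the name
mapside_group_stream and the key_index parameter are hypothetical, and records are modeled
as plain tuples.

```python
_NO_KEY = object()  # sentinel meaning "no record seen yet"

def mapside_group_stream(records, key_index=0):
    """Yield (key, bag) pairs, flushing the buffer whenever the key changes.

    Assumes all records for a given key arrive together (property 1) and
    within the same mapper input (property 2); neither is checked here.
    """
    current_key = _NO_KEY
    buffer = []
    for record in records:
        key = record[key_index]
        if current_key is not _NO_KEY and key != current_key:
            yield current_key, buffer   # key changed: emit the finished bag
            buffer = []
        current_key = key
        buffer.append(record)
    if current_key is not _NO_KEY:
        yield current_key, buffer       # emit the bag for the last key

grouped = list(mapside_group_stream([("a", 1), ("a", 2), ("b", 3)]))
# grouped == [('a', [('a', 1), ('a', 2)]), ('b', [('b', 3)])]
```

Only one key's records are held in memory at a time. If property (1) is violated, the same
key is silently emitted more than once, consistent with the note that the runtime cannot
check the input.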

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

