pig-dev mailing list archives

From "Benjamin Reed (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-96) It should be possible to spill big databags to HDFS
Date Thu, 07 Feb 2008 15:30:08 GMT

    [ https://issues.apache.org/jira/browse/PIG-96?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566637#action_12566637 ]

Benjamin Reed commented on PIG-96:

The bags we are spilling need to be processed on a single machine. The really big bag that
represents a relation is already in HDFS and spread across machines. (I would really like
to use a different term for bags that represent a relation versus bags that represent a group
of tuples inside another tuple, to avoid confusion in these kinds of discussions.) If the
bag is being processed by an algebraic function, we have already applied disjoint-subset parallelism,
so the only thing left is to spill to disk. Since it must be processed locally, we want to
keep it local and not put it on HDFS. The spill is also extremely temporary in nature, since
the bag will be processed locally and then thrown away.
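
To make the local-spill argument concrete, here is a minimal sketch of a bag that
spills to a local temp file and is thrown away after one pass. This is an illustration
only, not Pig's DataBag code: the class name LocalSpillBag, the tuple-count threshold
maxInMemory, and the modelling of tuples as single-line strings are all assumptions
made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch; not the actual Pig DataBag implementation.
final class LocalSpillBag implements AutoCloseable {
    private final int maxInMemory;                        // assumed threshold, in tuples
    private final List<String> buffer = new ArrayList<>(); // tuples still in memory
    private final List<Path> spillFiles = new ArrayList<>(); // local temp files, in spill order

    LocalSpillBag(int maxInMemory) {
        if (maxInMemory < 1) {
            throw new IllegalArgumentException("maxInMemory must be >= 1");
        }
        this.maxInMemory = maxInMemory;
    }

    // Tuples are modelled as single-line strings (no embedded newlines).
    void add(String tuple) {
        buffer.add(tuple);
        if (buffer.size() >= maxInMemory) {
            spill();
        }
    }

    // Write the in-memory tuples to a local temp file and free the buffer.
    private void spill() {
        try {
            Path file = Files.createTempFile("bag-spill-", ".tmp");
            file.toFile().deleteOnExit();                 // safety net if close() is skipped
            Files.write(file, buffer, StandardCharsets.UTF_8);
            spillFiles.add(file);
            buffer.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int spillCount() {
        return spillFiles.size();
    }

    // One local pass: stream the spilled runs from disk, then the in-memory tail.
    void forEach(Consumer<String> action) {
        try {
            for (Path file : spillFiles) {
                try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        action.accept(line);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        buffer.forEach(action);
    }

    // The spill is temporary: delete the files once the bag has been processed.
    @Override
    public void close() {
        try {
            for (Path file : spillFiles) {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        spillFiles.clear();
        buffer.clear();
    }
}
```

The point of the sketch is that the writer and the reader are the same process on the
same machine, and the files are deleted as soon as the pass is done, so nothing is
gained by placing them on a distributed file system.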

> It should be possible to spill big databags to HDFS
> ---------------------------------------------------
>                 Key: PIG-96
>                 URL: https://issues.apache.org/jira/browse/PIG-96
>             Project: Pig
>          Issue Type: Improvement
>          Components: data
>            Reporter: Pi Song
> Currently databags only get spilled to local disk, which costs 2 disk I/O operations. If
> databags are too big, this is not efficient.
> We should take advantage of HDFS, so if the databag is too big (determined by
> DataBag.getMemorySize() > a big threshold), let's spill it to HDFS. Also read from HDFS in
> parallel when data is required.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
