hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-96) It should be possible to spill big databags to HDFS
Date Wed, 06 Feb 2008 17:03:07 GMT

    [ https://issues.apache.org/jira/browse/PIG-96?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566200#action_12566200 ]

Alan Gates commented on PIG-96:
-------------------------------

DataBags are spilled only when they are too large for memory, so an individual spill file
isn't more than a few GB.  All the spill files together could be larger, so we could open one
HDFS spill file and keep appending.  But this won't work in the sorted or distinct cases.
For the DefaultDataBag case we read the various spill files back serially anyway, so whether
they are on one disk or many doesn't matter.  The only case where writing to HDFS would help
us here is when the total bag exceeds the size of the machine's local disk.
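
For illustration, a minimal sketch of that serial read-back pattern. The class, the
spill-file list, and the tuple handler are hypothetical names, not Pig's actual
internals; the point is only that the files are consumed one after another, so their
placement (one disk, many disks, or HDFS) doesn't change the read pattern:

import java.io.*;
import java.util.List;

class SerialSpillReader {
    // Hypothetical: the spill files a bag has written, in write order.
    private final List<File> spillFiles;

    SerialSpillReader(List<File> spillFiles) {
        this.spillFiles = spillFiles;
    }

    // Read each spill file fully before opening the next one.
    void readAll(TupleHandler handler) throws IOException {
        for (File f : spillFiles) {
            try (DataInputStream in = new DataInputStream(
                     new BufferedInputStream(new FileInputStream(f)))) {
                while (in.available() > 0) {
                    handler.handle(in);   // hypothetical per-tuple callback
                }
            }
        }
    }

    interface TupleHandler {
        void handle(DataInput in) throws IOException;
    }
}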

> It should be possible to spill big databags to HDFS
> ---------------------------------------------------
>
>                 Key: PIG-96
>                 URL: https://issues.apache.org/jira/browse/PIG-96
>             Project: Pig
>          Issue Type: Improvement
>          Components: data
>            Reporter: Pi Song
>
> Currently databags only get spilled to local disk, which costs 2 disk I/O operations.
> If databags are too big, this is not efficient.
> We should take advantage of HDFS: if the databag is too big (determined by
> DataBag.getMemorySize() > a big threshold), let's spill it to HDFS, and also read from
> HDFS in parallel when the data is required.
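
A rough sketch of the threshold check the description proposes. The class name and
threshold value are illustrative, not an existing Pig API; only FileSystem.get() and
FileSystem.create() are the real Hadoop calls:

import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class SpillTarget {
    // Illustrative threshold, not an actual Pig configuration value.
    static final long HDFS_SPILL_THRESHOLD = 4L * 1024 * 1024 * 1024; // e.g. 4 GB

    // Pick a spill stream based on the bag's estimated in-memory size,
    // as reported by DataBag.getMemorySize().
    static OutputStream openSpillStream(long memorySize, File localFile,
                                        Path hdfsFile) throws IOException {
        if (memorySize > HDFS_SPILL_THRESHOLD) {
            FileSystem fs = FileSystem.get(new Configuration());
            return fs.create(hdfsFile);   // big bag: spill to HDFS
        }
        // Small enough: keep spilling to local disk as today.
        return new BufferedOutputStream(new FileOutputStream(localFile));
    }
}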

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

