pig-dev mailing list archives

From "Ashutosh Chauhan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1348) PigStorage making unnecessary byte array copy when storing data
Date Wed, 07 Apr 2010 18:56:33 GMT

    [ https://issues.apache.org/jira/browse/PIG-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854643#action_12854643 ]

Ashutosh Chauhan commented on PIG-1348:

1) As far as I can see, TextOutputFormat has a synchronized write() because it is meant to work
even with mappers using MultithreadedMapRunner. But since that's not the case for Pig,
we can get rid of it, especially now that we are putting in our own PigTextOutputFormat instead
of using TextOutputFormat.
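
To illustrate point 1, here is a minimal sketch of a line writer whose write() is not synchronized. It is plain Java with no Hadoop dependency; the class name UnsyncLineWriter and its demo() helper are hypothetical and are not the actual PigTextOutputFormat code. It assumes exactly one thread calls write() on a given writer, which is the condition under which dropping the lock is safe.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a single-threaded line record writer.
// This is an illustration only, not Pig's or Hadoop's real class.
final class UnsyncLineWriter {
    private static final byte[] NEWLINE = "\n".getBytes(StandardCharsets.UTF_8);
    private final DataOutputStream out;

    UnsyncLineWriter(DataOutputStream out) {
        this.out = out;
    }

    // No 'synchronized' keyword: assumes one caller thread per writer.
    // The record bytes are written straight from the caller's buffer,
    // so no intermediate copy of the byte array is made here.
    void write(byte[] record, int offset, int length) throws IOException {
        out.write(record, offset, length);
        out.write(NEWLINE);
    }

    // Small demonstration: writes two records to an in-memory sink.
    static String demo() {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            UnsyncLineWriter writer = new UnsyncLineWriter(new DataOutputStream(sink));
            byte[] first = "a\tb".getBytes(StandardCharsets.UTF_8);
            byte[] second = "c".getBytes(StandardCharsets.UTF_8);
            writer.write(first, 0, first.length);
            writer.write(second, 0, second.length);
            return new String(sink.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If several threads could ever share one writer, the lock (or some other form of coordination) would have to stay.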

3) That's what I meant: if a Schema is available, we should use it to find types instead of
reflecting on every call. I suggested the workaround of caching for the case where we know the user
did provide a Schema but we don't have a handle on it. Clearly, if there is no schema, we need
to find the type every time. I can see that dealing with complex types, even when there is a schema,
is not straightforward. In any case, casts that are currently there for simple types are

As for performance numbers, both of these will save CPU time. If we are convinced that we are
always I/O bound, we can leave these things as they are.
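
The caching workaround from point 3 could look roughly like the sketch below. All names here (ColumnTypeCache, FieldType, typeOf, detections) are hypothetical and not part of Pig. It assumes every value in a given column has the same simple type, which is only reasonable when the user supplied a schema; without a schema the type would have to be detected on every call.

```java
// Hypothetical per-column type cache; an illustration, not Pig's code.
final class ColumnTypeCache {
    enum FieldType { INTEGER, LONG, DOUBLE, STRING, OTHER }

    private final FieldType[] cached;
    // Counts how often the instanceof chain actually ran.
    int detections = 0;

    ColumnTypeCache(int columns) {
        this.cached = new FieldType[columns];
    }

    // The per-value type check that we want to avoid repeating.
    static FieldType detect(Object value) {
        if (value instanceof Integer) return FieldType.INTEGER;
        if (value instanceof Long) return FieldType.LONG;
        if (value instanceof Double) return FieldType.DOUBLE;
        if (value instanceof String) return FieldType.STRING;
        return FieldType.OTHER;
    }

    // Returns the cached type for the column, detecting it only once.
    // A null value carries no type information, so it is never cached
    // and null is returned to signal "unknown".
    FieldType typeOf(int column, Object value) {
        FieldType type = cached[column];
        if (type != null) {
            return type;
        }
        if (value == null) {
            return null;
        }
        type = detect(value);
        cached[column] = type;
        detections++;
        return type;
    }
}
```

With two columns and any number of rows, detect() runs at most twice in this sketch; complex types would need more than a flat per-column entry.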

> PigStorage making unnecessary byte array copy when storing data
> ---------------------------------------------------------------
>                 Key: PIG-1348
>                 URL: https://issues.apache.org/jira/browse/PIG-1348
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Ashutosh Chauhan
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>         Attachments: PIG-1348.patch, PIG-1348_2.patch
> InternalCachedBag estimates the memory available to the VM by using Runtime.getRuntime().maxMemory().
It then uses 10% (by default, though configurable) of this memory and divides this memory by the
number of bags. It keeps track of the memory used by bags and proactively spills if the bags'
memory usage comes close to these limits. Given all this, in theory, when presented with more
data than it can handle, InternalCachedBag should not run out of memory. But in practice we
find OOM happening.
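
The budget arithmetic described above can be sketched as follows. The class BagMemoryBudget and its methods are hypothetical, the share is passed as an integer percentage for simplicity, and the spill check is a plain threshold; the real InternalCachedBag logic may differ.

```java
// Hypothetical sketch of the per-bag memory budget; not Pig's code.
final class BagMemoryBudget {

    // Share of the VM's maximum memory given to each bag.
    // percent is the portion of max memory reserved for all bags
    // (10 in the default case described above).
    static long perBagLimit(long maxMemoryBytes, int percent, int bagCount) {
        if (bagCount <= 0) {
            throw new IllegalArgumentException("bagCount must be positive");
        }
        long totalForBags = maxMemoryBytes / 100 * percent;
        return totalForBags / bagCount;
    }

    // Spill once the tracked usage reaches the bag's limit.
    static boolean shouldSpill(long usedBytes, long limitBytes) {
        return usedBytes >= limitBytes;
    }

    // Convenience overload that reads the limit from the running VM.
    static long perBagLimit(int percent, int bagCount) {
        return perBagLimit(Runtime.getRuntime().maxMemory(), percent, bagCount);
    }
}
```

For example, with a 1,000,000,000-byte maximum heap, a 10% share, and 4 bags, each bag gets 25,000,000 bytes. Note that maxMemory() reports the heap ceiling, not what is free at the moment, so an estimate like this can be optimistic when other objects already occupy much of the heap.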

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
