pig-dev mailing list archives

From "Ashutosh Chauhan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1348) PigStorage making unnecessary byte array copy when storing data
Date Wed, 07 Apr 2010 01:28:33 GMT

    [ https://issues.apache.org/jira/browse/PIG-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854292#action_12854292 ]

Ashutosh Chauhan commented on PIG-1348:

Since this is mostly performance-related, there are a few more things we could get in,
depending on the complexity/speedup tradeoff:
1) PigLineRecordWriter#write() is synchronized. Is that needed? I don't see a scenario where
multiple threads write through the same object and could stomp on each other.
Am I missing something here?
2) Within write(), I think it can safely be assumed that value is of type Tuple, because the
argument of putNext() is of type Tuple. Then we can get rid of the instanceof check.
3) In StorageUtil.putField(), is it possible to get rid of DataType.findType(), possibly by
getting hold of the schema and taking the type information from there? If not, then maybe we
can cache the type info the first time instead of finding it on every call. At the very least,
we should get rid of the casts for simple types, as they are unnecessary. DataType.isComplex()
can be used to determine that.
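
To make the three suggestions above concrete, here is a minimal sketch. It is not the PIG-1348 patch and does not use Pig's classes: CachedTypeLineWriter, Kind and findKind() are hypothetical stand-ins for PigLineRecordWriter, the DataType constants and DataType.findType(). It assumes one thread per writer instance and a fixed type per column, which may not hold when no schema is declared.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a line-oriented record writer; not Pig's PigLineRecordWriter.
final class CachedTypeLineWriter {
    // Hypothetical type tags; Pig's DataType uses byte constants instead.
    enum Kind { SIMPLE, COMPLEX }

    private final OutputStream out;
    private final byte delimiter;
    private Kind[] cachedKinds;   // per-column kind, filled lazily
    int typeLookups = 0;          // counts slow-path lookups, for illustration only

    CachedTypeLineWriter(OutputStream out, byte delimiter) {
        this.out = out;
        this.delimiter = delimiter;
    }

    // Stand-in for a per-value type lookup such as DataType.findType().
    private Kind findKind(Object field) {
        typeLookups++;
        return (field instanceof Number || field instanceof CharSequence
                || field instanceof Boolean) ? Kind.SIMPLE : Kind.COMPLEX;
    }

    // Not synchronized: assumes each writer instance is used by a single thread.
    // The parameter is already the record type, so no instanceof check is needed.
    void write(List<?> tuple) {
        try {
            if (cachedKinds == null || cachedKinds.length != tuple.size()) {
                cachedKinds = new Kind[tuple.size()];
            }
            for (int i = 0; i < tuple.size(); i++) {
                if (i > 0) out.write(delimiter);
                Object field = tuple.get(i);
                if (field == null) continue;           // null becomes an empty field
                if (cachedKinds[i] == null) cachedKinds[i] = findKind(field);
                String text = cachedKinds[i] == Kind.SIMPLE
                        ? field.toString()             // no cast needed for simple types
                        : "(" + field + ")";           // placeholder for complex types
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            out.write('\n');
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        CachedTypeLineWriter w = new CachedTypeLineWriter(buf, (byte) '\t');
        w.write(Arrays.asList(1, "a"));
        w.write(Arrays.asList(2, "b"));
        System.out.print(new String(buf.toByteArray(), StandardCharsets.UTF_8));
        System.out.println("type lookups: " + w.typeLookups);
    }
}
```

The type lookup runs once per column rather than once per field per record. A column whose first value is null simply stays uncached until a non-null value shows up.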

> PigStorage making unnecessary byte array copy when storing data
> ---------------------------------------------------------------
>                 Key: PIG-1348
>                 URL: https://issues.apache.org/jira/browse/PIG-1348
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Ashutosh Chauhan
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>         Attachments: PIG-1348.patch
> InternalCachedBag estimates the memory available to the VM by using Runtime.getRuntime().maxMemory().
It then takes 10% (by default, though configurable) of this memory and divides it by the
number of bags. It keeps track of the memory used by the bags and proactively spills when their
memory usage gets close to these limits. Given all this, in theory InternalCachedBag should not
run out of memory when presented with more data than it can handle. But in practice we
find OOMs happening.
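
The budget arithmetic described above can be sketched as follows. This is an illustration only, not InternalCachedBag's code: BagMemoryBudget, perBagLimit() and shouldSpill() are hypothetical names, and the 10% default is taken from the description. The only real API used is Runtime.getRuntime().maxMemory().

```java
// Hypothetical helper illustrating the per-bag memory budget; not Pig code.
final class BagMemoryBudget {
    // Share of the heap given to all cached bags together (10% per the description).
    static final int DEFAULT_PERCENT = 10;

    // Bytes one bag may hold before it should spill.
    // Divides before multiplying so a huge maxMemory value cannot overflow.
    static long perBagLimit(long maxMemoryBytes, int percent, int numBags) {
        if (numBags <= 0 || percent < 0 || percent > 100) {
            throw new IllegalArgumentException("invalid percent or bag count");
        }
        return maxMemoryBytes / 100 * percent / numBags;
    }

    // Proactive spill check: true once a bag has reached its share.
    static boolean shouldSpill(long bytesUsed, long limit) {
        return bytesUsed >= limit;
    }

    public static void main(String[] args) {
        long max = Runtime.getRuntime().maxMemory();
        long limit = perBagLimit(max, DEFAULT_PERCENT, 4);
        System.out.println("per-bag limit in bytes: " + limit);
    }
}
```

The limit is only as good as the bytesUsed figure fed into the check. If the tracked usage undercounts what the bags really hold, the check fires too late, which is one way the OOM described above could still occur.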
