hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-599) BufferedPositionedInputStream isn't buffered
Date Tue, 06 Jan 2009 01:11:44 GMT

    [ https://issues.apache.org/jira/browse/PIG-599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-599:
---------------------------

    Attachment: loadperf.patch

This patch changes BufferedPositionedInputStream to wrap a BufferedInputStream around the
provided InputStream.  It also adds a new constructor for DefaultTuple (and new calls in TupleFactory)
that takes an ArrayList<Object> and uses it directly to construct the DefaultTuple instead
of copying the list (as was done previously).  In a run of the pig mix queries, these changes
made most queries about 25-40% faster.
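
The difference between copying the list and using it directly can be sketched roughly as follows. This is an illustration only, not the code in loadperf.patch: SketchTuple and its methods copyOf, wrapping and isBackedBy are hypothetical names, and the real DefaultTuple and TupleFactory signatures may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a tuple class; NOT Pig's DefaultTuple.
final class SketchTuple {
    private final List<Object> fields;

    private SketchTuple(List<Object> fields) {
        this.fields = fields;
    }

    // Copying construction: allocates a new list and copies every reference.
    static SketchTuple copyOf(List<Object> source) {
        return new SketchTuple(new ArrayList<>(source));
    }

    // No-copy construction: keeps the caller's list as the backing store.
    // The caller must not modify the list afterwards, because the tuple
    // would see the change.
    static SketchTuple wrapping(ArrayList<Object> source) {
        return new SketchTuple(source);
    }

    Object get(int i) {
        return fields.get(i);
    }

    int size() {
        return fields.size();
    }

    // True only if this tuple shares the given list instance.
    boolean isBackedBy(List<Object> list) {
        return fields == list;
    }
}
```

The no-copy path saves one allocation and one pass over the fields per tuple, at the price of aliasing: it is only safe when the caller hands the list over and does not touch it again.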

> BufferedPositionedInputStream isn't buffered
> --------------------------------------------
>
>                 Key: PIG-599
>                 URL: https://issues.apache.org/jira/browse/PIG-599
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>             Fix For: types_branch
>
>         Attachments: loadperf.patch
>
>
> org.apache.pig.impl.io.BufferedPositionedInputStream is not actually buffered.  This
> is because it sits atop a FSDataInputStream (somewhere down the stack), which is buffered.
> So to avoid double buffering, which can be bad, BufferedPositionedInputStream was written
> without buffering.  But the FSDataInputStream is far enough down the stack that it is still
> quite costly to call read() individually for each byte.  A run through a profiler shows that
> a fair amount of time is being spent in BufferedPositionedInputStream.read().
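
The per-byte cost described in the issue, and the effect of putting a java.io.BufferedInputStream directly under the reader, can be sketched as below. CountingStream and ReadCostSketch are hypothetical names, and the in-memory ByteArrayInputStream is only a stand-in for the deeper stack (FSDataInputStream and friends); nothing here is Pig code.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for a deep stream stack: counts read calls reaching it.
class CountingStream extends InputStream {
    private final InputStream in;
    long calls = 0;

    CountingStream(InputStream in) {
        this.in = in;
    }

    @Override
    public int read() throws IOException {
        calls++;
        return in.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        calls++;
        return in.read(b, off, len);
    }
}

class ReadCostSketch {
    // Consumes the stream one byte at a time, as a per-byte parser would.
    static long drain(InputStream in) throws IOException {
        long n = 0;
        while (in.read() != -1) {
            n++;
        }
        return n;
    }

    // Every read() goes straight down the stack: one call per byte, plus EOF.
    static long callsUnbuffered(byte[] data) throws IOException {
        CountingStream s = new CountingStream(new ByteArrayInputStream(data));
        drain(s);
        return s.calls;
    }

    // BufferedInputStream refills its buffer in bulk, so far fewer calls
    // reach the underlying stream.
    static long callsBuffered(byte[] data) throws IOException {
        CountingStream s = new CountingStream(new ByteArrayInputStream(data));
        drain(new BufferedInputStream(s));
        return s.calls;
    }
}
```

With the buffer in place, the number of calls that travel down the stack depends on the buffer size rather than on the number of bytes consumed.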

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

