hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ashutosh Chauhan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-3992) Hive RCFile::sync(long) does a sub-sequence linear search for sync blocks
Date Mon, 01 Apr 2013 22:45:15 GMT

    [ https://issues.apache.org/jira/browse/HIVE-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13619267#comment-13619267
] 

Ashutosh Chauhan commented on HIVE-3992:
----------------------------------------

I have a question related to 1) Does not do sync if ExecMapper.getDone() == true
Outside of Hive (meaning using RCFile independent of Hive), getDone() will always be false,
so this if block will always be skipped and part of this optimization will not fire. That
is alright, but want to make sure this will not result in wrong results for that scenario.
[~gopalv] Is that correct?

                
> Hive RCFile::sync(long) does a sub-sequence linear search for sync blocks
> -------------------------------------------------------------------------
>
>                 Key: HIVE-3992
>                 URL: https://issues.apache.org/jira/browse/HIVE-3992
>             Project: Hive
>          Issue Type: Bug
>         Environment: Ubuntu x86_64/java-1.6/hadoop-2.0.3
>            Reporter: Gopal V
>            Assignee: Gopal V
>         Attachments: HIVE-3992.patch, select-join-limit.html
>
>
> The following function does some bad I/O
> {code}
> public synchronized void sync(long position) throws IOException {
>   ...
>       try {
>         seek(position + 4); // skip escape
>         in.readFully(syncCheck);
>         int syncLen = sync.length;
>         for (int i = 0; in.getPos() < end; i++) {
>           int j = 0;
>           for (; j < syncLen; j++) {
>             if (sync[j] != syncCheck[(i + j) % syncLen]) {
>               break;
>             }
>           }
>           if (j == syncLen) {
>             in.seek(in.getPos() - SYNC_SIZE); // position before
>             // sync
>             return;
>           }
>           syncCheck[i % syncLen] = in.readByte();
>         }
>       }
> ...
>     }
> {code}
> This causes a rather large number of readByte() calls which are passed onto a ByteBuffer
via a single byte array.
> This results in rather a large amount of CPU being burnt in a the linear search for the
sync pattern in the input RCFile (upto 92% for a skewed example - a trivial map-join + limit
100).
> This behaviour should be avoided at best or at least replaced by a rolling hash for efficient
comparison, since it has a known byte-width of 16 bytes.
> Attached the stack trace from a Yourkit profile.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message