hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Gopal V (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-3992) Hive RCFile::sync(long) does a sub-sequence linear search for sync blocks
Date Tue, 02 Apr 2013 00:19:15 GMT

    [ https://issues.apache.org/jira/browse/HIVE-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13619365#comment-13619365
] 

Gopal V commented on HIVE-3992:
-------------------------------

That part of the patch can be dropped if hive created record readers only when the mapper
is processing, but it does create & sync record readers irrespective of whether the reader
is needed or not.

Making hive less eager about that would remove the need for that line, but this patch gave
performance improvements with the least code change instead of a patch higher above that affects
all record reader inits.
                
> Hive RCFile::sync(long) does a sub-sequence linear search for sync blocks
> -------------------------------------------------------------------------
>
>                 Key: HIVE-3992
>                 URL: https://issues.apache.org/jira/browse/HIVE-3992
>             Project: Hive
>          Issue Type: Bug
>         Environment: Ubuntu x86_64/java-1.6/hadoop-2.0.3
>            Reporter: Gopal V
>            Assignee: Gopal V
>         Attachments: HIVE-3992.patch, select-join-limit.html
>
>
> The following function does some bad I/O
> {code}
> public synchronized void sync(long position) throws IOException {
>   ...
>       try {
>         seek(position + 4); // skip escape
>         in.readFully(syncCheck);
>         int syncLen = sync.length;
>         for (int i = 0; in.getPos() < end; i++) {
>           int j = 0;
>           for (; j < syncLen; j++) {
>             if (sync[j] != syncCheck[(i + j) % syncLen]) {
>               break;
>             }
>           }
>           if (j == syncLen) {
>             in.seek(in.getPos() - SYNC_SIZE); // position before
>             // sync
>             return;
>           }
>           syncCheck[i % syncLen] = in.readByte();
>         }
>       }
> ...
>     }
> {code}
> This causes a rather large number of readByte() calls which are passed onto a ByteBuffer
via a single byte array.
> This results in rather a large amount of CPU being burnt in a the linear search for the
sync pattern in the input RCFile (upto 92% for a skewed example - a trivial map-join + limit
100).
> This behaviour should be avoided at best or at least replaced by a rolling hash for efficient
comparison, since it has a known byte-width of 16 bytes.
> Attached the stack trace from a Yourkit profile.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message