hive-dev mailing list archives

From "Ashutosh Chauhan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-3992) Hive RCFile::sync(long) does a sub-sequence linear search for sync blocks
Date Tue, 02 Apr 2013 14:37:15 GMT

    [ https://issues.apache.org/jira/browse/HIVE-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13619857#comment-13619857 ]

Ashutosh Chauhan commented on HIVE-3992:
----------------------------------------

The following tests failed on trunk:
* Test org.apache.hadoop.hive.ql.io.TestRCFile FAILED

*     [junit] Begin query: mapjoin_test_outer.q
    [junit] Deleted file:/home/ashutosh/hive/build/ql/test/data/warehouse/dest_1
    [junit] Running: diff -a /home/ashutosh/hive/build/ql/test/logs/clientpositive/mapjoin_test_outer.q.out /home/ashutosh/hive/ql/src/test/results/clientpositive/mapjoin_test_outer.q.out
    [junit] 414d413
    [junit] <
    [junit] 569a569
    [junit] >
    [junit] 1320d1319
    [junit] <
    [junit] 1475a1475
    [junit] >
    [junit] Exception: Client execution results failed with error code = 1
    [junit] See build/ql/tmp/hive.log, or try "ant test ... -Dtest.silent=false" to get more logs.
    [junit] Failed query: mapjoin_test_outer.q
                
> Hive RCFile::sync(long) does a sub-sequence linear search for sync blocks
> -------------------------------------------------------------------------
>
>                 Key: HIVE-3992
>                 URL: https://issues.apache.org/jira/browse/HIVE-3992
>             Project: Hive
>          Issue Type: Bug
>         Environment: Ubuntu x86_64/java-1.6/hadoop-2.0.3
>            Reporter: Gopal V
>            Assignee: Gopal V
>         Attachments: HIVE-3992.patch, select-join-limit.html
>
>
> The following function does some bad I/O
> {code}
> public synchronized void sync(long position) throws IOException {
>   ...
>       try {
>         seek(position + 4); // skip escape
>         in.readFully(syncCheck);
>         int syncLen = sync.length;
>         for (int i = 0; in.getPos() < end; i++) {
>           int j = 0;
>           for (; j < syncLen; j++) {
>             if (sync[j] != syncCheck[(i + j) % syncLen]) {
>               break;
>             }
>           }
>           if (j == syncLen) {
>             in.seek(in.getPos() - SYNC_SIZE); // position before
>             // sync
>             return;
>           }
>           syncCheck[i % syncLen] = in.readByte();
>         }
>       }
> ...
>     }
> {code}
> This causes a rather large number of readByte() calls, which are passed on to a ByteBuffer
> via a single-byte array.
> This results in a rather large amount of CPU being burnt in the linear search for the
> sync pattern in the input RCFile (up to 92% for a skewed example: a trivial map-join + limit
> 100).
> This behaviour should ideally be avoided, or at least replaced by a rolling hash for efficient
> comparison, since the sync marker has a known width of 16 bytes.
> The stack trace from a YourKit profile is attached.
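
For illustration only, the rolling-hash idea from the description could look roughly like the sketch below. It is not taken from HIVE-3992.patch. It assumes the caller has already pulled a chunk of the file into a byte array with one bulk read (for example readFully) instead of calling readByte() per position; the class name SyncScanSketch and the method indexOfSync are hypothetical.

```java
/** Hypothetical helper for illustration; not part of RCFile or HIVE-3992.patch. */
final class SyncScanSketch {
  private static final int BASE = 31;

  /**
   * Returns the index of the first occurrence of sync in buf[from, to), or -1.
   * A polynomial rolling hash (int arithmetic, wrapping mod 2^32) rejects most
   * positions in O(1); a full byte comparison runs only when the hashes agree.
   */
  static int indexOfSync(byte[] buf, int from, int to, byte[] sync) {
    int n = sync.length;
    if (n == 0 || to - from < n) {
      return -1; // empty marker or range too short to contain it
    }
    int target = 0;  // hash of the sync marker
    int window = 0;  // hash of the current n-byte window
    int highPow = 1; // BASE^(n-1), weight of the byte leaving the window
    for (int i = 0; i < n; i++) {
      target = target * BASE + (sync[i] & 0xff);
      window = window * BASE + (buf[from + i] & 0xff);
      if (i > 0) {
        highPow *= BASE;
      }
    }
    for (int start = from; ; start++) {
      if (window == target && matches(buf, start, sync)) {
        return start;
      }
      if (start + n >= to) {
        return -1; // no further window fits in the range
      }
      // Slide by one byte: drop buf[start], append buf[start + n].
      window = (window - (buf[start] & 0xff) * highPow) * BASE + (buf[start + n] & 0xff);
    }
  }

  /** Confirms a hash hit byte by byte, since different windows can share a hash. */
  private static boolean matches(byte[] buf, int start, byte[] sync) {
    for (int j = 0; j < sync.length; j++) {
      if (buf[start + j] != sync[j]) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    byte[] sync = new byte[16];
    for (int i = 0; i < sync.length; i++) {
      sync[i] = (byte) (i + 1);
    }
    byte[] chunk = new byte[64];
    System.arraycopy(sync, 0, chunk, 37, sync.length);
    System.out.println(indexOfSync(chunk, 0, chunk.length, sync)); // prints 37
  }
}
```

A marker that straddles two chunks would be missed by this sketch as written; a caller reading the file chunk by chunk would need to carry the last sync.length - 1 bytes of each chunk over into the next scan.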

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
