hive-dev mailing list archives

From "Gopal V (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-3992) Hive RCFile::sync(long) does a sub-sequence linear search for sync blocks
Date Wed, 06 Feb 2013 21:23:14 GMT
Gopal V created HIVE-3992:
-----------------------------

             Summary: Hive RCFile::sync(long) does a sub-sequence linear search for sync blocks
                 Key: HIVE-3992
                 URL: https://issues.apache.org/jira/browse/HIVE-3992
             Project: Hive
          Issue Type: Bug
         Environment: Ubuntu x86_64/java-1.6/hadoop-2.0.3
            Reporter: Gopal V


The following function performs inefficient I/O:

{code}
public synchronized void sync(long position) throws IOException {
  ...
      try {
        seek(position + 4); // skip escape
        in.readFully(syncCheck);
        int syncLen = sync.length;
        for (int i = 0; in.getPos() < end; i++) {
          int j = 0;
          for (; j < syncLen; j++) {
            if (sync[j] != syncCheck[(i + j) % syncLen]) {
              break;
            }
          }
          if (j == syncLen) {
            in.seek(in.getPos() - SYNC_SIZE); // position before
            // sync
            return;
          }
          syncCheck[i % syncLen] = in.readByte();
        }
      }
...
    }
{code}

This results in a very large number of readByte() calls, each of which is passed on to a
ByteBuffer via a single-byte array.
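
One way to avoid the per-byte calls is to read in bulk and scan in memory. The following is
only a minimal sketch under stated assumptions, not the RCFile code: the class and method names
(ChunkedSyncFinder, find) and the chunkSize parameter are hypothetical, and a plain
java.io.InputStream stands in for the seekable stream that RCFile actually uses. It keeps the
last sync.length - 1 bytes of each chunk so that a marker straddling a chunk boundary is still
found.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; not part of Hive or Hadoop.
final class ChunkedSyncFinder {

    /**
     * Returns the offset (relative to the stream's starting position) of the
     * first occurrence of sync, or -1 if the stream ends before one is found.
     */
    static long find(InputStream in, byte[] sync, int chunkSize) throws IOException {
        int n = sync.length;
        if (n == 0 || chunkSize < 1) {
            throw new IllegalArgumentException("need a non-empty sync and chunkSize >= 1");
        }
        byte[] buf = new byte[chunkSize + n - 1];
        int carried = 0; // bytes kept from the previous chunk
        long base = 0;   // stream offset of buf[0]
        while (true) {
            // One bulk read replaces up to chunkSize single-byte reads.
            int read = in.read(buf, carried, buf.length - carried);
            if (read < 0) {
                return -1;
            }
            int filled = carried + read;
            // Plain comparison at every start position that fits in the buffer.
            for (int start = 0; start + n <= filled; start++) {
                int j = 0;
                while (j < n && buf[start + j] == sync[j]) {
                    j++;
                }
                if (j == n) {
                    return base + start;
                }
            }
            // Keep the tail so a marker split across two reads is not missed.
            int keep = Math.min(n - 1, filled);
            System.arraycopy(buf, filled - keep, buf, 0, keep);
            base += filled - keep;
            carried = keep;
        }
    }
}
```

The comparison itself is still linear here; the sketch only changes how bytes are fetched.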

As a result, a large amount of CPU is burnt in the linear search for the sync pattern in the
input RCFile (up to 92% for a skewed example: a trivial map-join + limit 100).

Ideally this behaviour should be avoided altogether; at the very least, the search should be
replaced by a rolling hash for efficient comparison, since the sync pattern has a known width
of 16 bytes.
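
As a rough illustration of the rolling-hash idea, the sketch below searches an in-memory buffer
for a fixed-width marker in the Rabin-Karp style. It is an assumption-laden example rather than
a proposed patch: the names (SyncScanner, indexOfSync) and the hash base are hypothetical, and it
operates on a byte array instead of the RCFile input stream. The hash of the current window is
updated in constant time per byte, and a full byte-by-byte comparison runs only when the hashes
agree.

```java
// Hypothetical helper for illustration; not part of Hive or Hadoop.
final class SyncScanner {
    private static final int BASE = 31; // arbitrary odd multiplier (assumption)

    /** Returns the index of the first occurrence of sync in buf[0..len), or -1. */
    static int indexOfSync(byte[] buf, int len, byte[] sync) {
        int n = sync.length;
        if (len < n) {
            return -1;
        }
        int target = 0;  // hash of the sync marker
        int window = 0;  // hash of buf[start..start+n)
        int highPow = 1; // BASE^(n-1); int arithmetic wraps mod 2^32 consistently
        for (int i = 0; i < n; i++) {
            target = target * BASE + (sync[i] & 0xff);
            window = window * BASE + (buf[i] & 0xff);
            if (i < n - 1) {
                highPow *= BASE;
            }
        }
        for (int start = 0; ; start++) {
            // Verify on hash equality to rule out collisions.
            if (window == target && matches(buf, start, sync)) {
                return start;
            }
            if (start + n >= len) {
                return -1;
            }
            // Roll the window: drop buf[start], append buf[start + n].
            window = (window - (buf[start] & 0xff) * highPow) * BASE
                    + (buf[start + n] & 0xff);
        }
    }

    private static boolean matches(byte[] buf, int start, byte[] sync) {
        for (int j = 0; j < sync.length; j++) {
            if (buf[start + j] != sync[j]) {
                return false;
            }
        }
        return true;
    }
}
```

With a 16-byte marker this replaces up to 16 comparisons per position with one multiply, one
subtract and one add, at the cost of an occasional verification when hashes collide.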

A stack trace from a YourKit profile is attached.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
