accumulo-notifications mailing list archives

From "Dave Marion (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ACCUMULO-2353) Test improvements to java.io.InputStream.skip() for possible Hadoop patch
Date Tue, 11 Feb 2014 23:56:19 GMT
Dave Marion created ACCUMULO-2353:
-------------------------------------

             Summary: Test improvements to java.io.InputStream.skip() for possible Hadoop patch
                 Key: ACCUMULO-2353
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2353
             Project: Accumulo
          Issue Type: Task
         Environment: Java 6 update 45 or later
Hadoop 2.2.0
            Reporter: Dave Marion
            Priority: Minor


At some point (early Java 7, I think, then backported around Java 6 Update 45), the java.io.InputStream.skip()
method was changed from skipping through a fixed byte[512] buffer to a lazily sized buffer of up to 2048 bytes.
The difference can be seen in DeflaterInputStream, which has not been updated:

{noformat}
    public long skip(long n) throws IOException {
        if (n < 0) {
            throw new IllegalArgumentException("negative skip length");
        }
        ensureOpen();

        // Skip bytes by repeatedly decompressing small blocks
        if (rbuf.length < 512)
            rbuf = new byte[512];

        int total = (int)Math.min(n, Integer.MAX_VALUE);
        long cnt = 0;
        while (total > 0) {
            // Read a small block of uncompressed bytes
            int len = read(rbuf, 0, (total <= rbuf.length ? total : rbuf.length));

            if (len < 0) {
                break;
            }
            cnt += len;
            total -= len;
        }
        return cnt;
    }
{noformat}

and java.io.InputStream in Java 6 Update 45:

{noformat}
    // MAX_SKIP_BUFFER_SIZE is used to determine the maximum buffer size to
    // use when skipping.
    private static final int MAX_SKIP_BUFFER_SIZE = 2048;

    public long skip(long n) throws IOException {

        long remaining = n;
        int nr;

        if (n <= 0) {
            return 0;
        }

        int size = (int)Math.min(MAX_SKIP_BUFFER_SIZE, remaining);
        byte[] skipBuffer = new byte[size];

        while (remaining > 0) {
            nr = read(skipBuffer, 0, (int)Math.min(size, remaining));

            if (nr < 0) {
                break;
            }
            remaining -= nr;
        }

        return n - remaining;
    }
{noformat}

In sample tests I saw about a 20% improvement in skip() when seeking towards the end of a
locally cached compressed file. Looking at DecompressorStream in Hadoop, its skip() method
is a near copy of the old InputStream implementation:

{noformat}
  private byte[] skipBytes = new byte[512];
  @Override
  public long skip(long n) throws IOException {
    // Sanity checks
    if (n < 0) {
      throw new IllegalArgumentException("negative skip length");
    }
    checkStream();
    
    // Read 'n' bytes
    int skipped = 0;
    while (skipped < n) {
      int len = Math.min(((int)n - skipped), skipBytes.length);
      len = read(skipBytes, 0, len);
      if (len == -1) {
        eof = true;
        break;
      }
      skipped += len;
    }
    return skipped;
  }
{noformat}
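The ~20% figure above came from my sample tests; a minimal self-contained benchmark along these lines (class and method names here are mine, not from any of the code above) can compare the two buffer sizes against an in-memory deflated stream:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Random;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class SkipBufferBench {

    // Skip n bytes by repeatedly reading into a scratch buffer of the given
    // size, mirroring the loop shape shared by DecompressorStream.skip() and
    // the old InputStream.skip().
    static long skipWithBuffer(InputStream in, long n, int bufSize) throws IOException {
        byte[] buf = new byte[bufSize];
        long remaining = n;
        while (remaining > 0) {
            int len = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (len < 0) {
                break;
            }
            remaining -= len;
        }
        return n - remaining;
    }

    // Deflate rawLen random (incompressible) bytes into memory, so every
    // skipped byte actually has to be inflated.
    static byte[] compressed(int rawLen) throws IOException {
        byte[] raw = new byte[rawLen];
        new Random(42).nextBytes(raw);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(bos)) {
            dos.write(raw);
        }
        return bos.toByteArray();
    }

    static long timeSkip(byte[] deflated, int rawLen, int bufSize) throws IOException {
        long start = System.nanoTime();
        InputStream in = new InflaterInputStream(new ByteArrayInputStream(deflated));
        long skipped = skipWithBuffer(in, rawLen, bufSize);
        if (skipped != rawLen) {
            throw new AssertionError("short skip: " + skipped);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        int rawLen = 8 * 1024 * 1024;
        byte[] deflated = compressed(rawLen);
        timeSkip(deflated, rawLen, 512); // warm-up
        System.out.printf("512-byte buffer:  %d us%n", timeSkip(deflated, rawLen, 512) / 1000);
        System.out.printf("2048-byte buffer: %d us%n", timeSkip(deflated, rawLen, 2048) / 1000);
    }
}
```

This measures the in-memory case only; results against a locally cached file on HDFS will differ.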

This task is to evaluate the change to DecompressorStream, with a possible patch to Hadoop and a
possible bug request to Oracle to port the InputStream.skip() changes to DeflaterInputStream.skip().

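As a rough illustration of what such a patch might look like (a sketch only, not the actual Hadoop change; the class name is invented), the fixed byte[512] scratch buffer could be replaced with the lazily sized, 2048-byte-capped buffer used by the newer InputStream.skip():

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch: skip() with the newer InputStream buffer strategy, keeping
// DecompressorStream's sanity check on negative lengths. The buffer is
// capped at 2048 bytes, sized to min(cap, n), and reused across calls.
public class LargeBufferSkipStream extends FilterInputStream {

    private static final int MAX_SKIP_BUFFER_SIZE = 2048;

    private byte[] skipBuffer = new byte[0]; // grown lazily, reused

    public LargeBufferSkipStream(InputStream in) {
        super(in);
    }

    @Override
    public long skip(long n) throws IOException {
        if (n < 0) {
            throw new IllegalArgumentException("negative skip length");
        }
        int size = (int) Math.min(MAX_SKIP_BUFFER_SIZE, n);
        if (skipBuffer.length < size) {
            skipBuffer = new byte[size];
        }
        long remaining = n;
        while (remaining > 0) {
            // Each read pulls up to one buffer's worth of decompressed bytes.
            int nr = in.read(skipBuffer, 0, (int) Math.min(size, remaining));
            if (nr < 0) {
                break; // EOF before n bytes were skipped
            }
            remaining -= nr;
        }
        return n - remaining;
    }
}
```

In the real DecompressorStream the read would go through the decompressor rather than a wrapped stream, but the buffer-sizing change is the same.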


--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
