hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-54) SequenceFile should compress blocks, not individual entries
Date Fri, 18 Aug 2006 18:17:17 GMT
     [ http://issues.apache.org/jira/browse/HADOOP-54?page=all ]

Doug Cutting updated HADOOP-54:

    Status: Open  (was: Patch Available)

I think this is nearly ready.

A minor improvement: the typesafe enumeration instances should probably have a toString()
method, to facilitate debugging.
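To illustrate, here is a minimal sketch of the typesafe-enum pattern with the suggested toString(); the class and constant names are illustrative, not necessarily those used in the patch:

```java
// Sketch of a typesafe enumeration with a toString() for debugging.
// Names (CompressionType, NONE/RECORD/BLOCK) are assumptions here.
public class CompressionType {
    private final String name;

    private CompressionType(String name) { this.name = name; }

    public static final CompressionType NONE   = new CompressionType("NONE");
    public static final CompressionType RECORD = new CompressionType("RECORD");
    public static final CompressionType BLOCK  = new CompressionType("BLOCK");

    // Without this, debuggers and log output show something like
    // "CompressionType@1b6d3586", which is useless when tracing failures.
    public String toString() { return name; }
}
```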

Running the TestSequenceFile unit test caused my 515MB Ubuntu box to swap horribly, and the
test never completed.  I grabbed a stack trace and saw:

    [junit]     at java.util.zip.Inflater.init(Native Method)
    [junit]     at java.util.zip.Inflater.<init>(Inflater.java:75)
    [junit]     at java.util.zip.Inflater.<init>(Inflater.java:82)
    [junit]     at org.apache.hadoop.io.SequenceFile$CompressedBytes.<init>(SequenceFile.java:231)
    [junit]     at org.apache.hadoop.io.SequenceFile$CompressedBytes.<init>(SequenceFile.java:227)
    [junit]     at org.apache.hadoop.io.SequenceFile$Reader.createValueBytes(SequenceFile.java:1195)
    [junit]     at org.apache.hadoop.io.SequenceFile$Sorter$SortPass.run(SequenceFile.java:1459)
    [junit]     at org.apache.hadoop.io.SequenceFile$Sorter.sortPass(SequenceFile.java:1413)
    [junit]     at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:1386)
    [junit]     at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:1406)
    [junit]     at org.apache.hadoop.io.TestSequenceFile.sortTest(TestSequenceFile.java:178)

Since sorting should not do any inflating, the Inflater should probably not be created in
this case.  So maybe we should lazily initialize this field?
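One way to realize that lazy initialization, sketched with assumed field and method names (the real CompressedBytes differs): the Inflater is allocated only on first actual decompression, so a sort pass that merely shuffles compressed bytes never touches native zlib state.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical sketch of lazily creating the Inflater; field and
// method names are illustrative, not the actual SequenceFile code.
class CompressedBytes {
    private Inflater inflater;   // null until decompression is needed
    private byte[] data;

    void reset(byte[] compressed) {
        data = compressed;       // holding the bytes costs no Inflater
    }

    boolean inflaterCreated() { return inflater != null; }

    byte[] decompress() throws DataFormatException {
        if (inflater == null) {
            inflater = new Inflater();   // created on first real use
        }
        inflater.reset();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[512];
        while (!inflater.finished()) {
            int n = inflater.inflate(buf);
            if (n == 0) break;           // guard against malformed input
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```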

More generally, before we commit this we should ensure that performance is comparable to what
it was before.  Creating a new ValueBytes wrapper for each entry processed during sorting looks
expensive to me, but it may in fact be insignificant.  If it is significant, then we might replace
the ValueBytes API with a compressor API, where the bytes to be compressed are passed explicitly.
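If the per-entry wrapper does turn out to matter, that compressor-style API might look something like the following sketch: one reusable codec object is handed the bytes explicitly, so sorting allocates nothing per entry. The interface and names are my own invention, not from the patch.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical compressor API: bytes are passed explicitly to one
// reusable codec, instead of wrapping each entry in a new object.
interface RawCodec {
    byte[] compress(byte[] raw);
    byte[] decompress(byte[] compressed) throws DataFormatException;
}

class ZlibCodec implements RawCodec {
    // One Deflater/Inflater pair per codec, reused across all entries.
    private final Deflater deflater = new Deflater();
    private final Inflater inflater = new Inflater();

    public byte[] compress(byte[] raw) {
        deflater.reset();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[512];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        return out.toByteArray();
    }

    public byte[] decompress(byte[] compressed) throws DataFormatException {
        inflater.reset();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[512];
        while (!inflater.finished()) {
            int n = inflater.inflate(buf);
            if (n == 0) break;   // malformed or truncated input
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```

A sort pass could then hold a single ZlibCodec and call it on each entry's bytes, rather than constructing a fresh wrapper (and, transitively, a fresh Inflater) per entry.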

> SequenceFile should compress blocks, not individual entries
> -----------------------------------------------------------
>                 Key: HADOOP-54
>                 URL: http://issues.apache.org/jira/browse/HADOOP-54
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.2.0
>            Reporter: Doug Cutting
>         Assigned To: Arun C Murthy
>             Fix For: 0.6.0
>         Attachments: SequenceFile.updated.final.patch, SequenceFiles.final.patch, SequenceFiles.patch,
SequenceFilesII.patch, VIntCompressionResults.txt
> SequenceFile will optionally compress individual values.  But both compression and performance
would be much better if sequences of keys and values are compressed together.  Sync marks
should only be placed between blocks.  This will require some changes to MapFile too, so that
all file positions stored there are the positions of blocks, not entries within blocks.  Probably
this can be accomplished by adding a getBlockStartPosition() method to SequenceFile.Writer.

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

