hadoop-common-dev mailing list archives

From "Arun C Murthy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-441) SequenceFile should support 'custom compressors'
Date Mon, 28 Aug 2006 11:32:33 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-441?page=comments#action_12430967 ] 
            
Arun C Murthy commented on HADOOP-441:
--------------------------------------

With respect to the new Compression{Input|Output}Stream interfaces proposed by Owen, here are some
thoughts and alternatives...

(Since then, a new idea is to have the above 'compression streams' implement the Data{Input|Output}
interfaces, so that they can be passed along to the write/readFields methods of Writable objects,
i.e. bridge a 'stream' with Data{Input|Output}.)

a) 

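public abstract class CompressionOutputStream extends DataOutputStream {
  // DataOutputStream has no no-arg constructor, so the wrapped stream must be passed up.
  protected CompressionOutputStream(OutputStream out) { super(out); }
  public abstract void write(int b) throws IOException;
  public abstract void write(byte[] b, int off, int len) throws IOException;
  public abstract void resetCompressionState() throws IOException; // 'reset'
  // Finishes writing compressed data to the output stream without closing the underlying stream.
  public abstract void finish() throws IOException;
}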

Here we leave DataOutputStream's other public methods (writeBoolean, writeInt, etc.) as-is,
on the assumption that they all _will_ internally call the two abstract 'write' methods,
which correctly 'compress'. (Is this a valid assumption on all JVMs, platforms, and versions?)
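
The assumption above can be checked on a given JVM with a small probe. The sketch below is not part of the proposal; WriteProbe and probeWriteInt are hypothetical names. It overrides the two 'write' methods, calls writeInt once, and reports how many of the four bytes passed through the overrides. A result of 4 would mean the assumption holds on that JVM; a smaller number would mean writeInt reaches the wrapped stream without going through the overrides, which is what the OpenJDK sources of DataOutputStream suggest.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical probe: counts the bytes that pass through the overridable write methods.
class WriteProbe extends DataOutputStream {
    int intercepted = 0;

    WriteProbe(OutputStream out) {
        super(out);
    }

    @Override
    public synchronized void write(int b) throws IOException {
        intercepted++;
        super.write(b);
    }

    @Override
    public synchronized void write(byte[] b, int off, int len) throws IOException {
        intercepted += len;
        super.write(b, off, len);
    }

    // Returns {bytes seen by the overrides, bytes that reached the sink} for one writeInt call.
    static int[] probeWriteInt() {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            WriteProbe probe = new WriteProbe(sink);
            probe.writeInt(42);
            probe.flush();
            return new int[] { probe.intercepted, sink.size() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the first number is below 4, a compressing subclass built as in (a) would let writeInt emit uncompressed bytes, which argues for options (b) or (c).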

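Since DataInputStream's read(byte[]) and read(byte[], int, int) methods are marked 'final', the input side cannot extend DataInputStream in the same way: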
public abstract class CompressionInputStream implements DataInput {
  public abstract int read() throws IOException;
  public abstract int read(byte[] b, int off, int len) throws IOException;
  public abstract void resetCompressionState() throws IOException; // 'reset'

  // The other DataInput methods are given concrete implementations
}
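
As a rough illustration of what those concrete implementations might look like, the sketch below builds readFully and readInt on top of the two abstract 'read' methods. It is deliberately partial: the class does not declare 'implements DataInput' because the remaining methods are omitted, and SketchCompressionInputStream and PassThroughInput are hypothetical names. The pass-through subclass does no decompression and exists only to exercise the code.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical partial sketch; a real class would implement all of java.io.DataInput.
abstract class SketchCompressionInputStream {
    // The two primitives a concrete decompressor would supply.
    public abstract int read() throws IOException;
    public abstract int read(byte[] b, int off, int len) throws IOException;

    // Loops until len bytes have been read, as DataInput.readFully requires.
    public void readFully(byte[] b, int off, int len) throws IOException {
        int done = 0;
        while (done < len) {
            int n = read(b, off + done, len - done);
            if (n < 0) {
                throw new EOFException();
            }
            done += n;
        }
    }

    // Big-endian int, matching the byte order DataOutput.writeInt produces.
    public int readInt() throws IOException {
        byte[] buf = new byte[4];
        readFully(buf, 0, 4);
        return ((buf[0] & 0xFF) << 24) | ((buf[1] & 0xFF) << 16)
             | ((buf[2] & 0xFF) << 8) | (buf[3] & 0xFF);
    }
}

// Pass-through subclass (no real decompression), used only to exercise the sketch.
class PassThroughInput extends SketchCompressionInputStream {
    private final InputStream in;

    PassThroughInput(InputStream in) {
        this.in = in;
    }

    @Override
    public int read() throws IOException {
        return in.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        return in.read(b, off, len);
    }

    // Reads one int from the given bytes, wrapping the checked exception.
    static int readIntFrom(byte[] bytes) {
        try {
            return new PassThroughInput(new ByteArrayInputStream(bytes)).readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```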


b) 

Same CompressionInputStream, but have CompressionOutputStream implement DataOutput instead of
extending DataOutputStream, to maintain symmetry. The drawback is that we would need to provide
concrete implementations of DataOutput's other public methods, with symmetry as the only gain.


c)

 To provide a 'true bridge' between streams and Writables we can create other classes:

public abstract class WritableOutputStream implements DataOutput { // or: extends DataOutputStream
}

public abstract class WritableInputStream implements DataInput {
}

public class CompressionOutputStream extends WritableOutputStream {
}

public class CompressionInputStream extends WritableInputStream {
}

 This would provide a more general bridge between streams and Writables and allow other
{In|Out}putStream implementations in the future. (This could also be a separate issue...)
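
For comparison, the bridge can also be approximated with stock JDK classes by stacking a DataOutputStream on top of a compressing stream, so that every primitive write is compressed no matter how DataOutputStream is implemented internally. The sketch below is only an illustration under stated assumptions: SketchWritable stands in for Hadoop's Writable so that the example is self-contained, and IntPairSketch and BridgeDemo are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Stand-in for Hadoop's Writable, declared here so the sketch is self-contained.
interface SketchWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical record type with two int fields.
class IntPairSketch implements SketchWritable {
    int first;
    int second;

    IntPairSketch() {
    }

    IntPairSketch(int first, int second) {
        this.first = first;
        this.second = second;
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }
}

class BridgeDemo {
    // Serializes w through DataOutputStream -> DeflaterOutputStream -> byte array.
    static byte[] compress(SketchWritable w) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            DeflaterOutputStream deflater = new DeflaterOutputStream(sink);
            DataOutputStream data = new DataOutputStream(deflater);
            w.write(data);
            data.flush();
            deflater.finish(); // completes the compressed data without closing the sink
            return sink.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Fills w from bytes produced by compress().
    static void decompress(byte[] bytes, SketchWritable w) {
        try {
            DataInputStream data = new DataInputStream(
                    new InflaterInputStream(new ByteArrayInputStream(bytes)));
            w.readFields(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The trade-off against (c) is that the compressing stream itself is not a DataOutput here; callers have to wrap it each time.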


Thoughts?

> SequenceFile should support 'custom compressors'
> ------------------------------------------------
>
>                 Key: HADOOP-441
>                 URL: http://issues.apache.org/jira/browse/HADOOP-441
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: io
>            Reporter: Arun C Murthy
>         Assigned To: Arun C Murthy
>             Fix For: 0.6.0
>
>
> SequenceFiles should support 'custom compressors' which can be specified by the user
> on creation of the file.
> Readily available packages for gzip and zip (java.util.zip) are among obvious choices
> to support. Of course there will be hooks so that other compressors can be added in future
> as long as there is a way to construct (input/output) streams on top of the compressor/decompressor.
> The 'classname' of the 'custom compressor/decompressor' could be stored in the header
> of the SequenceFile which can then be used by SequenceFile.Reader to figure out the appropriate
> 'decompressor'. Thus I propose we add constructors to SequenceFile.Writer which take in the
> 'classname' of the compressor's input/output stream classes (e.g. DeflaterOutputStream/InflaterInputStream
> or GZIPOutputStream/GZIPInputStream), which acts as the hook for future compressors/decompressors.
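
The 'classname' hook described in the issue could be sketched with plain reflection. This is an illustration only, not SequenceFile code: CodecByName and its methods are hypothetical, and the sketch assumes each stream class has a public constructor taking a single OutputStream or InputStream, which is true of the java.util.zip classes named above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds compression streams from class names, as a file header might record them.
class CodecByName {
    static OutputStream openOutput(String className, OutputStream raw) {
        try {
            return Class.forName(className)
                    .asSubclass(OutputStream.class)
                    .getConstructor(OutputStream.class)
                    .newInstance(raw);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot create " + className, e);
        }
    }

    static InputStream openInput(String className, InputStream raw) {
        try {
            return Class.forName(className)
                    .asSubclass(InputStream.class)
                    .getConstructor(InputStream.class)
                    .newInstance(raw);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot create " + className, e);
        }
    }

    // Compresses text with one named class and reads it back with the other.
    static String roundTrip(String outClass, String inClass, String text) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (OutputStream out = openOutput(outClass, sink)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            ByteArrayOutputStream plain = new ByteArrayOutputStream();
            try (InputStream in = openInput(inClass, new ByteArrayInputStream(sink.toByteArray()))) {
                byte[] chunk = new byte[256];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    plain.write(chunk, 0, n);
                }
            }
            return new String(plain.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A reader would look up the input-stream class name in the header and call something like openInput with it; a class that lacks the assumed constructor fails with IllegalArgumentException.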

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
