hadoop-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Seekable interface and CompressInputStream question
Date Sat, 22 Dec 2012 06:29:14 GMT
The Seekable interface isn't used to detect splittable compressed
files. I also haven't seen it implemented properly in any of the
codecs in trunk, at least as of today (even Bzip2, the only natively
splittable codec, doesn't implement a seek function). AFAIK we don't
yet support seeking on a compressed input stream.

Any FileInputFormat derivative will consider splitting a provided file
path iff its isSplitable(…) method returns true. [1]

In its default implementation, a CompressionCodec is considered to
support splittable decompression iff it implements the interface
SplittableCompressionCodec. [2]

For each input path added to an MR job, before we try to split it, we
check whether the path is a compressed file and, if so, whether its
codec is splittable, using the calls and condition above. Only if the
check returns true do we go ahead and split. This is how mixed paths
are handled properly.
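
To make the check above concrete, here is a minimal, self-contained
sketch of the decision logic. It does not use the real Hadoop classes:
the two nested interfaces are stand-ins named after
org.apache.hadoop.io.compress.CompressionCodec and
SplittableCompressionCodec, and GzipLike, Bzip2Like, CODECS and the
suffix-based getCodec are hypothetical helpers added only for
illustration. Treat it as a model of the condition, not as the actual
FileInputFormat source.

```java
public class SplitCheckSketch {
    // Stand-in for Hadoop's CompressionCodec (assumption: only the
    // file-extension lookup matters for this illustration).
    interface CompressionCodec {
        String getDefaultExtension();
    }

    // Stand-in for Hadoop's SplittableCompressionCodec marker-style interface.
    interface SplittableCompressionCodec extends CompressionCodec {
    }

    // Hypothetical codec that cannot be split (models gzip).
    static final class GzipLike implements CompressionCodec {
        public String getDefaultExtension() { return ".gz"; }
    }

    // Hypothetical codec that can be split (models bzip2).
    static final class Bzip2Like implements SplittableCompressionCodec {
        public String getDefaultExtension() { return ".bz2"; }
    }

    // Hypothetical registry of known codecs.
    static final CompressionCodec[] CODECS = { new GzipLike(), new Bzip2Like() };

    // Pick a codec by file suffix; null means the file is not compressed.
    public static CompressionCodec getCodec(String path) {
        for (CompressionCodec codec : CODECS) {
            if (path.endsWith(codec.getDefaultExtension())) {
                return codec;
            }
        }
        return null;
    }

    // Uncompressed files are splittable; compressed files are splittable
    // only if their codec implements SplittableCompressionCodec.
    // Seekable plays no part in this decision.
    public static boolean isSplitable(String path) {
        CompressionCodec codec = getCodec(path);
        if (codec == null) {
            return true;
        }
        return codec instanceof SplittableCompressionCodec;
    }

    public static void main(String[] args) {
        String[] paths = { "logs/part-0.txt", "logs/part-1.gz", "logs/part-2.bz2" };
        for (String path : paths) {
            System.out.println(path + " -> splittable: " + isSplitable(path));
        }
    }
}
```

Because the check runs per path, a job whose input mixes plain, gzip
and bzip2 files gets a separate answer for each file, so a generic
InputFormat does not need to know the codec up front.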

[1] - http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html#isSplitable(org.apache.hadoop.mapreduce.JobContext,%20org.apache.hadoop.fs.Path)
[2] - http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/SplittableCompressionCodec.html

On Sat, Dec 22, 2012 at 7:43 AM, java8964 java8964 <java8964@hotmail.com> wrote:
> Hi,
> I have a question related to the Seekable interface. Right now I am using the
> CDH3 release, with hadoop 0.20.2. I understand that in it, the
> CompressInputStream will throw UnsupportedException in methods inherited
> from the Seekable interface, as they are not implemented.
> My question is: does Seekable mean the underlying InputStream will
> support splitting? If an InputStream is seekable, then it should be able
> to split, right?
> If so, in a future release, I assume that CompressInputStream will
> implement Seekable in hadoop. But my understanding is that some compression
> formats can be split and some cannot. If the data file is a gzip file, and
> let's say that I get a CompressInputStream that does support Seekable, with
> the Gzip codec, I will assume it is splittable, but in fact it isn't. How do
> I write a generic InputFormat to support both splittable and unsplittable
> compressed input streams in this case? Or is my understanding incorrect, and
> Seekable and Split are totally different things?
> Thanks
> Yong

Harsh J
