hadoop-common-user mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject Re: SequenceFile and streaming
Date Sat, 30 May 2009 00:57:23 GMT
Well, I don't know much about the tar tool at all.  But bz2 is a VERY slow
compression scheme (though quite fascinating to read about how it works).  A
plain tar or a tar.gz will be faster, if the tool supports it.
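
To get a feel for the difference on your own data, a minimal Python sketch
along these lines times the standard-library bz2 and gzip codecs on the same
buffer. The sample payload and the helper names `time_codec` and
`make_payload` are made up for illustration, and the numbers depend on the
data and the machine, so treat it as a way to measure rather than as a
benchmark result.

```python
import bz2
import gzip
import os
import time


def time_codec(name, compress, decompress, payload):
    """Return (compressed_size, compress_seconds, decompress_seconds)."""
    start = time.perf_counter()
    packed = compress(payload)
    mid = time.perf_counter()
    restored = decompress(packed)
    end = time.perf_counter()
    if restored != payload:
        raise ValueError(f"{name} round trip failed")
    return len(packed), mid - start, end - mid


def make_payload():
    # Hypothetical sample data: repetitive text followed by random bytes.
    return (b"hadoop sequence file record\n" * 20000) + os.urandom(100000)


if __name__ == "__main__":
    data = make_payload()
    for name, c, d in (("bz2", bz2.compress, bz2.decompress),
                       ("gzip", gzip.compress, gzip.decompress)):
        size, c_sec, d_sec = time_codec(name, c, d, data)
        print(f"{name}: {size} bytes, "
              f"compress {c_sec:.3f}s, decompress {d_sec:.3f}s")
```

Swapping `make_payload` for a read of a real archive member would give
numbers closer to the conversion case discussed in this thread.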

On 5/28/09 10:10 PM, "walter steffe" <steffe@tiscali.it> wrote:

> Hi Tom,
>   I have seen the tar-to-seq tool, but the person who made it says it is
> very slow:
> "It took about an hour and a half to convert a 615MB tar.bz2 file to an
> 868MB sequence file". To me that is not acceptable.
> Normally, generating a tar file from 615MB of data takes less than
> one minute. And in my view the generation of a sequence file should be
> even simpler. You just have to append files and headers without worrying
> about hierarchy.
> Regarding the SequenceFileAsTextInputFormat I am not sure it will do the
> job I am looking for.
> The hadoop documentation says: SequenceFileAsTextInputFormat generates
> SequenceFileAsTextRecordReader which converts the input keys and values
> to their String forms by calling toString() method.
> Let us suppose that the keys and values were generated using tar-to-seq
> on a tar archive. Each value is a byte array that stores the content of a
> file, which can be any kind of data (for example a jpeg picture). It
> doesn't make sense to convert this data into a string.
> What is needed is a tool to simply extract the file as with
> tar -xf archive.tar filename. The hadoop framework can be used to
> extract a Java class, but you have to do that within a java program. The
> streaming package is meant to be used in a unix shell without the need
> of java programming. But I think it is not very useful if the
> sequencefile (which is the principal data structure of hadoop) is not
> accessible from a shell command.
> Walter
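
The "append files and headers" idea, and the tar-style extraction by name
asked for above, can be sketched in a few lines of Python. This is only an
illustration of a length-prefixed key/value container. It is not the actual
Hadoop SequenceFile format, which also has a file header, sync markers and
Writable serialization. The record layout and the function names
`append_record` and `extract` are assumptions made for the example.

```python
import io
import struct


def append_record(out, key, value):
    """Append one record: 4-byte key length, 4-byte value length, key, value."""
    key_bytes = key.encode("utf-8")
    out.write(struct.pack(">II", len(key_bytes), len(value)))
    out.write(key_bytes)
    out.write(value)


def extract(stream, wanted):
    """Scan records in order; return the value stored under `wanted`, or None."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return None  # end of container, key not found
        key_len, value_len = struct.unpack(">II", header)
        key = stream.read(key_len).decode("utf-8")
        if key == wanted:
            return stream.read(value_len)
        stream.seek(value_len, io.SEEK_CUR)  # skip this value unread


if __name__ == "__main__":
    container = io.BytesIO()
    append_record(container, "a.txt", b"hello")
    # Values are raw bytes, so binary content such as a jpeg is stored as-is.
    append_record(container, "pic.jpg", bytes([0xFF, 0xD8, 0xFF, 0xE0]))
    container.seek(0)
    print(extract(container, "pic.jpg"))
```

Because values are kept as raw bytes and never passed through a string
conversion, binary content comes back unchanged, which is the property the
quoted message is asking for.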
