hadoop-common-user mailing list archives

From Tom White <...@cloudera.com>
Subject Re: SequenceFile and streaming
Date Thu, 28 May 2009 14:28:05 GMT
Hi Walter,

On Thu, May 28, 2009 at 6:52 AM, walter steffe <steffe@tiscali.it> wrote:
> Hello
>  I am a new user and I would like to use hadoop streaming with
> SequenceFile on both the input and output side.
> -The first difficulty arises from the lack of a simple tool to generate
> a SequenceFile starting from a set of files in a given directory.
> I would like to have something similar to "tar -cvf file.tar foo/"
> This should also work in the opposite direction, like "tar -xvf file.tar"

There's a tool for turning tar files into sequence files here:

> -Another important feature that I would like to see is the possibility
> to feed the mapper's stdin with the whole content of a file (extracted
> from the SequenceFile), disregarding the key.

Have a look at SequenceFileAsTextInputFormat, which will do this for
you (except that the key passed along is the sequence file's key).
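
To make the point about the key concrete, here is a minimal sketch of a streaming mapper that throws the key away. It assumes that streaming hands each record to the mapper as one line of the form key<TAB>value, and that the value contains no newlines; the function names split_record and values_only are made up for illustration.

```python
import sys


def split_record(line):
    """Split one streaming input line into (key, value) at the first tab."""
    line = line.rstrip("\n")
    key, _sep, value = line.partition("\t")
    # If the line has no tab, the whole line ends up as the key
    # and the value is the empty string.
    return key, value


def values_only(lines):
    """Yield the value of each record, disregarding the key."""
    for line in lines:
        yield split_record(line)[1]


if __name__ == "__main__":
    # Act as a filter: records in on stdin, values out on stdout.
    for value in values_only(sys.stdin):
        sys.stdout.write(value + "\n")
```

A script like this would be passed with -mapper, together with -inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat. Only the first tab is treated as the separator, so tabs inside the value survive.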

> Using each file as a tar archive, I would like to be able to do:
>  $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
>                  -input "/user/me/inputSequenceFile"  \
>                  -output "/user/me/outputSequenceFile"  \
>                  -inputformat SequenceFile \
>                  -outputformat SequenceFile \
>                  -mapper myscript.sh \
>                  -reducer NONE
>  myscript.sh should work as a filter which takes its input from
>  stdin and puts its output on stdout:
>  tar -x
>  "do something on the generated dir and create an outputfile"
>  cat outputfile
> The output file should (automatically) go into the outputSequenceFile.
> I think that this would be a very useful scheme which fits well with
> the mapreduce requirements on one side and with the unix commands on the
> other side. It should not be too difficult to implement the tools
> needed for that.

I totally agree - having more tools to better integrate sequence files
and map files with unix tools would be very handy.
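
As a rough illustration of the filter described above, the sketch below reads a tar archive from a binary stream and emits one line per regular file. It assumes the mapper really does receive the raw tar bytes on stdin, which is an assumption about delivery and not something line-oriented streaming is known to provide as-is. The "do something" step is replaced by a stand-in that just reports each member's name and size, and summarize_tar is a hypothetical name.

```python
import tarfile


def summarize_tar(stream):
    """Read a tar archive from a binary stream.

    Returns one "name<TAB>size" line per regular file in the archive.
    This is a placeholder for real per-archive processing.
    """
    lines = []
    # Mode "r|*" reads a non-seekable stream sequentially and
    # detects compression (if any) transparently.
    with tarfile.open(fileobj=stream, mode="r|*") as archive:
        for member in archive:
            if member.isfile():
                lines.append(f"{member.name}\t{member.size}")
    return lines
```

Used as a filter, the script would call summarize_tar(sys.stdin.buffer) and print each returned line to stdout, which plays the role of "cat outputfile" in the scheme above.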


> Walter
