hadoop-common-user mailing list archives

From: Owen O'Malley <omal...@apache.org>
Subject: Re: SequenceFiles and binary data
Date: Mon, 15 Sep 2008 16:59:57 GMT

On Sep 14, 2008, at 7:15 PM, John Howland wrote:

> If I want to read values out of input files as binary data, is this
> what BytesWritable is for?


> I've successfully run my first task that uses a SequenceFile for
> output. Are there any examples of SequenceFile usage out there? I'd
> like to see the full range of what SequenceFile can do.

If you want serious usage, I'd suggest pulling up Nutch. Distcp also  
uses sequence files as its input.
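
A minimal sketch of writing and then reading a SequenceFile, assuming a
Text key and a BytesWritable value (the path and payload are just
placeholders, not anything from the message above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SeqFileSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("demo.seq");   // hypothetical output path

        // Write a record: Text key, BytesWritable value.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, path, Text.class, BytesWritable.class);
        try {
          byte[] payload = "some binary payload".getBytes();
          writer.append(new Text("record-1"), new BytesWritable(payload));
        } finally {
          writer.close();
        }

        // Read the records back.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
          Text key = new Text();
          BytesWritable value = new BytesWritable();
          while (reader.next(key, value)) {
            System.out.println(key + " -> " + value.getLength() + " bytes");
          }
        } finally {
          reader.close();
        }
      }
    }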

You should also probably look at the TFile package that Hong is writing.
Once it is ready, it will likely be exactly what you are looking for.

> What are the
> trade-offs between record compression and block compression?

You pretty much always want block compression. The only place where
record compression is OK is if your values are web pages or some other
huge chunk of text.
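
Block compression is just an option on the writer; a short sketch,
reusing fs, conf, and path from the sketch above (DefaultCodec is only
an example codec choice):

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK,
        new org.apache.hadoop.io.compress.DefaultCodec());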

> What are
> the limits on the key and value sizes?

Large. I think I've seen keys and/or values of around 50-100 MB. They
certainly can't be bigger than 1 GB. I believe the TFile limit on keys
may be 64 KB.

> How do you use the per-file
> metadata?

It is just an application-specific string-to-string map in the header
of the file.
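
A sketch of setting and reading that map, with made-up keys and values,
again reusing fs, conf, and path from the earlier sketch:

    SequenceFile.Metadata meta = new SequenceFile.Metadata();
    meta.set(new Text("source"), new Text("crawl-2008-09"));  // arbitrary entries
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK,
        new org.apache.hadoop.io.compress.DefaultCodec(),
        null /* progress */, meta);

    // Later, read it back out of the file header.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Text source = reader.getMetadata().get(new Text("source"));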

> My intended use is to read files on a local filesystem into a
> SequenceFile, with the value of each record being the contents of each
> file. I hacked MultiFileWordCount to get the basic concept working...

You should also look at the Hadoop archives.

> but I'd appreciate any advice from the experts. In particular, what's
> the most efficient way to read data from an
> InputStreamReader/BufferedReader into a BytesWritable object?

The easiest way is the way you've done it. You probably want to use
LZO compression too.
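
One rough sketch of pulling a local file into a BytesWritable and
appending it, assuming the file fits in memory and a writer like the
one in the first sketch; it uses a raw FileInputStream so the bytes are
not run through a character decoder, and the path is a placeholder:

    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.io.FileInputStream;

    File f = new File("/some/local/file");   // hypothetical input file
    FileInputStream in = new FileInputStream(f);
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    byte[] buffer = new byte[64 * 1024];
    int read;
    try {
      // Accumulate the whole file as raw bytes.
      while ((read = in.read(buffer)) != -1) {
        bytes.write(buffer, 0, read);
      }
    } finally {
      in.close();
    }
    writer.append(new Text(f.getPath()),
                  new BytesWritable(bytes.toByteArray()));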

-- Owen
