hadoop-common-user mailing list archives

From: Andy Liu <andyliu1...@gmail.com>
Subject: Re: HDFS - millions of files in one directory?
Date: Mon, 26 Jan 2009 20:36:58 GMT
SequenceFile supports transparent block-level compression out of the box, so
you don't have to compress data in your code.

Most of the time, compression not only saves disk space but also improves
performance, because there's less data to write.
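
For what it's worth, here's a minimal sketch of what that looks like with the
SequenceFile.createWriter API (the output path, key, and record bytes below
are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class BlockCompressedWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/data/docs.seq");  // placeholder output path

    // BLOCK compression batches many records and compresses them together;
    // you just append plain keys/values and never deal with a codec yourself.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      byte[] doc = "...".getBytes();  // placeholder document bytes
      writer.append(new Text("doc-00000001"), new BytesWritable(doc));
    } finally {
      writer.close();
    }
  }
}

There's also per-record compression (CompressionType.RECORD), but block
compression usually gets a better ratio because many small records are
compressed together.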

Andy

On Mon, Jan 26, 2009 at 12:35 PM, Mark Kerzner <markkerzner@gmail.com> wrote:

> Doug,
> SequenceFile looks like a perfect candidate to use in my project, but are
> you saying that I'd be better off using uncompressed data if I'm not
> interested in saving disk space?
>
> Thank you,
> Mark
>
> On Mon, Jan 26, 2009 at 11:30 AM, Doug Cutting <cutting@apache.org> wrote:
>
> > Philip (flip) Kromer wrote:
> >
> >> Heritrix <http://en.wikipedia.org/wiki/Heritrix>,
> >> Nutch <http://en.wikipedia.org/wiki/Nutch>,
> >> and others use the ARC file format:
> >>  http://www.archive.org/web/researcher/ArcFileFormat.php
> >>  http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
> >>
> >
> > Nutch does not use ARC format but rather uses Hadoop's SequenceFile to
> > store crawled pages.  The keys of crawl content files are URLs and the
> > values are:
> >
> >
> >
> > http://lucene.apache.org/nutch/apidocs/org/apache/nutch/protocol/Content.html
> >
> > I believe that the implementation of this class pre-dates SequenceFile's
> > support for compressed values, so the values are decompressed on demand,
> > which needlessly complicates its implementation and API.  It's basically a
> > Writable that stores binary content plus headers, typically an HTTP
> > response.
> >
> > Doug
> >
>
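
As a side note, reading those crawl content files back is plain SequenceFile
iteration over URL keys and Content values, along the lines of Doug's
description above. A rough sketch, not Nutch's own tooling; the segment path
below is a placeholder, and Content's accessors can differ a bit between
Nutch versions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class DumpCrawlContent {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // placeholder path; in practice a segment's content/part-NNNNN file
    Path part = new Path("/crawl/segments/20090126/content/part-00000");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
    Text url = new Text();             // keys are URLs
    Content content = new Content();   // values are Nutch Content records
    try {
      while (reader.next(url, content)) {
        System.out.println(url + "\t" + content.getContentType()
            + "\t" + content.getContent().length + " bytes");
      }
    } finally {
      reader.close();
    }
  }
}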
