hadoop-common-user mailing list archives

From: jason hadoop <jason.had...@gmail.com>
Subject: Re: HDFS - millions of files in one directory?
Date: Mon, 26 Jan 2009 17:53:35 GMT
We like compression if the data is readily compressible and large, as it
saves on I/O time.
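
As a rough illustration, here is a minimal sketch against the 0.19-era
SequenceFile API (the class name, path, URL, and page bytes below are made
up for the example, not taken from Nutch):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PageWriter {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/data/pages.seq");   // hypothetical path

        // BLOCK compression groups many values together before
        // compressing, which is what pays off for compressible data.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, Text.class, BytesWritable.class,
            SequenceFile.CompressionType.BLOCK);
        try {
          byte[] page = "<html>...</html>".getBytes("UTF-8");
          writer.append(new Text("http://example.com/"),  // key: URL
                        new BytesWritable(page));         // value: raw page
        } finally {
          writer.close();
        }
      }
    }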


On Mon, Jan 26, 2009 at 9:35 AM, Mark Kerzner <markkerzner@gmail.com> wrote:

> Doug,
> SequenceFile looks like a perfect candidate to use in my project, but are
> you saying that I had better use uncompressed data if I am not interested
> in saving disk space?
>
> Thank you,
> Mark
>
> On Mon, Jan 26, 2009 at 11:30 AM, Doug Cutting <cutting@apache.org> wrote:
>
> > Philip (flip) Kromer wrote:
> >
> >> Heritrix <http://en.wikipedia.org/wiki/Heritrix>,
> >> Nutch<http://en.wikipedia.org/wiki/Nutch>,
> >> others use the ARC file format
> >>  http://www.archive.org/web/researcher/ArcFileFormat.php
> >>  http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
> >>
> >
> > Nutch does not use ARC format but rather uses Hadoop's SequenceFile to
> > store crawled pages.  The keys of crawl content files are URLs and the
> > values are:
> >
> >
> > http://lucene.apache.org/nutch/apidocs/org/apache/nutch/protocol/Content.html
> >
> > I believe that the implementation of this class pre-dates SequenceFile's
> > support for compressed values, so the class handles compression itself and
> > decompresses values on demand, which needlessly complicates its
> > implementation and API.  It's basically a Writable that stores binary
> > content plus headers, typically an HTTP response.
> >
> > Doug
> >
>
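
The pattern Doug describes amounts to something like the sketch below. This
is not Nutch's actual Content class (the class name and fields here are
illustrative); it simply holds binary content plus headers and, when stored
in a SequenceFile with value compression enabled, needs no compression
logic of its own:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    public class PageRecord implements Writable {
      private Text contentType = new Text();               // e.g. "text/html"
      private MapWritable headers = new MapWritable();     // e.g. HTTP headers
      private BytesWritable content = new BytesWritable(); // raw response body

      public void write(DataOutput out) throws IOException {
        contentType.write(out);
        headers.write(out);
        content.write(out);
      }

      public void readFields(DataInput in) throws IOException {
        contentType.readFields(in);
        headers.readFields(in);
        content.readFields(in);
      }
    }

Written as the value type of a SequenceFile keyed by URL (as in the Nutch
crawl files Doug mentions), the framework compresses and decompresses the
values transparently, so nothing in the class itself decompresses on demand.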
