hadoop-common-user mailing list archives

From "Andy Liu" <andyliu1...@gmail.com>
Subject Re: Using Hadoop for Record storage
Date Fri, 13 Apr 2007 14:23:29 GMT
I ran a quick benchmark comparing Hadoop's MapFile with Lucene's stored
fields.  With String keys, Hadoop was faster than Lucene, since Lucene has
to run a TermQuery before the document data can be accessed.  However,
when using Lucene's internal IDs, pulling up the data is orders of magnitude
faster than MapFile.  Looking at the code, it makes sense why: MapFile uses
a binary search on sorted keys to locate the data offsets, while Lucene's
internal IDs simply point to an offset in an index file, which in turn
points to the data offset in the .fdt file.  I'm assuming that, for
accessing random records, it just doesn't get any faster than this.
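
The difference between the two lookup paths can be sketched as follows.
This is only an in-memory illustration, not MapFile's or Lucene's actual
code: the names `sorted_keys`, `records`, `lookup_by_key` and `lookup_by_id`
are hypothetical, and the real implementations read from disk.

```python
from bisect import bisect_left

# Hypothetical stand-ins for the stored data; keys must be kept sorted.
sorted_keys = ["alpha", "bravo", "charlie", "delta"]
records = [b"alpha-data", b"bravo-data", b"charlie-data", b"delta-data"]


def lookup_by_key(key):
    """Key-based lookup: binary search over the sorted keys, O(log n)."""
    i = bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return records[i]
    return None


def lookup_by_id(doc_id):
    """ID-based lookup: the ID is the position itself, O(1)."""
    if 0 <= doc_id < len(records):
        return records[doc_id]
    return None
```

The key-based path needs a number of comparisons that grows with the
logarithm of the record count, while the ID-based path needs none, which is
consistent with the gap seen in the benchmark.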

My application doesn't require any incremental updates, so I'm considering
using Lucene's FSDirectory/IndexOutput/IndexInput to write out serialized
records in a way similar to how Lucene handles stored fields.  The only
drawback is that I'll have to look up the records using the internal IDs.
I'm looking at BDB as well, since it places no restriction on the type of
keys I can use to look up the records.  Thanks for your help.
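
A minimal sketch of such a write-once store, assuming a fixed-width offset
index alongside a data file.  The layout (8-byte offsets, 4-byte length
prefix) and the function names are illustrative assumptions, not Lucene's
actual file format or API; the streams can be any seekable binary files.

```python
import io
import struct

OFFSET = struct.Struct(">q")  # one 8-byte big-endian data offset per record
LENGTH = struct.Struct(">i")  # 4-byte length prefix in front of each record


def write_records(payloads, index_out, data_out):
    """Append each record to the data stream and its start offset to the index."""
    for payload in payloads:
        index_out.write(OFFSET.pack(data_out.tell()))
        data_out.write(LENGTH.pack(len(payload)))
        data_out.write(payload)


def read_record(record_id, index_in, data_in):
    """Fetch one record with two seeks: the index slot, then the data offset."""
    if record_id < 0:
        raise IndexError(f"no record with ID {record_id}")
    index_in.seek(record_id * OFFSET.size)
    raw = index_in.read(OFFSET.size)
    if len(raw) < OFFSET.size:
        raise IndexError(f"no record with ID {record_id}")
    (offset,) = OFFSET.unpack(raw)
    data_in.seek(offset)
    (length,) = LENGTH.unpack(data_in.read(LENGTH.size))
    return data_in.read(length)


# Usage with in-memory streams; files opened in binary mode work the same way.
index_stream, data_stream = io.BytesIO(), io.BytesIO()
write_records([b"first", b"second record", b""], index_stream, data_stream)
second = read_record(1, index_stream, data_stream)
```

Record IDs here are simply the order in which records were written, which
is why this layout only suits data that is never updated in place.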


On 4/12/07, Doug Cutting <cutting@apache.org> wrote:
> Andy Liu wrote:
> > I'm exploring the possibility of using the Hadoop records framework to
> > store these document records on disk.  Here are my questions:
> >
> > 1. Is this a good application of the Hadoop records framework, keeping
> > in mind that my goals are speed and scalability?  I'm assuming the
> > answer is yes, especially considering Nutch uses the same approach.
>
> For read-only access, performance should be decent.  However, Hadoop's
> file structures do not permit incremental updates.  Rather they are
> primarily designed for batch operations, like MapReduce outputs.  If you
> need to incrementally update your data, then you might look at something
> like BDB, a relational DB, or perhaps experiment with HBase.  (HBase is
> designed to be a much more scalable, incrementally updateable DB than
> BDB or relational DBs, but its implementation is not yet complete.)
> Doug
