hadoop-common-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Using Hadoop for Record storage
Date Fri, 13 Apr 2007 16:14:00 GMT
How big was your benchmark?  For micro-benchmarks, CPU time will 
dominate.  For random access to collections larger than memory, disk 
seeks should dominate.  If you're interested in the latter case, then 
you should benchmark this: build a database substantially larger than 
the memory on your machine, and access it randomly for a while.
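
A minimal sketch of such a benchmark, under stated assumptions: records are fixed-size and addressed directly by numeric ID (record i lives at byte offset i * recordSize), which is a simplification and not the on-disk format of either MapFile or Lucene's stored fields. The class name RandomReadBench, the method run, and the sizes in main are hypothetical. The defaults are deliberately tiny; for disk seeks to dominate, the file would have to be made substantially larger than RAM, since otherwise the operating system's page cache will serve most reads.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

public class RandomReadBench {

    /**
     * Writes `records` fixed-size records to a temporary file, then reads
     * `reads` of them at random offsets. Returns how many reads came back
     * with the expected record ID in their first 8 bytes.
     */
    static long run(int records, int recordSize, int reads) {
        if (records <= 0 || recordSize < Long.BYTES) {
            throw new IllegalArgumentException("need records > 0 and recordSize >= 8");
        }
        try {
            Path file = Files.createTempFile("random-read-bench", ".dat");
            try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
                byte[] buf = new byte[recordSize];

                // Build phase: each record starts with its own ID.
                for (long id = 0; id < records; id++) {
                    ByteBuffer.wrap(buf).putLong(id);
                    raf.write(buf);
                }

                // Access phase: seek straight to id * recordSize and read.
                Random rnd = new Random(42); // fixed seed for repeatable runs
                long verified = 0;
                long start = System.nanoTime();
                for (int i = 0; i < reads; i++) {
                    long id = rnd.nextInt(records);
                    raf.seek(id * recordSize);
                    raf.readFully(buf);
                    if (ByteBuffer.wrap(buf).getLong() == id) {
                        verified++;
                    }
                }
                long elapsed = System.nanoTime() - start;
                System.out.printf("%d reads in %.1f ms (%.1f us/read)%n",
                        reads, elapsed / 1e6, elapsed / 1e3 / reads);
                return verified;
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Small illustrative sizes; scale `records` up until the file
        // exceeds physical memory to measure seek-bound behavior.
        run(100_000, 1024, 10_000);
    }
}
```

Running the same access pattern against a MapFile, a Lucene index, or BDB built from the same data would give numbers that are comparable in the seek-bound regime.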


Andy Liu wrote:
> I ran a quick benchmark between Hadoop MapFile and Lucene's stored fields.
> Using String keys, Hadoop was faster than Lucene, since in Lucene this
> requires a TermQuery before the document data can be accessed.  However,
> using Lucene's internal IDs, pulling up the data is orders of magnitude
> faster than with MapFile.  Looking at the code, it makes sense why: MapFile
> uses a binary search on sorted keys to locate the data offsets, while
> Lucene's internal IDs simply point to an offset in an index file that points
> to the data offset in the .fdt file.  I'm assuming that, in terms of
> accessing random records, it just doesn't get any faster than this.
> My application doesn't require any incremental updates, so I'm considering
> using Lucene's FSDirectory/IndexOutput/IndexInput to write out serialized
> records in a way similar to how Lucene handles stored fields.  The only
> drawback is that I'll have to look up the records using the internal IDs.
> I'm looking at BDB as well, since there's no limitation on what type of keys
> I can use to look up the records.  Thanks for your help.
> Andy
> On 4/12/07, Doug Cutting <cutting@apache.org> wrote:
>> Andy Liu wrote:
>> > I'm exploring the possibility of using the Hadoop records framework to
>> > store these document records on disk.  Here are my questions:
>> >
>> > 1. Is this a good application of the Hadoop records framework, keeping in
>> > mind that my goals are speed and scalability?  I'm assuming the answer is
>> > yes, especially considering Nutch uses the same approach
>> For read-only access, performance should be decent.  However, Hadoop's
>> file structures do not permit incremental updates.  Rather, they are
>> primarily designed for batch operations, like MapReduce outputs.  If you
>> need to incrementally update your data, then you might look at something
>> like BDB, a relational DB, or perhaps experiment with HBase.  (HBase is
>> designed to be a much more scalable, incrementally updateable DB than
>> BDB or relational DBs, but its implementation is not yet complete.)
>> Doug
