hadoop-common-user mailing list archives

From "Andy Liu" <andyliu1...@gmail.com>
Subject Re: Using Hadoop for Record storage
Date Fri, 13 Apr 2007 17:11:18 GMT
My benchmark was for 5M records, on a 64-bit Opteron with 16 GB of memory.
The Lucene index was about 10 GB, and the Hadoop record store was a few GB
smaller.

Performing random seeks, the results were:

Hadoop records: 16.84 ms per seek
Lucene records w/ TermQuery: 37.17 ms per seek
Lucene records by Lucene document ID: 0.11 ms per seek

All 3 benchmarks were performed under the same conditions.  The 2 Lucene
benchmarks were performed on separate days, so I don't think the buffer
cache would've kept the index in memory, although I must admit that I'm
quite ignorant of how the Linux buffer cache really works.
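
For what it's worth, the MapFile side of the test was essentially a loop
like this (a simplified sketch, not my exact harness; the key format and
record count are illustrative):

    import java.util.Random;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    public class MapFileSeekBench {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        // args[0] is the MapFile directory holding the "data" and "index" files.
        MapFile.Reader reader = new MapFile.Reader(fs, args[0], conf);

        Random random = new Random();
        Text key = new Text();
        Text value = new Text();
        int seeks = 10000;

        long start = System.currentTimeMillis();
        for (int i = 0; i < seeks; i++) {
          // Assumes the records were written with zero-padded numeric keys.
          key.set(String.format("%08d", random.nextInt(5000000)));
          reader.get(key, value);  // binary search the index, then seek into data
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(((double) elapsed / seeks) + " ms per seek");
        reader.close();
      }
    }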

Andy
On 4/13/07, Doug Cutting <cutting@apache.org> wrote:

> How big was your benchmark?  For micro-benchmarks, CPU time will
> dominate.  For random access to collections larger than memory, disk
> seeks should dominate.  If you're interested in the latter case, then
> you should benchmark this: build a database substantially larger than
> the memory on your machine, and access it randomly for a while.
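>
> For example, something like this would build the test data (an untested
> sketch; the sizes and key format are arbitrary, and note that MapFile
> requires keys to be appended in sorted order):
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.io.BytesWritable;
>     import org.apache.hadoop.io.MapFile;
>     import org.apache.hadoop.io.Text;
>
>     public class BuildBigMapFile {
>       public static void main(String[] args) throws Exception {
>         Configuration conf = new Configuration();
>         FileSystem fs = FileSystem.getLocal(conf);
>         MapFile.Writer writer =
>           new MapFile.Writer(conf, fs, args[0], Text.class, BytesWritable.class);
>         BytesWritable value = new BytesWritable(new byte[1024]);  // 1 KB payload
>         // 40M x 1 KB records is ~40 GB, well past 16 GB of RAM.
>         for (int i = 0; i < 40000000; i++) {
>           // Zero-padded keys sort lexicographically in numeric order.
>           writer.append(new Text(String.format("%08d", i)), value);
>         }
>         writer.close();
>       }
>     }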
>
> Doug
>
> Andy Liu wrote:
> > I ran a quick benchmark between Hadoop MapFile and Lucene's stored
> > fields.  Using String keys, Hadoop was faster than Lucene, since in
> > Lucene this requires a TermQuery before the document data can be
> > accessed.  However, using Lucene's internal IDs, pulling up the data
> > is orders of magnitude faster than MapFile.  Looking at the code, it
> > makes sense why: MapFile uses a binary search on sorted keys to locate
> > the data offsets, while Lucene's internal IDs simply point to an
> > offset in an index file that points to the data offset in the .fdt
> > file.  I'm assuming that, in terms of accessing random records, it
> > just doesn't get any faster than this.
> >
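> > For concreteness, the two Lucene access paths look roughly like this
> > (a sketch against the Lucene 2.x API; the field name, key, and class
> > name are made up):
> >
> >     import org.apache.lucene.document.Document;
> >     import org.apache.lucene.index.IndexReader;
> >     import org.apache.lucene.index.Term;
> >     import org.apache.lucene.search.Hits;
> >     import org.apache.lucene.search.IndexSearcher;
> >     import org.apache.lucene.search.TermQuery;
> >
> >     public class LucenePaths {
> >       public static void main(String[] args) throws Exception {
> >         IndexReader reader = IndexReader.open(args[0]);
> >
> >         // Fast path: internal ID -> offset in the .fdx index file -> .fdt.
> >         Document byId = reader.document(42);
> >
> >         // Slow path: a TermQuery resolves the external key to an internal
> >         // ID first, which costs a term-dictionary lookup per record.
> >         IndexSearcher searcher = new IndexSearcher(reader);
> >         Hits hits = searcher.search(new TermQuery(new Term("key", "42")));
> >         Document byKey = hits.length() > 0 ? hits.doc(0) : null;
> >
> >         reader.close();
> >       }
> >     }
> >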
> > My application doesn't require any incremental updates, so I'm
> > considering using Lucene's FSDirectory/IndexOutput/IndexInput to write
> > out serialized records in a similar way to how Lucene handles stored
> > fields.  The only drawback is that I'll have to look up the records
> > using the internal IDs.  I'm looking at BDB as well, since there's no
> > limitation on what type of keys I can use to look up the records.
> > Thanks for your help.
> >
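> > The record store idea would look something like this (a rough sketch;
> > the file names and the fixed-width offset index are my own invention,
> > mimicking how the .fdx/.fdt pair works):
> >
> >     import org.apache.lucene.store.Directory;
> >     import org.apache.lucene.store.IndexInput;
> >     import org.apache.lucene.store.IndexOutput;
> >
> >     public class RecordStore {
> >       // Append each record to records.dat, and its start offset
> >       // (8 bytes per record) to records.idx.
> >       public static void write(Directory dir, byte[][] records)
> >           throws Exception {
> >         IndexOutput dat = dir.createOutput("records.dat");
> >         IndexOutput idx = dir.createOutput("records.idx");
> >         for (byte[] rec : records) {
> >           idx.writeLong(dat.getFilePointer());
> >           dat.writeVInt(rec.length);
> >           dat.writeBytes(rec, rec.length);
> >         }
> >         dat.close();
> >         idx.close();
> >       }
> >
> >       // Read record n: one seek in the offset index, one in the data file.
> >       public static byte[] read(Directory dir, int n) throws Exception {
> >         IndexInput idx = dir.openInput("records.idx");
> >         IndexInput dat = dir.openInput("records.dat");
> >         idx.seek(8L * n);
> >         dat.seek(idx.readLong());
> >         byte[] rec = new byte[dat.readVInt()];
> >         dat.readBytes(rec, 0, rec.length);
> >         idx.close();
> >         dat.close();
> >         return rec;
> >       }
> >     }
> >
> > (The Directory would come from FSDirectory.getDirectory(path).)
> >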
> > Andy
> >
> > On 4/12/07, Doug Cutting <cutting@apache.org> wrote:
> >>
> >> Andy Liu wrote:
> >> > I'm exploring the possibility of using the Hadoop records framework
> >> > to store these document records on disk.  Here are my questions:
> >> >
> >> > 1. Is this a good application of the Hadoop records framework,
> >> > keeping in mind that my goals are speed and scalability?  I'm
> >> > assuming the answer is yes, especially considering Nutch uses the
> >> > same approach
> >>
> >> For read-only access, performance should be decent.  However Hadoop's
> >> file structures do not permit incremental updates.  Rather they are
> >> primarily designed for batch operations, like MapReduce outputs.  If
> >> you need to incrementally update your data, then you might look at
> >> something like BDB, a relational DB, or perhaps experiment with HBase.
> >> (HBase is designed to be a much more scalable, incrementally
> >> updateable DB than BDB or relational DBs, but its implementation is
> >> not yet complete.)
> >>
> >> Doug
> >>
> >
>
