hadoop-common-user mailing list archives

From Arun C Murthy <ar...@yahoo-inc.com>
Subject Re: Using Hadoop for Record storage
Date Sun, 15 Apr 2007 19:08:22 GMT
On Fri, Apr 13, 2007 at 01:11:18PM -0400, Andy Liu wrote:
>All 3 benchmarks were performed under the same conditions.  The 2 Lucene
>benchmarks were performed on separate days, so I don't think the buffer
>cache would've kept the index in memory, although I must admit that I'm
>quite ignorant of how Linux buffer caches really work.

In my previous life, the uninformed way of achieving this (flushing the buffer cache) was to
mmap a file whose size was greater than the available RAM, write zeros to the entire file,
sync it, and exit.
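
Roughly, in Java terms, that trick looks something like the sketch below (the file name and
size are made up, and a single mapped buffer tops out at 2 GB, so the file is mapped in
1 GB chunks):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class CacheFlusher {
      public static void main(String[] args) throws Exception {
        final long SIZE = 8L * 1024 * 1024 * 1024;   // pick something larger than RAM
        final long CHUNK = 1L << 30;                 // 1 GB per mapping
        RandomAccessFile raf = new RandomAccessFile("/tmp/flush.bin", "rw");
        raf.setLength(SIZE);
        FileChannel ch = raf.getChannel();
        byte[] zeros = new byte[1 << 20];            // 1 MB block of zeros
        for (long off = 0; off < SIZE; off += CHUNK) {
          long len = Math.min(CHUNK, SIZE - off);
          MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, off, len);
          while (buf.hasRemaining()) {
            int n = Math.min(zeros.length, buf.remaining());
            buf.put(zeros, 0, n);                    // dirty every page of the file
          }
          buf.force();                               // sync this chunk to disk
        }
        ch.close();
        raf.close();
      }
    }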

However, this post piqued my curiosity, and apparently if you have a kernel newer than
2.6.16.* you could try this:
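The likely candidate is /proc/sys/vm/drop_caches (added in 2.6.16): run sync and then,
as root, echo 3 > /proc/sys/vm/drop_caches to drop the page cache, dentries and inodes,
so the next benchmark run starts cold.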


>On 4/13/07, Doug Cutting <cutting@apache.org> wrote:
>>How big was your benchmark?  For micro-benchmarks, CPU time will
>>dominate.  For random access to collections larger than memory, disk
>>seeks should dominate.  If you're interested in the latter case, then
>>you should benchmark this: build a database substantially larger than
>>the memory on your machine, and access it randomly for a while.
>>Andy Liu wrote:
>>> I ran a quick benchmark between Hadoop MapFile and Lucene's stored fields.
>>> Using String keys, Hadoop was faster than Lucene, since in Lucene this
>>> requires a TermQuery before the document data can be accessed.  However,
>>> using Lucene's internal ID's, pulling up the data is orders of magnitude
>>> faster than MapFile.  Looking at the code, it makes sense why: MapFile does
>>> a binary search on sorted keys to locate the data offsets, while Lucene's
>>> internal ID's simply point to an offset in an index file that points to the
>>> data offset in the .fdt file.  I'm assuming in terms of accessing random
>>> records, it just doesn't get any faster than this.
>>> My application doesn't require any incremental updates, so I'm
>>> using Lucene's FSDirectory/IndexOutput/IndexInput to write out the
>>> records in a similar way to how Lucene handles stored fields.  The only
>>> drawback is that I'll have to look up the records using the internal ID's.
>>> I'm looking at BDB as well, since there's no limitation on what type of
>>> keys I can use to look up the records.  Thanks for your help.
>>> Andy
>>> On 4/12/07, Doug Cutting <cutting@apache.org> wrote:
>>>> Andy Liu wrote:
>>>> > I'm exploring the possibility of using the Hadoop records framework to
>>>> > store these document records on disk.  Here are my questions:
>>>> >
>>>> > 1. Is this a good application of the Hadoop records framework, keeping in
>>>> > mind that my goals are speed and scalability?  I'm assuming the answer is
>>>> > yes, especially considering Nutch uses the same approach
>>>> For read-only access, performance should be decent.  However, Hadoop's
>>>> file structures do not permit incremental updates.  Rather, they are
>>>> primarily designed for batch operations, like MapReduce outputs.  If you
>>>> need to incrementally update your data, then you might look at something
>>>> like BDB, a relational DB, or perhaps experiment with HBase.  (HBase is
>>>> designed to be a much more scalable, incrementally updateable DB than
>>>> BDB or relational DBs, but its implementation is not yet complete.)
>>>> Doug
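
For anyone following along, here is a minimal sketch of the two lookup paths being compared
above.  The directory names, key value, document ID and stored-field name are all made up,
and the Hadoop and Lucene calls are only roughly as they stood at the time of this thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;

    public class LookupComparison {
      public static void main(String[] args) throws Exception {
        // 1. MapFile: get() binary-searches the sorted key index, then seeks
        //    into the data file -- string keys, but extra seeks per lookup.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        MapFile.Reader map = new MapFile.Reader(fs, "/tmp/records.map", conf);
        Text value = new Text();
        if (map.get(new Text("doc-0042"), value) != null) {
          System.out.println("MapFile value: " + value);
        }
        map.close();

        // 2. Lucene stored fields: the internal ID indexes straight into the
        //    .fdx file, which holds the record's offset in the .fdt file.
        IndexReader ir = IndexReader.open("/tmp/lucene-index");
        Document doc = ir.document(42);            // lookup by internal ID
        System.out.println("Lucene value: " + doc.get("body"));
        ir.close();
      }
    }

The MapFile path pays for a binary search over the sorted keys plus an extra seek into the
data file, while the Lucene path turns the internal ID directly into a file offset, which
matches the difference Andy measured.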
