hadoop-common-user mailing list archives

From: Arun C Murthy <ar...@yahoo-inc.com>
Subject: Re: Using Hadoop for Record storage
Date: Sun, 15 Apr 2007 19:08:22 GMT
On Fri, Apr 13, 2007 at 01:11:18PM -0400, Andy Liu wrote:
>
>All 3 benchmarks were performed under the same conditions.  The 2 Lucene
>benchmarks were performed on separate days, so I don't think the buffer
>cache would've kept the index in memory, although I must admit that I'm
>quite ignorant of how Linux buffer caches really work.
>

In my previous life, the uninformed way of achieving something like this was to mmap a file
whose size was greater than available RAM, write zeros to the entire file, sync it, and exit.
YMMV.
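
For what it's worth, a rough Java sketch of that trick might look like the following. The
file name and target size are placeholders, and FileChannel.map() is limited to 2 GB per
call, so the file has to be mapped in chunks:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class CacheFiller {
  public static void main(String[] args) throws Exception {
    long targetBytes = 8L * 1024 * 1024 * 1024;  // pick something larger than physical RAM
    int chunkBytes = 1 << 30;                    // map 1 GiB at a time
    try (RandomAccessFile raf = new RandomAccessFile("/tmp/cache-filler", "rw");
         FileChannel ch = raf.getChannel()) {
      for (long pos = 0; pos < targetBytes; pos += chunkBytes) {
        long len = Math.min(chunkBytes, targetBytes - pos);
        MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, pos, len);
        while (buf.hasRemaining()) {
          buf.put((byte) 0);                     // touch every page so it is pulled into the cache
        }
        buf.force();                             // write the dirty pages back to disk
      }
    }
  }
}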

However, this post piqued my curiosity, and apparently if you have a kernel newer than
2.6.16.* you could try this:
http://aplawrence.com/Linux/buffer_cache.html
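
If I remember correctly, the mechanism that page describes is the /proc/sys/vm/drop_caches
knob added in 2.6.16, so something along these lines (run as root) should drop the clean
page cache; this is just a sketch that I haven't tested here:

import java.io.FileWriter;

public class DropCaches {
  public static void main(String[] args) throws Exception {
    Runtime.getRuntime().exec("sync").waitFor();       // flush dirty pages first
    try (FileWriter w = new FileWriter("/proc/sys/vm/drop_caches")) {
      w.write("3\n");   // 1 = page cache, 2 = dentries and inodes, 3 = both
    }
  }
}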

hth,
Arun

>Andy
>On 4/13/07, Doug Cutting <cutting@apache.org> wrote:
>
>>How big was your benchmark?  For micro-benchmarks, CPU time will
>>dominate.  For random access to collections larger than memory, disk
>>seeks should dominate.  If you're interested in the latter case, then
>>you should benchmark this: build a database substantially larger than
>>the memory on your machine, and access it randomly for a while.
>>
>>Doug
>>
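
As an illustration of the kind of benchmark Doug describes, here is a minimal sketch that
times random lookups against a MapFile assumed to be much larger than RAM, so that disk
seeks rather than CPU dominate; the path, key format, and record count are made-up
placeholders:

import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class RandomReadBench {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    MapFile.Reader reader = new MapFile.Reader(fs, "/tmp/big.map", conf);

    Random rand = new Random();
    Text key = new Text();
    Text val = new Text();
    long numRecords = 50000000L;   // however many entries the map was built with
    int lookups = 100000;

    long start = System.currentTimeMillis();
    for (int i = 0; i < lookups; i++) {
      key.set("key-" + (long) (rand.nextDouble() * numRecords));
      reader.get(key, val);        // every page-cache miss costs a real disk seek
    }
    long elapsed = System.currentTimeMillis() - start;
    System.out.println((lookups * 1000.0 / elapsed) + " lookups/sec");
    reader.close();
  }
}
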
>>Andy Liu wrote:
>>> I ran a quick benchmark between Hadoop MapFile and Lucene's stored fields.
>>> Using String keys, Hadoop was faster than Lucene, since in Lucene a key
>>> lookup requires a TermQuery before the document data can be accessed.
>>> However, using Lucene's internal IDs, pulling up the data is orders of
>>> magnitude faster than MapFile.  Looking at the code, it makes sense why:
>>> MapFile uses a binary search on sorted keys to locate the data offsets,
>>> while Lucene's internal IDs simply point to an offset in an index file that
>>> points to the data offset in the .fdt file.  I'm assuming that in terms of
>>> accessing random records, it just doesn't get any faster than this.
>>>
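
To make the two access paths concrete, here is a rough sketch of what each lookup looks
like (API details are from memory of that era's Hadoop and Lucene, so treat this as
illustrative rather than exact):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

public class LookupComparison {
  public static void main(String[] args) throws Exception {
    // MapFile: key-based lookup, a binary search over the sorted key index.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    MapFile.Reader map = new MapFile.Reader(fs, "/tmp/records.map", conf);
    Text value = new Text();
    map.get(new Text("doc-42"), value);   // locate the record by key

    // Lucene: id-based lookup, a direct offset dereference via the .fdx/.fdt files.
    IndexReader idx = IndexReader.open("/tmp/lucene-index");
    Document doc = idx.document(42);      // locate the record by internal doc id

    map.close();
    idx.close();
  }
}
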
>>> My application doesn't require any incremental updates, so I'm considering
>>> using Lucene's FSDirectory/IndexOutput/IndexInput to write out serialized
>>> records in a similar way to how Lucene handles stored fields.  The only
>>> drawback is that I'll have to look up the records using the internal IDs.
>>> I'm looking at BDB as well, since there's no limitation on what type of
>>> keys I can use to look up the records.  Thanks for your help.
>>>
>>> Andy
>>>
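
For illustration, here is a sketch of the scheme Andy describes: append serialized records
with an IndexOutput, remember each record's file offset, and later seek straight to that
offset with an IndexInput, so a read is just an offset dereference. Paths and names are
placeholders, and the API is roughly that of the Lucene of the time:

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

public class RecordStoreSketch {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.getDirectory("/tmp/records", true);

    // Write: one data file plus an offset table indexed by record id.
    String[] records = { "first record", "second record" };
    long[] offsets = new long[records.length];
    IndexOutput out = dir.createOutput("records.dat");
    for (int id = 0; id < records.length; id++) {
      offsets[id] = out.getFilePointer();   // the "internal id" is just the slot index
      out.writeString(records[id]);
    }
    out.close();

    // Read: jump directly to the stored offset, much as Lucene does for stored fields.
    IndexInput in = dir.openInput("records.dat");
    in.seek(offsets[1]);
    System.out.println(in.readString());
    in.close();
  }
}
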
>>> On 4/12/07, Doug Cutting <cutting@apache.org> wrote:
>>>>
>>>> Andy Liu wrote:
>>>> > I'm exploring the possibility of using the Hadoop records framework to
>>>> > store these document records on disk.  Here are my questions:
>>>> >
>>>> > 1. Is this a good application of the Hadoop records framework, keeping
>>>> > in mind that my goals are speed and scalability?  I'm assuming the
>>>> > answer is yes, especially considering Nutch uses the same approach
>>>>
>>>> For read-only access, performance should be decent.  However, Hadoop's
>>>> file structures do not permit incremental updates; rather, they are
>>>> primarily designed for batch operations, like MapReduce outputs.  If you
>>>> need to incrementally update your data, then you might look at something
>>>> like BDB, a relational DB, or perhaps experiment with HBase.  (HBase is
>>>> designed to be a much more scalable, incrementally updateable DB than
>>>> BDB or relational DBs, but its implementation is not yet complete.)
>>>>
>>>> Doug
>>>>
>>>
>>
