hadoop-common-user mailing list archives

From Tom White <...@cloudera.com>
Subject Re: MapFile performance
Date Mon, 03 Aug 2009 08:57:21 GMT
On Mon, Aug 3, 2009 at 3:09 AM, Billy Pearson <billy_pearson@sbcglobal.net> wrote:
>
>
> Not sure if it's still there, but there was a parameter in the hadoop-site
> conf file that would let you skip x number of index entries when reading the
> index into memory.

This is io.map.index.skip (default 0), which skips this number of index
entries for every entry it loads. For example, if set to 2, one third
of the index keys will end up in memory.
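
As a rough illustration of that arithmetic, here is a simplified model. It
is not the actual MapFile.Reader code: the class and method names are made
up for the example, and which entry of each group gets kept is an
assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of loading an index with a skip setting; not Hadoop's code. */
class IndexSkipSketch {

    /** Returns the 0-based positions of the on-disk index entries kept in memory. */
    static List<Integer> keptPositions(int indexEntries, int skip) {
        List<Integer> kept = new ArrayList<>();
        // Assumption: keep one entry, then pass over the next `skip` entries.
        for (int i = 0; i < indexEntries; i += skip + 1) {
            kept.add(i);
        }
        return kept;
    }

    public static void main(String[] args) {
        // skip = 0 keeps every entry; skip = 2 keeps one entry in three.
        System.out.println(keptPositions(9, 0).size());
        System.out.println(keptPositions(9, 2).size());
    }
}
```

In this model, an index with n entries and a skip of s leaves roughly
n / (s + 1) entries in memory.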

> From what I understand, we can find the key offset just before the data,
> seek once, and read until we find the key.
>
> Billy
>
>
> ----- Original Message ----- From: "Andy Liu"
> <andyliu1227-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Newsgroups: gmane.comp.jakarta.lucene.hadoop.user
> To: <core-user-7ArZoLwFLBtd/SJB6HiN2Ni2O/JbrIOy@public.gmane.org>
> Sent: Tuesday, July 28, 2009 7:53 AM
> Subject: MapFile performance
>
>
>> I have a bunch of Map/Reduce jobs that process documents and write the
>> results out to a few MapFiles.  These MapFiles are subsequently searched in
>> an interactive application.
>>
>> One problem I'm running into is that if the values in the MapFile data file
>> are fairly large, lookups can be slow.  This is because the MapFile index
>> only stores every 128th key by default (io.map.index.interval), and after
>> the binary search it may have to scan/skip through up to 127 values (off of
>> disk) before it finds the matching record.  I've tried
>> io.map.index.interval = 1, which brings average get() times from 1200ms to
>> 200ms, but at the cost of memory during runtime, which is undesirable.
>>
>> One possible solution is to have the MapFile index store every single
>> <key, offset> pair.  Then MapFile.Reader, upon startup, would read every
>> 128th key into memory.  MapFile.Reader.get() would behave the same way,
>> except that instead of seeking through the values SequenceFile it would
>> seek through the index SequenceFile until it finds the matching record, and
>> then it can seek to the corresponding offset in the values.  I'm going off
>> the assumption that it's much faster to scan through the index (small keys)
>> than it is to scan through the values (large values).
>>
>> Or maybe the index can be some kind of disk-based btree or bdb-like
>> implementation?
>>
>> Anybody encounter this problem before?
>>
>> Andy
>>
>
>
>
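
The lookup described in the quoted messages (binary search over a sparse
in-memory index, one seek, then a forward scan through the data) can be
sketched with a toy model. This is not the MapFile implementation: the
class, its fields, and its methods are hypothetical, keys are plain longs,
and an array stands in for the on-disk data file.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a sparse-index lookup; not Hadoop's MapFile implementation. */
class SparseIndexSketch {
    final long[] keys;                                // sorted keys, standing in for the data file
    final List<Integer> indexed = new ArrayList<>();  // record positions held in the in-memory index

    SparseIndexSketch(long[] sortedKeys, int interval) {
        this.keys = sortedKeys;
        // Index every `interval`-th record, starting with the first.
        for (int i = 0; i < sortedKeys.length; i += interval) {
            indexed.add(i);
        }
    }

    /** Builds the keys 0, 1, ..., n - 1 for experiments. */
    static long[] sequentialKeys(int n) {
        long[] result = new long[n];
        for (int i = 0; i < n; i++) {
            result[i] = i;
        }
        return result;
    }

    /** Returns how many data records are read to find `key`, or -1 if it is absent. */
    int recordsReadFor(long key) {
        // Binary search for the last indexed record whose key is <= the target.
        int lo = 0, hi = indexed.size() - 1, start = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (keys[indexed.get(mid)] <= key) {
                start = indexed.get(mid);
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (start < 0) {
            return -1; // the key sorts before the first record
        }
        // One seek to `start`, then a forward scan through the data.
        int read = 0;
        for (int i = start; i < keys.length; i++) {
            read++;
            if (keys[i] == key) {
                return read;
            }
            if (keys[i] > key) {
                return -1; // passed the place where the key would be
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] keys = sequentialKeys(1000);
        System.out.println(new SparseIndexSketch(keys, 128).recordsReadFor(127));
        System.out.println(new SparseIndexSketch(keys, 1).recordsReadFor(127));
    }
}
```

With an interval of 128 the worst case in this model skips 127 records
before reading the match, while an interval of 1 reads only the match but
keeps every key in the in-memory index, which is the trade-off described
above.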
