hadoop-common-user mailing list archives

From "Billy Pearson" <billy_pear...@sbcglobal.net>
Subject Re: MapFile performance
Date Mon, 03 Aug 2009 02:09:21 GMT

Not sure if it's still there, but there was a param in the hadoop-site conf
file that would let you skip x number of index entries when reading the
index into memory.
From what I understand, we can find the key offset just before the data,
seek once, and read until we find the key.
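
To make the lookup described above concrete, here is a minimal sketch in
plain Python. It is a toy model, not Hadoop code: build_index, load_index
and get are hypothetical names, and list positions stand in for byte
offsets. The skip parameter models the setting recalled above (probably
io.map.index.skip, though that is an assumption here), which keeps only
every (skip+1)-th index entry in memory.

```python
from bisect import bisect_right

def build_index(keys, interval=128):
    """Keep every `interval`-th key with its position (a stand-in for a byte offset)."""
    return [(k, pos) for pos, k in enumerate(keys) if pos % interval == 0]

def load_index(index, skip=0):
    """Keep only every (skip+1)-th index entry in memory."""
    return index[::skip + 1]

def get(records, in_memory_index, key):
    """Return (value, records_read); value is None if the key is absent.

    `records` is a list of (key, value) pairs sorted by key.
    """
    index_keys = [k for k, _ in in_memory_index]
    # Binary search for the last indexed key <= target.
    i = bisect_right(index_keys, key) - 1
    if i < 0:
        return None, 0
    pos = in_memory_index[i][1]  # "seek once" to that position
    read = 0
    while pos < len(records):    # then read forward until the key is found
        k, v = records[pos]
        read += 1
        if k == key:
            return v, read
        if k > key:
            break
        pos += 1
    return None, read
```

With 1000 sequential integer keys and interval 128, looking up key 127
reads 128 records, while key 128 reads just one. Loading the same index
with skip=1 halves the entries held in memory and doubles the worst-case
scan.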


----- Original Message ----- 
From: "Andy Liu" <andyliu1227-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Newsgroups: gmane.comp.jakarta.lucene.hadoop.user
To: <core-user-7ArZoLwFLBtd/SJB6HiN2Ni2O/JbrIOy@public.gmane.org>
Sent: Tuesday, July 28, 2009 7:53 AM
Subject: MapFile performance

> I have a bunch of Map/Reduce jobs that process documents and write the
> results out to a few MapFiles.  These MapFiles are subsequently searched
> in an interactive application.
> One problem I'm running into is that if the values in the MapFile data
> file are fairly large, lookups can be slow.  This is because the MapFile
> index only stores every 128th key by default (io.map.index.interval), and
> after the binary search it may have to scan/skip through up to 127 values
> (from disk) before it finds the matching record.  I've tried
> io.map.index.interval = 1, which brings average get() times from 1200ms
> down to 200ms, but at the cost of memory at runtime, which is undesirable.
> One possible solution is to have the MapFile index store every single
> <key, offset> pair.  Then MapFile.Reader, upon startup, would read every
> 128th key into memory.  MapFile.Reader.get() would behave the same way,
> except that instead of seeking through the values SequenceFile it would
> seek through the index SequenceFile until it finds the matching record,
> and then seek to the corresponding offset in the values.  I'm going off
> the assumption that it's much faster to scan through the index (small
> keys) than it is to scan through the values (large values).
> Or maybe the index can be some kind of disk-based btree or bdb-like
> implementation?
> Has anybody encountered this problem before?
> Andy
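
The two-level index proposed in the quoted message can be sketched as
follows. This is a rough illustration in plain Python under stated
assumptions, not the MapFile implementation: build_dense_index, sample and
lookup are hypothetical names, the length of each value stands in for its
size on disk, and lists stand in for the index and data files.

```python
from bisect import bisect_right

def build_dense_index(records):
    """Dense on-disk index: one (key, data_offset) entry per record."""
    index, offset = [], 0
    for key, value in records:
        index.append((key, offset))
        offset += len(value)  # value length stands in for bytes on disk
    return index

def sample(dense_index, interval=128):
    """In-memory sample: every `interval`-th key plus its position in the dense index."""
    return [(dense_index[i][0], i) for i in range(0, len(dense_index), interval)]

def lookup(dense_index, in_memory, key):
    """Return (data_offset, index_entries_scanned); offset is None if the key is absent."""
    keys = [k for k, _ in in_memory]
    # Binary search the in-memory sample for the last sampled key <= target.
    i = bisect_right(keys, key) - 1
    if i < 0:
        return None, 0
    scanned = 0
    # Scan small index entries instead of large values.
    for pos in range(in_memory[i][1], len(dense_index)):
        k, data_offset = dense_index[pos]
        scanned += 1
        if k == key:
            return data_offset, scanned  # one seek into the data file from here
        if k > key:
            break
    return None, scanned
```

The scan still touches up to `interval` entries, but each entry is a key
and an offset rather than a full value, which is the assumption the
proposal rests on. Memory use stays the same as with the default sparse
index because only the sample is held in memory.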
