lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jason Rutherglen <jason.rutherg...@gmail.com>
Subject Re: Storing and loading the FST directly from disk
Date Sat, 04 Jun 2011 00:11:46 GMT
> IMO the byte[] representation is so compact
> that it doesn't really matter if you use FS Cache memory or JVM memory
> so I'd rather go for jvm memory here too.

In principle I agree, however in practice HBase users will likely have
a lot of keys per region and server.  If we stored every single key,
I'm not sure what the memory requirements would be, I'll probably run
some tests to get an idea of number of keys -> memory size.  I guess a
linear scan would still be fine, eg, the key would be constructed via
the FST (which may not be performant), and the value would be loaded
from the MMap, which would be sequential.

> here you should rather store pointers to another file and mmap that
> file. Keep your FST as lean and compact as possible and make sure its
> in memory. The compression should do a good job for you here!

Right, we'll see.

On Fri, Jun 3, 2011 at 12:00 AM, Simon Willnauer
<simon.willnauer@googlemail.com> wrote:
> I agree with Robert and Dawid that once you go past the page fault
> border you will loose performance. The problem here is that you don't
> realize it immediately. IMO the byte[] representation is so compact
> that it doesn't really matter if you use FS Cache memory or JVM memory
> so I'd rather go for jvm memory here too.
>
> it seems you are having a slightly different usecase here:
>
>> On a related note, is it OK to place lets say 1K byte[]s into the
>> Output portion of the FST?
>
> here you should rather store pointers to another file and mmap that
> file. Keep your FST as lean and compact as possible and make sure its
> in memory. The compression should do a good job for you here!
>
> simon
> On Fri, Jun 3, 2011 at 8:19 AM, Dawid Weiss
> <dawid.weiss@cs.put.poznan.pl> wrote:
>>> In any case I browsed the code a bit more, it'd be easy to test out
>>> the MMap'ing because the FST simply reads directly from a byte[],
>>> which could be converted to a ByteBuffer (that's MMap'd).  Nice!
>>
>> The FST is basically a byte[], but like Robert said -- the layout of
>> transitions and states will require nearly random access patterns all
>> around it. There are some optimizations one can make to lay out
>> initial nodes close to each other, for example (and I've done it as
>> part of another project) -- these do optimize CPU cache utilization
>> and I would assume MMAP'ped buffers as well -- but once you go past
>> page fault zone the performance will be terrible. And I assume the
>> degradation will not be smooth, but very sharp.
>>
>> This said, I'd be interested in looking at the benchmarks if you want
>> to invest some time and do it ;)
>>
>> Dawid
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message