lucene-dev mailing list archives

From Manik Surtani <>
Subject Re: A new Lucene Directory available
Date Mon, 16 Nov 2009 10:33:37 GMT
@Sanne, thanks for announcing this, good stuff!

@Earwin, note that this is a tech preview and hardly production-ready code yet.  The more
eyes that scan the code, try it out, and report bugs and bottlenecks, the better.  So thanks
for spotting ISPN-276; we look forward to more feedback/patches.  :)  As for your comments
on locking, cluster-wide syncs, performance and tuning JGroups, I agree with Sanne
that you should post your concerns on and we can talk about
it in greater depth there while keeping things relevant.


Manik Surtani
Lead, Infinispan
Lead, JBoss Cache

On 15 Nov 2009, at 16:11, Sanne Grinovero wrote:

> Hi again Earwin,
> thank you very much for spotting the byte-reading issue; it's
> definitely not what I intended.
> I never claimed an improved updates/s ratio, except perhaps
> compared to scheduled rsyncs :-)
> Our goal is to scale on queries/sec while the usage semantics stay
> unchanged, so you can open an IndexWriter as if it were local and make
> updates cluster-wide. This is very useful for clustering the many
> products already using Lucene which currently implement exotic
> index-management workarounds or shared filesystems, as they weren't
> designed for clustering from the beginning the way Solr was.
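As an aside for readers of the archive: the "open an IndexWriter as if it were local" idea can be sketched in plain Java, with a ConcurrentHashMap standing in for the replicated Infinispan cache. The class, method, and file names below are illustrative assumptions, not the actual InfinispanDirectory API:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ClusteredDirectorySketch {
    // Stand-in for the replicated Infinispan cache: file name -> file bytes.
    // In the real InfinispanDirectory this would be a distributed Cache,
    // so a put on one node becomes visible on the others via replication.
    static final Map<String, byte[]> cache = new ConcurrentHashMap<>();

    // "Node A" writes a segment file exactly as if the directory were local.
    static void writeOnNodeA(String name, String content) {
        cache.put(name, content.getBytes(StandardCharsets.UTF_8));
    }

    // "Node B" sees the update without rsync or a shared filesystem.
    static String readOnNodeB(String name) {
        return new String(cache.get(name), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        writeOnNodeA("_0.cfs", "segment data");
        System.out.println(readOnNodeB("_0.cfs")); // prints "segment data"
    }
}
```

In the real setup it is the cache replication between nodes, not a shared map, that makes node B see node A's write.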
> I mentioned JIRA; have you noticed how slow it can get on larger
> deployments? That's because there is currently no way to deploy it
> clustered (other than with Terracotta), as it relies heavily on Lucene
> and index changes need to be applied in real time.
> About locking and JGroups: please switch over to
> so you can get better answers and I
> don't have to spam the Lucene developers.
> Regards,
> Sanne
> On Sun, Nov 15, 2009 at 3:43 PM, Earwin Burrfoot <> wrote:
>>> About the RAMDirectory comparison, as you said yourself the bytes
>>> aren't read constantly but just at index reopen so I wouldn't be too
>>> worried about the "bunch of methods" as they're executed once per
>>> segment loading;
>> The bytes /are/ read constantly (the readByte() method). I believe
>> that is the innermost loop you can hope to find in Lucene.
>>> A RAMDirectory is AFAIK not recommended, as you could hit memory
>>> limits and because it's basically a synchronized HashMap;
>> On the other hand, just as I mentioned, the only access to said
>> synchronized HashMap happens when you open an InputStream on a file.
>> That, unlike readByte(), happens rarely, as InputStreams are cloned
>> after creation as needed.
>> As for memory limits, your unbounded local cache hits them with same ease.
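A minimal sketch of the access pattern being described (illustrative names, not Lucene's actual RAMDirectory internals): the synchronized map is consulted once when an input is opened, and the per-byte read loop afterwards touches only a plain array reference.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class RamDirSketch {
    // The synchronized map is only hit when an input is opened.
    static final Map<String, byte[]> files =
            Collections.synchronizedMap(new HashMap<>());

    static int mapLookups = 0; // instrumentation for the sketch

    // Mimics opening an IndexInput: one synchronized lookup, after which
    // the caller reads from a plain array with no further map access.
    static byte[] open(String name) {
        mapLookups++;
        return files.get(name);
    }

    public static void main(String[] args) {
        files.put("seg", new byte[] {1, 2, 3, 4});
        byte[] input = open("seg");        // one map access
        int sum = 0;
        for (byte b : input) sum += b;     // readByte() loop: no map access
        System.out.println(sum + " summed with " + mapLookups + " map lookup(s)");
    }
}
```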
>>> Instances of ChunkCacheKey are not created for each single byte read
>>> but for each byte[] buffer, the size of these buffers being configurable.
>> No, they are! :-)
>>, rev. 1103:
>> 120           public byte readByte() throws IOException {
>> .........
>> 132              buffer = getChunkFromPosition(cache, fileKey,
>> filePosition, bufferSize);
>> .........
>> 141           }
>> getChunkFromPosition() is called each time readByte() is invoked. It
>> creates 1-2 instances of ChunkCacheKey.
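The allocation pattern described above, and a common way to avoid it, can be sketched as follows (a simplification with hypothetical names, not the actual Infinispan code): the naive reader performs a key construction and cache lookup on every readByte(), while a buffered variant keeps the current chunk and fetches a new one only when the read position crosses a chunk boundary.

```java
import java.util.HashMap;
import java.util.Map;

public class ChunkReaderSketch {
    static final int CHUNK = 4; // tiny chunk size, for demonstration only
    static final Map<String, byte[]> chunks = new HashMap<>();
    static int keyLookups = 0;

    // Each call models "new ChunkCacheKey(...)" plus a cache lookup.
    static byte[] fetchChunk(String file, long pos) {
        keyLookups++;
        return chunks.get(file + "#" + (pos / CHUNK));
    }

    // Naive: one key/lookup per byte read, as in the readByte() quoted above.
    static int naiveSum(String file, int len) {
        int sum = 0;
        for (int pos = 0; pos < len; pos++)
            sum += fetchChunk(file, pos)[pos % CHUNK];
        return sum;
    }

    // Buffered: keep the current chunk, fetch only on a boundary crossing.
    static int bufferedSum(String file, int len) {
        int sum = 0;
        byte[] buf = null;
        for (int pos = 0; pos < len; pos++) {
            if (pos % CHUNK == 0) buf = fetchChunk(file, pos);
            sum += buf[pos % CHUNK];
        }
        return sum;
    }

    public static void main(String[] args) {
        for (int c = 0; c < 4; c++)
            chunks.put("f#" + c, new byte[] {1, 1, 1, 1});
        keyLookups = 0;
        naiveSum("f", 16);
        int naive = keyLookups;      // 16 lookups: one per byte
        keyLookups = 0;
        bufferedSum("f", 16);
        int buffered = keyLookups;   // 4 lookups: one per chunk
        System.out.println(naive + " vs " + buffered);
    }
}
```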
>>> This was decided after observing that it improved performance to
>>> "chunk" segments into smaller pieces rather than have huge byte
>>> arrays, but if you like you can configure it to approach a
>>> one-key-per-segment ratio.
>> Locally, it's better not to chunk segments (unless you hit the 2Gb
>> barrier). When shuffling them over the network - I can't say.
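For context, the "2Gb barrier" is Java's array length limit (a byte[] can hold at most Integer.MAX_VALUE elements), which is why a large unchunked segment cannot live in a single array. A quick sketch of the chunk arithmetic, with chunksNeeded as a hypothetical helper:

```java
public class ChunkMath {
    // Number of chunks needed for a file of fileSize bytes
    // (ceiling division, so a partial final chunk still counts).
    static long chunksNeeded(long fileSize, int chunkSize) {
        return (fileSize + chunkSize - 1) / chunkSize;
    }

    public static void main(String[] args) {
        long threeGb = 3L * 1024 * 1024 * 1024;
        // A single byte[] caps out below this size, so chunking is forced.
        System.out.println(threeGb > Integer.MAX_VALUE);             // true
        System.out.println(chunksNeeded(threeGb, 64 * 1024 * 1024)); // 48
    }
}
```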
>>> Comparing to a RAMDirectory is unfair, as with InfinispanDirectory I can scale
>> I'm just following two of your initial comparisons. And the only
>> characteristic that scales with such an approach is queries/s. Index
>> size - definitely not; updates/s - questionable.
>>> About JGroups I'm not technically prepared for a match, but I've heard
>>> of different stories of much bigger than 20 nodes business critical
>>> clusters working very well. Sure, it won't scale without a proper
>>> configuration at all levels: os, jgroups and infrastructure.
>> The volume of messages travelling around, the length of GC delays
>> versus cluster size, and the messaging mode all matter.
>> They used reliable synchronous multicasts, so once one node starts
>> collecting garbage, all the others wait (or worse, send retries).
>> Another one starts collecting, then another; partially delivered
>> messages hold threads - caboom!
>> How is locking handled here? With central broker it probably can work.
>> --
>> Kirill Zakharenko/Кирилл Захаренко (
>> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
>> ICQ: 104465785
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
> -- 
> Sanne Grinovero
> Sourcesense - making sense of Open Source:
