incubator-lucy-dev mailing list archives

From "David E. Wheeler" <da...@kineticode.com>
Subject Re: [lucy-dev] On Transactionality and Performance
Date Thu, 24 Mar 2011 01:01:11 GMT
On Mar 23, 2011, at 1:45 AM, Marvin Humphrey wrote:

> Making as many changes as you can within one Indexer session is more efficient.
> 
> You might think of an Indexer's lifespan as a single transaction.
> 
> Indexers require exclusive write locks on the index -- so you can't have more
> than one operating at once.

Oh. Does that cause problems during a big update? The index is then unavailable for querying,
yes? The indexing I'm doing can be rather slow (unpacking distribution packages, parsing lots
of JSON and HTML, filtering the stuff for KinoSearch and final documentation output, etc.).
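
To make the "exclusive write lock" point quoted above concrete, here is a minimal Python sketch of a lock-file scheme in which only one writer can hold the lock at a time. It is an illustration under assumptions, not Lucy's or KinoSearch's actual locking code: the `WriteLock` class, the `write.lock` file name, and `demo()` are all hypothetical, and the sketch says nothing about whether readers are blocked.

```python
import os
import tempfile


class WriteLock:
    """Toy exclusive lock backed by a lock file (hypothetical, not Lucy's code)."""

    def __init__(self, index_dir):
        # "write.lock" is an assumed name chosen for this illustration.
        self.path = os.path.join(index_dir, "write.lock")
        self.fd = None

    def acquire(self):
        try:
            # O_EXCL makes creation fail if the lock file already exists,
            # so at most one writer can succeed at a time.
            self.fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            return True
        except FileExistsError:
            return False

    def release(self):
        if self.fd is not None:
            os.close(self.fd)
            os.remove(self.path)
            self.fd = None


def demo():
    with tempfile.TemporaryDirectory() as index_dir:
        first = WriteLock(index_dir)
        second = WriteLock(index_dir)
        got_first = first.acquire()
        got_second = second.acquire()        # refused while first holds the lock
        first.release()
        got_second_later = second.acquire()  # succeeds once the lock is free
        second.release()
        return got_first, got_second, got_second_later
```

Calling `demo()` shows the second writer being refused until the first releases the lock, which is the behavior the "one Indexer at a time" rule implies for writers.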

>> * Is there a way to invalidate an IndexSearcher object when an index
>>  changes?  Or do I just need to create a new searcher for every request? 
> 
> From IndexSearcher's docs:
> 
>  IndexSearchers operate against a single point-in-time view or Snapshot of
>  the index.  If an index is modified, a new IndexSearcher must be opened to
>  access the changes.

Gotcha.
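
A toy Python sketch of the point-in-time behavior described in the quoted docs: a searcher captures the committed state when it is opened and never sees later commits, so a new searcher has to be opened to pick up changes. `ToyIndex` and `ToySearcher` are hypothetical names for illustration and are not the Lucy/KinoSearch API.

```python
class ToyIndex:
    """Holds committed documents as an immutable snapshot (illustrative only)."""

    def __init__(self):
        self._snapshot = ()  # tuple, so an existing snapshot is never mutated

    def commit(self, new_docs):
        # Publish a brand-new snapshot; snapshots already handed out are untouched.
        self._snapshot = self._snapshot + tuple(new_docs)

    def current_snapshot(self):
        return self._snapshot


class ToySearcher:
    """Searches the snapshot that was current when the searcher was opened."""

    def __init__(self, index):
        self._docs = index.current_snapshot()

    def search(self, term):
        return [doc for doc in self._docs if term in doc]


def demo():
    index = ToyIndex()
    index.commit(["lucy indexer"])
    old_searcher = ToySearcher(index)   # opened before the second commit
    index.commit(["lucy searcher"])
    new_searcher = ToySearcher(index)   # opened after the second commit
    return old_searcher.search("lucy"), new_searcher.search("lucy")
```

In `demo()`, the old searcher still returns only the first document, while the newly opened one returns both.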

> Pretty efficient: tens of milliseconds, even for big indexes with lots of sort
> caches.  
> 
> Opening an index involves some directory traversals, some slurping and parsing
> of JSON files, some mmaping.  However, we exploit the system IO cache
> aggressively and mmap files rather than read large amounts of index data into
> process RAM a la Lucene.

Okay.
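
For reference, a small Python sketch of the general mmap technique mentioned in the quote above: the file is mapped into the address space and only the requested slice is copied out, with the OS page cache backing the rest. It uses Python's standard `mmap` module and is not Lucy's implementation; `read_slice_via_mmap`, `demo()`, and the file layout are made up for illustration.

```python
import mmap
import os
import tempfile


def read_slice_via_mmap(path, offset, length):
    """Return `length` bytes at `offset` without reading the whole file into RAM."""
    with open(path, "rb") as f:
        # Length 0 maps the entire file; pages are faulted in lazily by the OS.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return bytes(view[offset:offset + length])


def demo():
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            # 6-byte header, 1,000,000 filler bytes, 6-byte footer.
            f.write(b"header" + b"x" * 1_000_000 + b"footer")
        head = read_slice_via_mmap(path, 0, 6)
        tail = read_slice_via_mmap(path, 1_000_006, 6)
        return head, tail
    finally:
        os.remove(path)
```

Only the two small slices are copied into Python objects; the filler in between is never materialized in process memory by this code.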

>> But I'm starting to suspect this isn't the best way to do it with Lucy/KinoSearch.
>> Is it better to:
>> 
>> * Update all 1,000 objects in a single transaction (one indexer, calling commit()
>>   at the end)?
> 
> That's the fastest way.

But only if there is no latency from other activities that are part of the indexing, right?
I mean, I don't think I want any long transactions, since the index isn't available for reading,
yes?

> It's not for performance per se -- it's to provide coherence across multiple
> method calls.
> 
> We don't cache results in a traditional sense.  However, searching against an
> index naturally warms the system IO cache where the index data lives.  When
> you open a new IndexSearcher, you typically benefit from what the OS has
> cached for you.

Okay. So how expensive is it, really, to create a new indexer for each distribution I index,
rather than for all those being indexed in a session? Or is there READ access for searchers
while an indexer is indexing stuff?

Thanks,

David


