lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] On Transactionality and Performance
Date Wed, 23 Mar 2011 05:45:43 GMT
On Tue, Mar 22, 2011 at 10:00:29PM -0400, David E. Wheeler wrote:
> I ended up rewriting the PGXN schema into multiple schemas after consulting
> with Graham Barr on how CPAN search works. 

> I'm pretty happy with the results so far, but have a few questions about how
> indexing transactions work.

Cool.  :)

> * Why does `commit()` invalidate an Indexer object?

Because that's how it's always worked.  :)  

It doesn't have to be that way, but it would take effort to change it.

> * Should I be making as many changes to an index as I can before calling
>   `commit()`, or can I update bits at a time using separate index objects?

Making as many changes as you can within one Indexer session is more efficient.

You might think of an Indexer's lifespan as a single transaction.

Indexers require exclusive write locks on the index -- so you can't have more
than one operating at once.
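
As a rough illustration of the "one writer at a time" rule, here is a minimal
Python sketch of an exclusive lock built on a lock file. It is a generic
technique, not a description of how Lucy implements its locks: the `WriteLock`
class and the `write.lock` file name are hypothetical.

```python
import os
import tempfile


class WriteLock:
    """Hypothetical lock-file based exclusive lock (not Lucy's code)."""

    def __init__(self, index_dir):
        self.path = os.path.join(index_dir, "write.lock")
        self.fd = None

    def acquire(self):
        try:
            # O_EXCL makes creation fail if the lock file already exists,
            # so only one holder can succeed at a time.
            self.fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            return True
        except FileExistsError:
            return False

    def release(self):
        if self.fd is not None:
            os.close(self.fd)
            os.remove(self.path)
            self.fd = None


index_dir = tempfile.mkdtemp()
first = WriteLock(index_dir)
second = WriteLock(index_dir)

got_first = first.acquire()          # first writer takes the lock
got_second = second.acquire()        # second writer is refused
first.release()                      # "commit": the session ends
got_second_after = second.acquire()  # now the second writer may proceed
second.release()
```

The second writer only succeeds once the first has released, which is why
batching changes into one session avoids repeated lock handoffs.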

> * Is there a way to invalidate an IndexSearcher object when an index
>   changes?  Or do I just need to create a new searcher for every request? 

From IndexSearcher's docs:

  IndexSearchers operate against a single point-in-time view or Snapshot of
  the index.  If an index is modified, a new IndexSearcher must be opened to
  access the changes.

>   If the latter, how efficient is the constructor?

Pretty efficient: tens of milliseconds, even for big indexes with lots of sort
caches.  

Opening an index involves some directory traversals, some slurping and parsing
of JSON files, some mmapping.  However, we exploit the system IO cache
aggressively and mmap files rather than read large amounts of index data into
process RAM a la Lucene.
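
To make the mmap point concrete, here is a small Python sketch using the
standard `mmap` module. It only demonstrates the general technique of mapping
a file and touching the bytes you need instead of reading the whole file into
process memory; the file contents are made up for the example and have
nothing to do with Lucy's actual index format.

```python
import mmap
import os
import tempfile

# Build a throwaway file standing in for a large index file (made-up data).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"header" + b"\x00" * 4096 + b"payload")

with open(path, "rb") as f:
    # Map the whole file read-only; pages are faulted in on access and
    # served from the OS cache rather than copied up front.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    size = len(mapped)
    head = mapped[:6]              # touches only the first page
    tail = mapped[size - 7:size]   # touches only the last page
    mapped.close()

os.remove(path)
```

Because the mapping is backed by the OS cache, a second process that maps the
same file can typically reuse pages that are already resident.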

> But I'm starting to suspect this isn't the best way to do it with
> Lucy/KinoSearch. Is it better to:
> 
> * Update all 1,000 objects in a single transaction (one indexer, calling
>   commit() at the end)?

That's the fastest way.

> * Always create a new IndexSearcher for new requests in order to see any
>   changes? (I found in tests I was writing that if I updated an index, an
>   existing IndexSearcher did *not* see the change -- maybe it was caching
>   results for performance?)

It's not for performance per se -- it's to provide coherence across multiple
method calls.

We don't cache results in a traditional sense.  However, searching against an
index naturally warms the system IO cache where the index data lives.  When
you open a new IndexSearcher, you typically benefit from what the OS has
cached for you.
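
The point-in-time behavior described above can be pictured with a toy model.
This Python sketch is purely illustrative: `ToyIndex` and `ToySearcher` are
hypothetical names, not Lucy classes, and the "index" is just an immutable
tuple of documents.

```python
class ToyIndex:
    """Hypothetical index: committed state is an immutable tuple."""

    def __init__(self):
        self.committed = ()

    def commit(self, new_docs):
        # Each commit publishes a new tuple; older tuples are never mutated.
        self.committed = self.committed + tuple(new_docs)


class ToySearcher:
    """Hypothetical searcher bound to the state that existed at open time."""

    def __init__(self, index):
        self.snapshot = index.committed

    def count(self):
        return len(self.snapshot)


index = ToyIndex()
index.commit(["doc1", "doc2"])

old_searcher = ToySearcher(index)   # opened before the next commit
index.commit(["doc3"])
new_searcher = ToySearcher(index)   # opened after it

old_count = old_searcher.count()
new_count = new_searcher.count()
```

The old searcher keeps answering from its snapshot, so repeated calls on it
stay consistent with each other; only a freshly opened searcher sees `doc3`.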

Marvin Humphrey

