lucene-dev mailing list archives

From "Karl Wettin (JIRA)" <>
Subject [jira] Commented: (LUCENE-550) InstantiatedIndex - faster but memory consuming index
Date Sun, 18 Mar 2007 15:49:09 GMT


Karl Wettin commented on LUCENE-550:

> Nicolas Lalevée [18/Mar/07 02:04 AM]

> This is a very interesting benchmark graph! Note that there is just one little mistake in
> it: the labels of the axes are switched.

The test is fairly crude: a set of queries of varying complexity, each iteration run against
a new IndexSearcher and IndexReader. The index is optimized at every measuring point.

> And you said that you still have a lot of gain with 250 000 documents because
> of the retrieval cost. But if I had to make the choice of having everything in memory,
> I wouldn't put the data of my own model into Lucene. I would keep it in memory
> without transforming it into stored Lucene Documents. I would transform
> it only for indexing purposes and keep just an ID in the Lucene store, which would
> help me map the search results to my own model data. This would avoid the
> transformation Lucene-Document -> MyModel-Data.

I can only agree.
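
As a rough illustration of that approach (not code from the patch), the sketch below keeps the model objects in a plain map and lets the search side return only IDs. It uses no Lucene classes: the `idsByTerm` map is a stand-in for the real index, and `Article`, `IdLookupSketch` and the method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class IdLookupSketch {
    /** Hypothetical application model object; it is never turned into a stored Document. */
    static class Article {
        final String id;
        final String title;
        Article(String id, String title) { this.id = id; this.title = title; }
    }

    // The application's own data, keyed by ID.
    final Map<String, Article> modelById = new HashMap<>();
    // Stand-in for the search index: term -> IDs of matching articles (assumption, not Lucene).
    final Map<String, Set<String>> idsByTerm = new HashMap<>();

    /** Indexes the title terms but stores only the ID on the index side. */
    void add(Article article) {
        modelById.put(article.id, article);
        for (String term : article.title.toLowerCase().split("\\s+")) {
            idsByTerm.computeIfAbsent(term, t -> new LinkedHashSet<>()).add(article.id);
        }
    }

    /** Searches for one term and maps the resulting IDs straight back to model objects. */
    List<Article> search(String term) {
        List<Article> hits = new ArrayList<>();
        Set<String> ids = idsByTerm.getOrDefault(term.toLowerCase(), Collections.<String>emptySet());
        for (String id : ids) {
            hits.add(modelById.get(id));
        }
        return hits;
    }
}
```

With a real Lucene index the lookup step would be the same: read the stored ID field from each hit and fetch the object from the map.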

> (after relooking at the UML diagram): unless you allow putting POJO objects in a Document

That is the hypothesis. I've actually been a bit baffled by the results I've seen over the last
few days while benchmarking.

The application this was originally built for (the one with 250 000 documents) is fairly busy:
on average one query every 10 ms, 24/7, peaking at one every 2 ms. On the single-machine setup
with 4 GB and Solaris, the CPU went from 90% busy to 90% idle when switching from RAMDirectory
to InstantiatedIndex. At this point I cannot say whether this is due to bad use of Lucene and
compensating for that with a crazy solution. But I don't think so. I think I've missed a bunch
of benchmark factors.

Since that project, and that was some time ago, I have not implemented any applications with
a "normal" corpus using InstantiatedIndex. 

It is the backbone of the active cache (also available in this patch). I'm sure people have made
similar things with MemoryIndex. For each batch of new documents inserted, I apply the cached
queries to the batch index to detect whether the new data would affect the results associated with
each cached query. (The cache does other active things too.)
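
The batch check can be sketched roughly as follows. This is a simplified illustration, not the code in the patch: cached queries are reduced to single lowercase terms, the "batch index" is just the set of terms in the new documents, and `ActiveCacheSketch` and `affectedBy` are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ActiveCacheSketch {
    // Cached query term -> IDs it matched when it was last run against the main index.
    final Map<String, Set<String>> cachedResults = new HashMap<>();

    /**
     * Runs every cached query against the new batch only and returns the
     * query terms whose cached results the batch would change.
     * The batch maps document ID -> document text.
     */
    Set<String> affectedBy(Map<String, String> batch) {
        // Build the small per-batch "index": here simply the set of terms in the batch.
        Set<String> batchTerms = new HashSet<>();
        for (String text : batch.values()) {
            batchTerms.addAll(Arrays.asList(text.toLowerCase().split("\\s+")));
        }
        // A cached query is affected if it matches anything in the batch.
        Set<String> affected = new TreeSet<>();
        for (String queryTerm : cachedResults.keySet()) {
            if (batchTerms.contains(queryTerm)) {
                affected.add(queryTerm);
            }
        }
        return affected;
    }
}
```

The point of the design is that only the small batch is searched, so the cost of the check depends on the batch size and the number of cached queries, not on the size of the main index.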

In the didyoumean issue I use InstantiatedIndex as a speedy a priori index: a small index
of feature-selected text (common user queries known to be correct, very common phrases in
document titles, etc.) that is used to build ngrams for token suggestions, build phrase suggestions,
rearrange term order in phrases, etc. As these documents are very small (a short phrase each),
it is some 10x-20x faster than a RAMDirectory at 50 000 documents.
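
To make the ngram step concrete, here is a minimal sketch of token suggestion by shared character ngrams. It assumes character ngrams and a simple "most shared ngrams wins" score; the actual didyoumean code may work differently, and `NgramSuggestSketch`, `ngrams` and `suggest` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NgramSuggestSketch {
    /** Character ngrams of a token, in order of appearance. */
    static List<String> ngrams(String token, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= token.length(); i++) {
            grams.add(token.substring(i, i + n));
        }
        return grams;
    }

    /**
     * Returns the known-good token that shares the most distinct ngrams
     * with the input, or null if no known token shares any.
     */
    static String suggest(String input, Collection<String> knownTokens, int n) {
        Set<String> inputGrams = new HashSet<>(ngrams(input, n));
        String best = null;
        int bestShared = 0;
        for (String candidate : knownTokens) {
            Set<String> shared = new HashSet<>(ngrams(candidate, n));
            shared.retainAll(inputGrams);
            if (shared.size() > bestShared) {
                bestShared = shared.size();
                best = candidate;
            }
        }
        return best;
    }
}
```

In an index-backed version the ngrams of each known-good token would be indexed as terms, and the suggestion would be the top hit for a query built from the ngrams of the input.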

> InstantiatedIndex - faster but memory consuming index
> -----------------------------------------------------
>                 Key: LUCENE-550
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Store
>    Affects Versions: 2.0.0
>            Reporter: Karl Wettin
>         Assigned To: Karl Wettin
>         Attachments: HitCollectionBench.jpg, lucene-550.jpg,, trunk.diff.bz2,
trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2,
trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2
> A non-file-centric, all-in-memory index. Consumes some 2x the memory of a RAMDirectory
(in a term-saturated index) but is between 3x and 60x faster depending on application and how one
counts. The average query is about 8x faster. IndexWriter and IndexModifier have been realized
in InterfaceIndexWriter and InterfaceIndexModifier.
> InstantiatedIndex is wrapped in a new top-layer index facade (class Index) that comes
with factory methods for writers, readers and searchers for unified index handling. There
are decorators with notification handling that can be used for automatically synchronizing
searchers on updates, etc.
> Index also comes with FS/RAMDirectory implementations.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
