lucene-java-user mailing list archives

From David Causse <>
Subject Re: InstantiatedIndex questions
Date Thu, 08 Oct 2009 13:04:46 GMT
On Tue, Oct 06, 2009 at 07:51:44PM +0200, Karl Wettin wrote:
> On 6 Oct 2009 at 18:54, David Causse wrote:
> David, your timing couldn't be better. Just the other day I proposed
> that we deprecate InstantiatedIndexWriter. The sum of the reasons for
> this is that I'm a bit lazy. Your mail makes me reconsider.

Well, so you intended to make InstantiatedIndex impossible to use from
scratch. What is very nice about the current implementation is that it
conforms to the "normal" Lucene usage: index with IW and query with IR.
It makes it very easy to adopt your implementation with legacy
If it is immutable, it's like String: I have to deal with
StringBuffer/StringBuilder. That's nice in some ways but it has some
drawbacks, which is maybe why all the efficient internal analysis APIs in
Lucene permit the use of char[].

So in the Lucene world with II, RAMDirectory will be my StringBuilder,
and I'll have to wrap InstantiatedIndex to use a RAMDirectory as a
buffer. Like this:

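public class IIWrapper {
	// RAMDirectory-backed writer used as the mutable buffer
	private final IndexWriter buffer;
	// false whenever the buffer has changed since the reader was built
	private boolean readerIsValid = false;
	private IndexReader reader;

	public IIWrapper(Analyzer a) throws IOException {
		buffer = new IndexWriter(new RAMDirectory(), a, MaxFieldLength.UNLIMITED);
	}

	public void /* writeLock */ addDocument(Document doc) throws IOException {
		buffer.addDocument(doc);
		readerIsValid = false;
	}

	public IndexReader /* readLock */ getReader() throws IOException {
		if (!readerIsValid) {
			/* writeLock */
			// rebuild the InstantiatedIndex from the current buffer contents
			reader = new InstantiatedIndex(buffer.getReader()).indexReaderFactory();
			readerIsValid = true;
		}
		return reader;
	}
}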
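
The locking hinted at by the writeLock/readLock comments can be sketched
without Lucene. The following is a minimal, stdlib-only analogue, not the
Lucene API: the class LazySnapshot and all of its members are hypothetical
names, an ArrayList stands in for the RAMDirectory buffer, and an
unmodifiable copy stands in for the InstantiatedIndex reader. It assumes
that rebuilding on the first read after a write is acceptable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical analogue: a mutable buffer plus a lazily rebuilt immutable snapshot. */
final class LazySnapshot<T> {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Stands in for the RAMDirectory/IndexWriter buffer.
    private final List<T> buffer = new ArrayList<T>();
    // Stands in for the reader built from the InstantiatedIndex.
    private List<T> snapshot = Collections.emptyList();
    private boolean snapshotIsValid = true;
    private int rebuildCount = 0;

    /** Adds to the buffer under the write lock and invalidates the snapshot. */
    public void add(T item) {
        lock.writeLock().lock();
        try {
            buffer.add(item);
            snapshotIsValid = false;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Returns the cached snapshot, rebuilding it first if the buffer has changed. */
    public List<T> getSnapshot() {
        lock.readLock().lock();
        try {
            if (snapshotIsValid) {
                return snapshot;
            }
        } finally {
            lock.readLock().unlock();
        }
        // ReentrantReadWriteLock cannot upgrade a read lock to a write lock,
        // so the read lock is released first and the flag is re-checked.
        lock.writeLock().lock();
        try {
            if (!snapshotIsValid) {
                snapshot = Collections.unmodifiableList(new ArrayList<T>(buffer));
                snapshotIsValid = true;
                rebuildCount++;
            }
            return snapshot;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Number of times the snapshot has been rebuilt so far. */
    public int getRebuildCount() {
        lock.readLock().lock();
        try {
            return rebuildCount;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

The re-check under the write lock matters: two threads can both see an
invalid snapshot, and only the first one to get the write lock should pay
for the rebuild.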

So I'll get the best indexing time, but the first query will pay for the
IIR creation. That is maybe better than using IIW. But IMHO (from a user's
point of view) it's very convenient to have a choice of multiple stores in
Lucene, and it's pretty cool to use them in the same way. If you take a look
at MemoryIndex, the description is very attractive but it's too far (again
IMHO) from the Lucene API.

> > On the index time InstantiatedIndex is behind RAMDirectory, but the time
> Would you mind benchmarking some for me using your corpora? The issue  
> suggests that people use the InstantiatedIndex(IndexReader) constructor 
> to create the index rather than using InstantiatedIndexWriter. Is it way 
> slower for you to produce the index using RAMDirectory/IndexWriter and 
> pass an IndexReader to InstantiatedIndex?
> This is what the package level javadocs says about  
> InstantiatedIndexWriter:
> "Hardly any effort has been put in to optimizing the  
> InstantiatedIndexWriter, only minimizing the amount of time needed to  
> write-lock the index has been considered."
> I'm sure there are ways to speed it up, I just never managed to find the
> time to look into it. I never really used IIW.

We've processed a huge number of documents and didn't see any problems.
But we don't use all the possibilities Lucene stores have to offer.

> It might be worth mentioning that when InstantiatedIndex#commit returns
> it has yielded an optimized "single segment" index. This is not quite how
> a Directory/IndexWriter acts.

When I have some time I'll try to benchmark our system without IIW.

> > gained over queries make it better (for what I see it can be 2 times
> > faster).
> >
> > InstantiatedIndex will be our default volatile mini index store for our
> > next production release.
> Very cool!!
> > We would have other uses for this index, but the lack of addIndexes
> > support makes it impossible for us to use it in other situations. So we
> > continue to use RAMDirectory in such situations.
> Have you considered using multiple InstantiatedIndex and a MultiReader?
> That would pretty much be the same thing, just that the store wouldn't be
> quite as optimized. It would definitely use more RAM than if it was the
> same index. You could of course also pass this MultiReader to a new
> InstantiatedIndex. I have no real clue about the difference in speed and
> RAM consumption between these solutions so you should benchmark all
> solutions.

Well, in fact with 2.9 there is an awesome number of solutions for us. We
might try the new getReader() on IW. We haven't used MultiReader, but it
seems to be a very convenient solution too. I have to take time to think
about it.

Just a side note for other users who may think about using
InstantiatedIndex: we've seen that our process that uses II can be, as I
said, 2 times faster. Keep in mind that this is exactly as if I had said,
"Hey, I switched from MySQL to PostgreSQL and my app is 2 times faster".

> > Do you think we could reach RAMDirectory index time by tweaking some
> > initialCap stuff inside java.util.Collections you use?
> Maybe. But I think it would be a relatively small gain. But don't take
> my word for it, benchmark it.
> Using the InstantiatedIndex(IndexReader) constructor will create rather
> optimal size of the collections.
> As for InstantiatedIndexWriter I think it's pretty much only the
> transient collections in #commit that will help you, my guess is that
> you should experiment with the dirtyTerms and termsByText attributes.
> Count the number of terms in your complete index and see how much it
> speeds things up by creating the collections with this size from the
> start.
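
The pre-sizing idea above can be sketched with plain java.util classes.
This is only an illustration under stated assumptions: it assumes the
collection is a java.util.HashMap with the default 0.75 load factor, which
may not match what InstantiatedIndexWriter actually uses internally, and
the class and method names (PresizedCollections, capacityFor, newMapFor)
are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: size a HashMap up front from a known entry count. */
final class PresizedCollections {
    private PresizedCollections() {}

    /**
     * Initial capacity that lets a HashMap with the default 0.75 load
     * factor hold expectedEntries entries without resizing.
     */
    static int capacityFor(int expectedEntries) {
        return (int) Math.ceil(expectedEntries / 0.75);
    }

    /** Creates an empty map sized for expectedEntries entries. */
    static <K, V> Map<K, V> newMapFor(int expectedEntries) {
        return new HashMap<K, V>(capacityFor(expectedEntries));
    }
}
```

With the term count of the complete index in hand, a map such as
termsByText could then be created as
PresizedCollections.<String, Object>newMapFor(termCount), and the
indexing run benchmarked against the default-sized version.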

Well, I had a quick look at IIW and your collection sizes are already near our

> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

David Causse
