jackrabbit-users mailing list archives

From Nelson Takashi Omori <nelson.om...@murah.com.br>
Subject Rebuilding index
Date Thu, 22 Nov 2012 18:53:50 GMT
Hi All,

I'm using Jackrabbit 2.4.3, and my repository has approximately 110
thousand nodes. About 10 thousand of these nodes have binary values
whose content needs to be extracted with Tika and indexed in Lucene.

I decided to delete the index so that Jackrabbit would create it again.
The problem is how long this operation takes. I waited 3 hours and the
repository still hadn't initialized (I don't know exactly how long the
initialization takes to complete, because I stopped the process). With
Tika's text extraction disabled, it took 5 minutes, so I concluded that
the problem is the time Tika takes to extract text from the 10 thousand
documents.

If the index becomes inconsistent and I have to rebuild it, my client
doesn't want to wait more than 3 hours before using the system. So I'm
planning to create a subclass of
org.apache.jackrabbit.core.query.lucene.SearchIndex and modify how the
index is re-created. To give my client fast access to the repository,
I'll first skip the text extraction and build the index from the normal
properties only. With that index in place, my client can access the
repository and do many things using only the normal properties. Then, in
the background, I'll run the text extraction for each document and
update its Lucene document with the extracted text.
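
To make the two-phase idea concrete, here is a minimal sketch in plain
Java. It deliberately uses no Jackrabbit, Lucene, or Tika classes: the
class TwoPhaseIndexer, its methods, and the in-memory map standing in
for the index are all hypothetical names, and the extractor is passed in
as a function where a real setup would call Tika. It only illustrates
the ordering (cheap properties first, slow extraction later), not how
SearchIndex would actually have to be subclassed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Hypothetical two-phase indexer; none of these names are Jackrabbit APIs. */
public class TwoPhaseIndexer {

    /** Stand-in for an index document: property fields plus optional full text. */
    private static final class Entry {
        final Map<String, String> properties;
        volatile String fullText; // stays null until background extraction finishes

        Entry(Map<String, String> properties) {
            this.properties = properties;
        }
    }

    // In-memory stand-in for the Lucene index (assumption for illustration).
    private final Map<String, Entry> index = new ConcurrentHashMap<>();

    // Small pool so extraction never blocks the thread that starts the repository.
    private final ExecutorService extractorPool = Executors.newFixedThreadPool(2);

    /** Phase 1: index only the cheap properties so the repository can start quickly. */
    public void indexProperties(String nodeId, Map<String, String> properties) {
        index.put(nodeId, new Entry(properties));
    }

    /** Phase 2: queue the slow extraction and update the entry when it completes. */
    public void scheduleExtraction(String nodeId, byte[] binary,
                                   Function<byte[], String> extractor) {
        extractorPool.submit(() -> {
            String text = extractor.apply(binary); // a real setup would call Tika here
            Entry entry = index.get(nodeId);
            if (entry != null) {
                entry.fullText = text;
            }
        });
    }

    /** Stops accepting new work and waits for queued extractions; false on timeout. */
    public boolean awaitExtraction(long timeout, TimeUnit unit) {
        extractorPool.shutdown();
        try {
            return extractorPool.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Returns a property value, available as soon as phase 1 has run. */
    public String propertyOf(String nodeId, String name) {
        Entry entry = index.get(nodeId);
        return entry == null ? null : entry.properties.get(name);
    }

    /** Returns the extracted text, or null if extraction has not finished yet. */
    public String fullTextOf(String nodeId) {
        Entry entry = index.get(nodeId);
        return entry == null ? null : entry.fullText;
    }
}
```

In this sketch, property lookups work immediately after
indexProperties, while fullTextOf returns null for a node until its
background task has run. A failed extraction is silently dropped here; a
real implementation would need to log and retry it.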

I have some questions about this plan.
1) Reading the source code, I see that Jackrabbit uses
LazyTextExtractorField (and other classes) to run the extraction in a
separate thread. Doesn't that do exactly what I want? Even so, I waited
3 hours and the repository wasn't initialized and ready to use. Is that
normal?
2) Is what I'm planning the best approach? Has anybody done something
similar?