jackrabbit-dev mailing list archives

From Marcel Reutegger <marcel.reuteg...@gmx.net>
Subject Re: Is doc addition / indexing synchronous or asynchronous?
Date Wed, 05 Apr 2006 17:23:20 GMT
thomasg wrote:
> I am wondering how Jackrabbit handles document additions and indexing. I am
> adding documents (.txt, .doc, .pdf) to a repository using nt:resource nodes.

it's complicated ;)

> I assume that calling session.save() causes new documents to be added to the
> data storage and then indexed by Lucene.

yes, in general that's correct. however the jackrabbit index does some 
buffering for added/modified/deleted items. changes to the index are not 
persisted immediately on save().
some preparations are done already on save:
- create the lucene document that is later added to the index
- run text filter if node is a jcr resource

the size of the buffer can be configured using the 'bufferSize' 
parameter. See also [1].

there are three conditions that cause the buffer to be flushed:
1) buffer is full
2) a query is executed
3) volatileIdleTime is reached, see also [1]

in case of 1) the buffer is flushed by the thread that is currently 
modifying the index. that means it is possible that a session.save() 
that just stores a single property will have to write many more index 
changes than just that single property. similarly, in case of 2), if 
there are items in the buffer and a thread executes a query, that thread 
will first have to push all the pending changes to the index before it 
can execute the query.
in case of 3) there is a background thread that commits the pending 
changes to the index.
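
To make the three triggers concrete, here is a minimal Java sketch of a buffered index. It is only a model of the behaviour described above, not Jackrabbit code: the class name, its methods, and the in-memory lists standing in for the Lucene index are all hypothetical, and "buffer is full" is assumed to mean "pending changes have reached bufferSize".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical model only; none of these names are Jackrabbit classes.
class BufferedIndexSketch {
    private final int bufferSize;   // plays the role of 'bufferSize'
    private final long idleMillis;  // plays the role of 'volatileIdleTime'
    private final List<String> buffer = new ArrayList<>();    // pending changes
    private final List<String> persisted = new ArrayList<>(); // stands in for the index
    private long lastChange = System.currentTimeMillis();
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "index-idle-flush");
                t.setDaemon(true); // do not keep the JVM alive
                return t;
            });

    BufferedIndexSketch(int bufferSize, long idleMillis) {
        this.bufferSize = bufferSize;
        this.idleMillis = idleMillis;
        // 3) a background thread flushes once the buffer has been idle long enough
        timer.scheduleWithFixedDelay(this::flushIfIdle,
                idleMillis, idleMillis, TimeUnit.MILLISECONDS);
    }

    // Called on save(): the change is buffered, not persisted.
    synchronized void add(String doc) {
        buffer.add(doc);
        lastChange = System.currentTimeMillis();
        if (buffer.size() >= bufferSize) {
            flush(); // 1) buffer full: the modifying thread pays for the flush
        }
    }

    synchronized List<String> query(String term) {
        flush(); // 2) the querying thread pushes pending changes first
        List<String> hits = new ArrayList<>();
        for (String doc : persisted) {
            if (doc.contains(term)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    private synchronized void flushIfIdle() {
        boolean idle = System.currentTimeMillis() - lastChange >= idleMillis;
        if (idle && !buffer.isEmpty()) {
            flush();
        }
    }

    private synchronized void flush() {
        persisted.addAll(buffer);
        buffer.clear();
    }

    synchronized int pending() {
        return buffer.size();
    }

    void close() {
        timer.shutdownNow();
    }
}
```

With a buffer size of 3, the third add() is the one that pays for flushing all three changes, which is the effect described for a save() that only stores a single property.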

> Is addition / indexing synchronous
> or asynchronous, i.e, does addition and indexing have to complete before the
> method returns, or are these tasks handed to another thread?

it's a mix of both. see above.

> When debugging
> tests I seem to get quite long delays on session.save() with large documents
> (around 1 min for 50MB). Also can the synch / asynch behaviour be modified?

This is because of the text filtering on jcr resources. the filter 
implementations all read the full binary and provide a character stream 
on TextFilter.doFilter(). For larger documents this is not ideal. a text 
filter implementation should rather provide a text representation on a 
lazy basis, e.g. when the stream is actually consumed.
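
As a rough illustration of the lazy idea, the sketch below wraps the expensive extraction in a Reader that only runs it on the first read() call. The class name and the Supplier-based extractor are hypothetical and are not part of the TextFilter interface. It only defers the work; it still extracts the whole text in one go once reading starts.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Supplier;

// Hypothetical lazy reader: extraction runs on first read, not on construction.
class LazyTextReader extends Reader {
    private final Supplier<String> extractor; // stands in for the expensive filter work
    private Reader delegate;                  // created only when the stream is consumed

    LazyTextReader(Supplier<String> extractor) {
        this.extractor = extractor;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (delegate == null) {
            // first consumer triggers the extraction
            delegate = new StringReader(extractor.get());
        }
        return delegate.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        if (delegate != null) {
            delegate.close();
        }
    }

    // True once the extractor has actually been run.
    boolean extracted() {
        return delegate != null;
    }
}
```

A filter that returned such a reader would return from doFilter() quickly, and the cost would move to whoever consumes the stream.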

We already had a discussion about this in the past [3], but with proper 
transaction support for versioning that issue kind of lost its severity.

The pdfbox library (used by the pdf text filter) is also known to be 
very slow. Does anyone know of an alternative open source library?

In the long term, it might make sense to implement truly async 
indexing for resources.
e.g. use the current indexing behaviour but exclude text filtering, and 
then let a background thread do the filtering and update the index 
whenever it is done. But then the downside is that you do not have a 
guarantee anymore when the document is searchable.
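
A minimal sketch of that idea, assuming a hypothetical index class with two in-memory maps (nothing here exists in Jackrabbit): the properties are indexed on the calling thread, while the text extraction is handed to a background thread and added when it completes.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical model of async fulltext indexing; not Jackrabbit code.
class AsyncTextIndexSketch {
    private final Map<String, String> properties = new ConcurrentHashMap<>();
    private final Map<String, String> fulltext = new ConcurrentHashMap<>();

    // Indexes the properties synchronously and the extracted text asynchronously.
    CompletableFuture<Void> save(String nodeId, String props,
                                 Supplier<String> slowExtraction) {
        properties.put(nodeId, props); // available as soon as save returns
        return CompletableFuture
                .supplyAsync(slowExtraction)                     // filtering off-thread
                .thenAccept(text -> fulltext.put(nodeId, text)); // index update when done
    }

    boolean isIndexed(String nodeId) {
        return properties.containsKey(nodeId);
    }

    boolean isFulltextSearchable(String nodeId) {
        return fulltext.containsKey(nodeId);
    }
}
```

Between save() returning and the future completing, isIndexed() is true but isFulltextSearchable() is false, which is exactly the missing guarantee mentioned above.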

> Any enlightenment in this area would be appreciated, as would any pointers
> to useful documentation in these areas. 

the currently available documentation is in the sample configuration 
file [1] and on the jackrabbit website [2] (however that page is a bit 
off-topic).


regards
  marcel


[1] 
http://svn.apache.org/repos/asf/jackrabbit/trunk/jackrabbit/src/main/config/repository.xml
[2] http://jackrabbit.apache.org/doc/arch/operate/query.html
[3] http://issues.apache.org/jira/browse/JCR-264
