thomasg wrote:
> I am wondering how Jackrabbit handles document additions and indexing. I am
> adding documents (.txt, .doc, .pdf) to a repository using nt:resource nodes.
it's complicated ;)
> I assume that calling session.save() causes new documents to be added to the
> data storage and then indexed by Lucene.
yes, in general that's correct. however the jackrabbit index does some
buffering for added/modified/deleted items. changes to the index are not
persisted immediately on save().
some preparations are done already on save:
- create the lucene document that is later added to the index
- run text filter if node is a jcr resource
the size of the buffer can be configured using the 'bufferSize'
parameter. See also [1].
there are three conditions that cause the buffer to be flushed:
1) buffer is full
2) a query is executed
3) volatileIdleTime is reached, see also [1]
in case of 1) the buffer is flushed by the thread that is currently
modifying the index. that means it is possible that a session.save()
that just stores a single property will have to write many more index
changes than just the single property. similarly, in case of 2), if
there are items in the buffer and a thread executes a query, that thread
will first have to push all the pending changes to the index before it
can execute the query.
in case of 3) there is a background thread that commits the pending
changes to the index.
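
To make the three flush conditions above more concrete, here is a toy
model in Java. It is only a sketch under assumptions, not Jackrabbit's
actual code: the class BufferedIndexSketch and all of its methods are
hypothetical, the "index" is a plain list, and time is passed in
explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a buffered index. Hypothetical; NOT Jackrabbit's implementation. */
class BufferedIndexSketch {
    private final int bufferSize;      // stand-in for the 'bufferSize' parameter
    private final long idleMillis;     // stand-in for 'volatileIdleTime'
    private final List<String> pending = new ArrayList<>();
    private final List<String> persisted = new ArrayList<>();
    private long lastChange;

    BufferedIndexSketch(int bufferSize, long idleMillis) {
        this.bufferSize = bufferSize;
        this.idleMillis = idleMillis;
    }

    /** Models save(): the saving thread pays for the flush once the buffer is full. */
    synchronized void add(String doc, long now) {
        pending.add(doc);
        lastChange = now;
        if (pending.size() >= bufferSize) {
            flush();                   // condition 1: buffer is full
        }
    }

    /** Models a query: the querying thread pushes pending changes first. */
    synchronized List<String> query(String term) {
        flush();                       // condition 2: a query is executed
        List<String> hits = new ArrayList<>();
        for (String doc : persisted) {
            if (doc.contains(term)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    /** Models the periodic check done by a background thread. */
    synchronized void checkIdle(long now) {
        if (!pending.isEmpty() && now - lastChange >= idleMillis) {
            flush();                   // condition 3: idle time is reached
        }
    }

    synchronized int pendingCount() { return pending.size(); }

    synchronized int persistedCount() { return persisted.size(); }

    private void flush() {
        persisted.addAll(pending);
        pending.clear();
    }
}
```

With a buffer size of 3, the third add() is the call that pays for
flushing all three entries, which is the effect described for case 1).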
> Is addition / indexing synchronous
> or asynchronous, i.e, does addition and indexing have to complete before the
> method returns, or are these tasks handed to another thread?
it's a mix of both. see above.
> When debugging
> tests I seem to get quite long delays on session.save() with large documents
> (around 1 min for 50MB). Also can the synch / asynch behaviour be modified?
This is because of the text filtering on jcr resources. the filter
implementations all read the full binary and provide a character stream
on TextFilter.doFilter(). For larger documents this is not ideal. a text
filter implementation should rather provide a text representation on a
lazy basis, e.g. when the stream is actually consumed.
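
A minimal sketch of what "lazy" could mean here, written against a
modern Java version: a Reader that only runs the expensive extraction
when somebody first reads from it. The class LazyTextReader is
hypothetical and deliberately does not touch the TextFilter interface;
how it would be wired into a real filter is left open.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Supplier;

/** Hypothetical reader that defers text extraction until the first read. */
class LazyTextReader extends Reader {
    private final Supplier<Reader> extractor;  // the expensive part, e.g. parsing a pdf
    private Reader delegate;                   // created on first read only

    LazyTextReader(Supplier<Reader> extractor) {
        this.extractor = extractor;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (delegate == null) {
            delegate = extractor.get();        // extraction happens here, not in the constructor
        }
        return delegate.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        if (delegate != null) {
            delegate.close();                  // nothing to close if it was never consumed
        }
    }

    public static void main(String[] args) throws IOException {
        Reader r = new LazyTextReader(() -> new StringReader("extracted text"));
        char[] buf = new char[9];
        int n = r.read(buf, 0, buf.length);    // first read triggers the extraction
        System.out.println(new String(buf, 0, n));
        r.close();
    }
}
```

Creating such a reader is cheap, so the cost of extraction moves from
the point where the reader is handed out to the point where the stream
is actually consumed.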
We already had a discussion about this in the past [3], but with proper
transaction support for versioning that issue lost much of its severity.
The pdfbox library (used by the pdf text filter) is also known to be
very slow. Does anyone know an alternative open source library?
In the long term, it might make sense to implement truly asynchronous
indexing for resources.
e.g. use the current indexing behaviour but exclude text filtering, and
then let a background thread do the filtering and update the index
whenever it is done. But then the downside is that you do not have a
guarantee anymore when the document is searchable.
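
Roughly, the idea could look like the following sketch. Everything in it
is an assumption for illustration: AsyncTextIndexSketch is a
hypothetical class, the "index" is just a map from node id to extracted
text, and the extractor function stands in for a text filter.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical two-phase indexing: node first, extracted text later. */
class AsyncTextIndexSketch {
    private final Map<String, String> index = new ConcurrentHashMap<>();
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /**
     * Registers the node without text right away, then hands the
     * expensive extraction to a background thread.
     */
    Future<?> save(String nodeId, byte[] data, Function<byte[], String> extractor) {
        index.put(nodeId, "");                       // node is known, full text is not yet
        return worker.submit(() -> {
            index.put(nodeId, extractor.apply(data)); // update once filtering is done
        });
    }

    /** True only after the background extraction for this node has finished. */
    boolean matches(String nodeId, String term) {
        String text = index.get(nodeId);
        return text != null && text.contains(term);
    }

    void shutdown() {
        worker.shutdown();
    }
}
```

Here save() returns quickly, but matches() may still return false for a
while afterwards. The returned Future is the only way for the caller to
find out when the text has become searchable, which is the downside
mentioned above.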
> Any enlightenment in this area would be appreciated, as would any pointers
> to useful documentation in these areas.
the currently available documentation is in the sample configuration
file [1] and on the jackrabbit website [2] (however that page is a bit
off-topic).
regards
marcel
[1]
http://svn.apache.org/repos/asf/jackrabbit/trunk/jackrabbit/src/main/config/repository.xml
[2] http://jackrabbit.apache.org/doc/arch/operate/query.html
[3] http://issues.apache.org/jira/browse/JCR-264