lucene-java-user mailing list archives

From mrodent <>
Subject question about using lucene on large documents
Date Tue, 04 Feb 2014 20:53:21 GMT

This question may well be very familiar to experienced Lucene people... in
which case all I need is to be pointed somewhere.  I am new.

If you have a large document, e.g. a large Word file, and you extract its text,
e.g. by using Apache POI, what techniques are best for splitting that text up
for indexing?

It seems to me that if you split it so that the text of each paragraph
becomes a Document (in the Lucene index sense), then obviously each search
will only be carried out within that paragraph... so maybe you should instead
split it into blocks of text, i.e. runs of paragraphs with no intervening
blank (whitespace-only) paragraphs.  But what if those blocks turn out to be
too big, or too small, to work well as Documents?
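For what it's worth, the block-splitting idea above can be sketched in plain Java (no Lucene calls; the class and method names here are just illustrative). Each returned block would then become the text field of one Lucene Document:

```java
import java.util.ArrayList;
import java.util.List;

public class BlockSplitter {
    // Split extracted text into blocks: runs of non-blank lines separated
    // by blank (whitespace-only) lines.  Each block is a candidate for
    // indexing as one Lucene Document.
    public static List<String> splitIntoBlocks(String text) {
        List<String> blocks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : text.split("\\r?\\n")) {
            if (line.trim().isEmpty()) {
                // Blank paragraph: close off the current block, if any.
                if (current.length() > 0) {
                    blocks.add(current.toString().trim());
                    current.setLength(0);
                }
            } else {
                current.append(line).append('\n');
            }
        }
        if (current.length() > 0) {
            blocks.add(current.toString().trim());
        }
        return blocks;
    }
}
```

A block that comes out too large could be split further; adjacent tiny blocks could be merged before indexing.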

It occurs to me that under some circumstances you might actually want your
Documents to "overlap"... i.e. the text at the end of one Document is also the
text at the beginning of the next Document... making it less likely that the
index will miss terms which occur quite close to one another but on either
side of a Document boundary.
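That overlapping scheme might look like the following sketch, again in plain Java with illustrative names: a sliding window of `size` paragraphs advancing by `size - overlap` each time, so that consecutive chunks share `overlap` paragraphs and terms near a boundary co-occur in at least one chunk:

```java
import java.util.ArrayList;
import java.util.List;

public class OverlappingChunker {
    // Group paragraphs into overlapping windows.  Each window of `size`
    // paragraphs shares `overlap` paragraphs with the next window; each
    // window would be indexed as one Lucene Document.
    public static List<String> chunk(List<String> paras, int size, int overlap) {
        if (overlap >= size) {
            throw new IllegalArgumentException("overlap must be smaller than size");
        }
        List<String> chunks = new ArrayList<>();
        int step = size - overlap;
        for (int start = 0; start < paras.size(); start += step) {
            int end = Math.min(start + size, paras.size());
            chunks.add(String.join("\n", paras.subList(start, end)));
            if (end == paras.size()) {
                break; // last window reached the end of the text
            }
        }
        return chunks;
    }
}
```

With `size = 3` and `overlap = 1`, five paragraphs yield two chunks: paragraphs 1-3 and paragraphs 3-5, with paragraph 3 indexed twice.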

But surely that is an inefficient way of storing index data (and even more so
the stored text "content" itself), because it is repetitious.

So it makes me wonder whether the developers behind Lucene have made
provision for such circumstances... is there a way of making the presence of
a search term in Document N influence the ranking of Document N+1 (for
example, if another search term is found in the latter)?  Or rather, should
the two Documents then be given a ranking together, as a pair?
