lucene-java-user mailing list archives

From Michael Sokolov
Subject Re: question about using lucene on large documents
Date Wed, 05 Feb 2014 00:11:56 GMT
Ideally you would chunk a document at logical boundaries that make
sense as units of both search and presentation.  For some content these
boundaries don't align: for example, you might want to search for matches
within a paragraph scope, or within a section, chapter, or part of a
book.  Often, though, books break down neatly into a sequence of
more-or-less self-contained units (usually bigger than paragraphs: think
chapters).
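
To make the chunking idea concrete, here is a minimal sketch in plain
Java, with no Lucene or POI dependency.  It assumes the text has already
been extracted and that a blank (whitespace-only) line is the logical
boundary, which will not hold for every document.  The class name
BoundaryChunker, its method names, and the minChars threshold are all
made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits extracted text into candidate index units.
final class BoundaryChunker {

    /** Splits text into blocks separated by one or more whitespace-only lines. */
    static List<String> splitAtBlankLines(String text) {
        List<String> blocks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : text.split("\\R", -1)) {
            if (line.trim().isEmpty()) {
                flush(current, blocks);          // boundary reached
            } else {
                if (current.length() > 0) current.append('\n');
                current.append(line);
            }
        }
        flush(current, blocks);
        return blocks;
    }

    /**
     * Accumulates consecutive blocks until each unit has at least minChars
     * characters. A short trailing remainder is attached to the last unit.
     */
    static List<String> mergeSmall(List<String> blocks, int minChars) {
        List<String> merged = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (String block : blocks) {
            if (pending.length() > 0) pending.append("\n\n");
            pending.append(block);
            if (pending.length() >= minChars) {
                merged.add(pending.toString());
                pending.setLength(0);
            }
        }
        if (pending.length() > 0) {
            if (merged.isEmpty()) {
                merged.add(pending.toString());
            } else {
                int last = merged.size() - 1;
                merged.set(last, merged.get(last) + "\n\n" + pending);
            }
        }
        return merged;
    }

    private static void flush(StringBuilder current, List<String> blocks) {
        if (current.length() > 0) {
            blocks.add(current.toString());
            current.setLength(0);
        }
    }
}
```

Each resulting string would then become the text of one indexed
document.  Merging short blocks is only one way to deal with units that
are too small; splitting at headings instead may suit chaptered content
better.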

If you need to be concerned about overlapping scopes, I would create a
nested-doll container structure so you can choose which level to search
and which to display, maintaining links between the documents so you can
navigate or reassemble the whole later.  Don't be afraid of the
inefficiency if you need it, but don't create it if you don't, because
it will complicate your life.
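
As a rough illustration of the nested structure, the sketch below builds
book, chapter, and paragraph units that each carry a link to their
parent.  It is plain Java with no Lucene dependency; the class
NestedChunker, the Unit fields, and the id scheme ("book-1/ch1/p2") are
assumptions made for this example.  In an index, each Unit would
presumably become one document, with id, parentId, and level kept as
exact-match fields and text as the analyzed field.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the nested-container idea; not a Lucene API.
final class NestedChunker {

    /** One searchable unit. parentId is null for the top-level book. */
    static final class Unit {
        final String id;
        final String parentId;
        final String level;   // "book", "chapter", or "paragraph"
        final String text;

        Unit(String id, String parentId, String level, String text) {
            this.id = id;
            this.parentId = parentId;
            this.level = level;
            this.text = text;
        }
    }

    /**
     * Builds units at three levels. Each inner list holds the paragraphs
     * of one chapter. Text is repeated at every level, which is the
     * storage cost of being able to search at any scope.
     */
    static List<Unit> build(String bookId, List<List<String>> chapters) {
        List<Unit> lower = new ArrayList<>();
        StringBuilder bookText = new StringBuilder();
        for (int c = 0; c < chapters.size(); c++) {
            String chapterId = bookId + "/ch" + (c + 1);
            List<String> paras = chapters.get(c);
            String chapterText = String.join("\n\n", paras);
            lower.add(new Unit(chapterId, bookId, "chapter", chapterText));
            for (int p = 0; p < paras.size(); p++) {
                String paraId = chapterId + "/p" + (p + 1);
                lower.add(new Unit(paraId, chapterId, "paragraph", paras.get(p)));
            }
            if (bookText.length() > 0) bookText.append("\n\n");
            bookText.append(chapterText);
        }
        List<Unit> units = new ArrayList<>();
        units.add(new Unit(bookId, null, "book", bookText.toString()));
        units.addAll(lower);
        return units;
    }

    /** Returns the direct children of parentId, in original order. */
    static List<Unit> childrenOf(List<Unit> units, String parentId) {
        List<Unit> children = new ArrayList<>();
        for (Unit u : units) {
            if (parentId.equals(u.parentId)) children.add(u);
        }
        return children;
    }
}
```

Searching would then be restricted to one level at a time, and a hit on
a paragraph can be shown in context by following parentId up to its
chapter, or reassembled by walking childrenOf downward.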

Basically - there is no single right answer; it depends on the content 
and the use cases.


On 2/4/2014 3:53 PM, mrodent wrote:
> Hi,
> This question may well be very familiar to experienced Lucene people... in
> which case all I need is to be pointed somewhere.  I am new.
> If you have a large document, e.g. a large Word file, and you want to split
> it into text, e.g. by using Apache POI, what techniques are best used?
> It seems to me that if you split it so that the text of each paragraph
> becomes a Document (in the Lucene index sense) then obviously each search
> will only be carried out within that para... so maybe you should split it
> into blocks of text, i.e. a run of paras where no text-free (white space
> only) paras occur.  But supposing those are too big as Documents, or too
> small as Documents?
> It occurs to me that under some circs you might actually want your Documents
> to be "overlapping"... i.e. the text at the end of one Document is also the
> text at the beginning of the next Document... thus making it more unlikely
> that the index will miss terms which are quite close to one another.
> But surely this must be an inefficient way of storing index data (and all
> the more so the text "content" itself)... because repetitious.
> So then it makes me wonder whether the developers behind Lucene have made
> provision for such circs ... is there a way of making the presence of a
> search term in Document N influence the ranking of Document N+1 (for example
> if another search term is found in the latter)?  Or rather, both Documents,
> as a pair, should then be given a ranking, as a pair of Documents.
