lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marc Hadfield <>
Subject Re: use Lucene to index sentences
Date Mon, 06 Feb 2006 21:39:21 GMT
Hi AJ -

Depending on your need, you could create a lucene document for each 
sentence (in which case searching and returning sentences is trivial), 
or create a lucene document for each of your documents, with embedded 
sentence start/stop markers (as a special symbol).  or, instead of a 
special symbol, you can increase the token count after each 
end-of-sentence so that there is a large gap inbetween sentences -- this 
will give higher scores to intra-sentence matches.

if you insert special sentence marker symbols, then you could use a span 
search to guarantee that a phrase happens inside a sentence.  when a 
match occurs, you can use the document's termpositionvector object to 
re-create the original sentence, or alternatively, use the embedded 
sentence number in lucene (perhaps symbols like "__sentence_start" and 
"__sentence_num_20") to grab the original sentence from a file 
containing the full text with sentence markers (perhaps xml tags: 
"<sentence num=20>").

I use the techniques such as the above for a very large lucene index of 
documents with embedded sentence markers.  There are various trade-offs 
in terms of index size (how much info to keep in index), expected query 
performance, and so on.

---marc hadfield

AJ Chen wrote:

>I'll appreciate any advice on whether Lucene is appropriate for index/search
>sentences.  I have millions of documents broken down into millions of
>sentences. Each sentence does not exist as a document.  All these sentences
>are in a small number of big files. How can I use Lucene to index/search the
>sentences? Search will return which sentence matches the query.  If Lucene
>does not do it, any better approach besides using mysql database?

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message