lucene-java-user mailing list archives

From AJ Chen
Subject Re: use Lucene to index sentences
Date Mon, 06 Feb 2006 22:50:46 GMT
Hi Marc,
Thanks for your suggestions. Marking sentences in documents and using a span
query is a good approach. How does its performance compare to a database
approach? For example, sentences can be stored in MySQL, one sentence per
row, and searched with MySQL's full-text search feature. With a
database, it is also easy to tell which document a matched sentence
belongs to.
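
A minimal sketch of the one-sentence-per-row layout, for illustration only. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and a plain LIKE filter as a stand-in for full-text search (in MySQL the filter would be a MATCH ... AGAINST clause over a FULLTEXT index). The table name, column names, and sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical schema: one sentence per row, keyed by document and position.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentences (doc_id TEXT, sent_num INTEGER, body TEXT)"
)

# Made-up sample data, only to exercise the query below.
rows = [
    ("doc1", 0, "Lucene indexes documents."),
    ("doc1", 1, "Span queries restrict matches by position."),
    ("doc2", 0, "MySQL offers full text search."),
]
conn.executemany("INSERT INTO sentences VALUES (?, ?, ?)", rows)


def find_sentences(term):
    """Return (doc_id, sent_num, body) for every sentence containing term.

    LIKE is a substring match, not real full-text search; it is used here
    only so the sketch runs without a MySQL server.
    """
    cur = conn.execute(
        "SELECT doc_id, sent_num, body FROM sentences "
        "WHERE body LIKE ? ORDER BY doc_id, sent_num",
        (f"%{term}%",),
    )
    return cur.fetchall()
```

Because each row carries its own doc_id, a hit identifies the source document with no extra lookup, which is the convenience described above.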


On 2/6/06, Marc Hadfield wrote:
> Hi AJ -
> Depending on your need, you could create a Lucene document for each
> sentence (in which case searching and returning sentences is trivial),
> or create a Lucene document for each of your documents, with embedded
> sentence start/stop markers (as a special symbol). Or, instead of a
> special symbol, you can increase the token count after each
> end-of-sentence so that there is a large gap between sentences -- this
> will give higher scores to intra-sentence matches.
>
> If you insert special sentence marker symbols, then you could use a span
> search to guarantee that a phrase occurs inside a sentence. When a
> match occurs, you can use the document's TermPositionVector object to
> re-create the original sentence, or alternatively, use the embedded
> sentence number in Lucene (perhaps symbols like "__sentence_start" and
> "__sentence_num_20") to grab the original sentence from a file
> containing the full text with sentence markers (perhaps XML tags:
> "<sentence num=20>").
>
> I use techniques such as the above for a very large Lucene index of
> documents with embedded sentence markers. There are various trade-offs
> in terms of index size (how much info to keep in the index), expected
> query performance, and so on.
> ---marc hadfield
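
The position-gap idea in the quoted message above can be sketched without Lucene. This is a toy illustration, not Lucene's implementation: the function names, the gap of 100, and the regex tokenizer are all assumptions made for the example. Each token gets a position, the position jumps by a large gap at every sentence boundary, and a proximity check with a small slop then cannot match across sentences.

```python
import re

# Assumed gap size; it only needs to exceed the largest slop used in queries.
SENTENCE_GAP = 100


def index_positions(sentences, gap=SENTENCE_GAP):
    """Map each lowercased token to its positions, adding `gap` after each sentence."""
    postings = {}
    pos = 0
    for sentence in sentences:
        for token in re.findall(r"\w+", sentence.lower()):
            postings.setdefault(token, []).append(pos)
            pos += 1
        pos += gap  # widen the distance between neighbouring sentences
    return postings


def near(postings, a, b, slop):
    """True if tokens a and b occur within `slop` positions of each other."""
    return any(
        abs(pa - pb) <= slop
        for pa in postings.get(a, [])
        for pb in postings.get(b, [])
    )
```

With the gap in place, two words that are adjacent in the raw text but sit on opposite sides of a sentence boundary end up more than `slop` positions apart, so only intra-sentence pairs satisfy the proximity check.
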
> AJ Chen wrote:
> >I'll appreciate any advice on whether Lucene is appropriate for index/search
> >sentences.  I have millions of documents broken down into millions of
> >sentences. Each sentence does not exist as a document.  All these sentences
> >are in a small number of big files. How can I use Lucene to index/search the
> >sentences? Search will return which sentence matches the query.  If Lucene
> >does not do it, any better approach besides using mysql database?
> >
> >Thanks,
> >AJ
> >
> >
> >