lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dave Kor <s0454...@sms.ed.ac.uk>
Subject Re: Sentence and Paragraph searching
Date Fri, 01 Jul 2005 19:13:00 GMT
Quoting Peter Laurinc <laurinc@felisconsulting.com>:

> Hi,
>
> I'm newbie to lucene.
> I wan to ask, how to implement search for phrase that must be in
> sentence/paragraph.
> I did see som examples, that uses term position changing, but I think
> that this is not the way, because it breaks classic proximity search.
> (if one word is on end and second of begining of next sentence)

Most NLP toolkits have a sentence and paragraph boundary detector that can be
used to separate a single document into its constituent sentences and
paragraphs. The two NLP toolkits I am familiar with are OpenNLP (Open Source)
and Alias-i's LingPipe (Commercial) libraries. These toolkits can be used in
several ways to achieve what you want.

If you are ONLY interested in searching for individual sentences, then you can
use the toolkits to create an index of sentences instead of an index of
documents.

Alternatively, you can encode the sentence boundaries found by these toolkits
within the documents you are indexing, for example using special characters or
as a separate field in Lucene. After every search, do an extra check to ensure
that Lucene did not match across sentence boundaries.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message