lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Simon Willnauer <simon.willna...@googlemail.com>
Subject Re: Problem searching in the same sentence
Date Thu, 16 Sep 2010 17:57:23 GMT
Hi Sirish,

see my comments inline...

On Thu, Sep 16, 2010 at 7:39 PM, Sirish Vadala <sirishreddy@gmail.com> wrote:
>
> Hello All:
>
> Can any one suggest me the best way to allow me to perform a sentence
> specific phrase search?
>
> Eg: Let the indexed text be:
>
> If you are posting a question, please try search first. Your question may
> have already been answered. Don't post repeatedly. Wait for a few days.
> People will read your post by email.
>
> Now if I search for the phrase 'post repeatedly Wait for a few', still I am
> able to retrieve the document even though they are in different sentences.
>
> Currently I am using StandardAnalyzer and this is how I am generating lucene
> documents:
>
> Field field = new Field(fieldName, validFieldValue, Field.Store.YES,
> Field.Index.ANALYZED);
> document.add(field);
>
> Is there a way to keep track of different sentences while indexing the
> content.
What you essentially need to do is you need to tell lucene where the
sentence ends are while you are indexing your text. PhraseQuery uses
positional information to retrieve phrase matches and those positional
information is created by your TokenStream (you get from the
analyzer). The TokenStream sets the PositionIncrementAttribute for
each term with an according delta between the current term and the
previous term. Yet, if you index "hello world. Here am I" the
posIncrement between "world" and "here" will be 1 since
StandardTokenizer will throw away the punctuation. what you
essentially need to do is to introduce a larger posIncrement at
sentence borders so that PhraseQuery does not consider "world" and
"here" to be a phrase.

You can either write your own Tokenizer which is the more advance
version or you can simply add multiple fields to you document one for
each sentence. If you do that you need to set the  position increment
gap which is done by subclassing Analyzer and override
Analyzer#getPositionIncrementGap() to return 100 or something like
that.
The document you need to build would then look like the following pseudocode:

doc = Document()
doc.addField(Field("foo", "hello world.",...)
doc.addField(Field("foo", "here am I", ...)

I hope that helps

simon




There are several possibilities to do what you want and maybe the
easiest would be to split you
> Any hint would be appreciated.
> Thanks.
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Problem-searching-in-the-same-sentence-tp1501269p1501269.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message