lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Asbjørn A. Fellinghaug <asbj...@fellinghaug.com>
Subject Re: Help with phrase indexing
Date Thu, 14 May 2009 12:48:38 GMT
Seid Mohammed:
> I need this exactly solution.
> Can you please tell me how could I DO IT?
> I am badly in nead of it

> > On Thu, May 14, 2009 at 5:58 AM, Ridzwan Aminuddin
> > <ra@world-check.com>wrote:
> >
> >> Hi all,
> >>
> >> Is Lucene able to index phrases instead if individual terms? If it is, can
> >> we also feed it a 'thesaurus or dictionary' of phrases that it should look
> >> out for when indexing. Thanks in advance,
> >>
> >> Ridzwan

Hi Seid.

I constructed something like this in my master degree. What I bascially
did was to write a custom analyzer. However I identified the top 100
most searched phrases in my collection, and then filtered for those. If
a document contained a identified phrase, then the analyzer would
construct a Token for those terms.

Another approach is to decide that only two-word phrases should be
search for. And, for instance, find verbs and use the "word right after".

Example string: "The boat was very fast".
Tokens: "The boat", "was", "very fast".

If your analyzer contain a range of phrases to search for, then it
should be straightforward to identify those phrases and index them as
Tokens. Just remember to set the start and end offset of the Token to
correct values.

You can have a look at this thesis here:
http://asbjorn.fellinghaug.com/filer/master/Master_thesis.pdf

And the java source code here:
http://daim.idi.ntnu.no/show.php?type=vedlegg&id=3429

Hope this helps.

-- 
Asbjørn A. Fellinghaug
asbjorn@fellinghaug.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message