lucene-java-user mailing list archives

From "Nada Mimouni" <mimo...@tk.informatik.tu-darmstadt.de>
Subject RE: Phrase indexing and searching with Lucene
Date Thu, 19 Feb 2009 08:14:55 GMT

Hello,

Thank you, Erick, for this detailed answer; it makes things clearer in my mind.

>I'm still not clear why the built-in phrase query syntax won't work.

I have written a set of Java classes (using the Lucene classes) to index and search a
collection of documents for a set of queries.
To test my system, I use a corpus that consists of a collection of queries (n queries) and
documents (m documents).
I started by creating one index for all queries and another one for all documents. Then I
run the search to match the queries index against the documents index.
I use a TREC evaluation tool to generate a file that lists all hits (matches) between
queryID and documentID, with their scores.

In this first step I index only terms, so the search process (as I have it now) looks
only for term matches between the query terms and the document terms.
Now I want to get better results (better matching) by adding phrases to the terms.

I don't know whether it makes a difference which of these I do:
           1) index phrases and terms (erick, erickson, thinks, small, thoughts,
erick erickson, erickson thinks, small thoughts, erickson thoughts) and then search for both
           2) keep the indexing process as it is (erick, erickson, thinks, small, thoughts)
and then search for phrases (PhraseQuery: erick erickson, erickson thinks, small thoughts,
erickson thoughts) and terms
Any idea?
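
One thing that may help when comparing the two options: if phrases are generated as adjacent word pairs at indexing time, only contiguous pairs come out. The sketch below is plain Java with no Lucene classes. The names ShingleSketch, tokenize and bigrams are made up for illustration, and the tokenizer is only a crude stand-in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy illustration only: plain Java, no Lucene classes involved. */
class ShingleSketch {

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns every pair of adjacent tokens, joined by a single space. */
    static List<String> bigrams(List<String> tokens) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            pairs.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> terms = tokenize("Erick Erickson thinks small thoughts");
        System.out.println(terms);          // the single-word terms
        System.out.println(bigrams(terms)); // the adjacent word pairs
    }
}
```

For the example sentence this gives the pairs erick erickson, erickson thinks, thinks small and small thoughts. A pair such as "erickson thoughts" is not adjacent in the text, so a step like this would not produce it; at search time it would have to be handled by a phrase search that tolerates a gap.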


>Some examples of what you put in your index and what searches
>you expect to return results for your example AND searches you do
>NOT want to hit that document would be a great help.

input: 

*Query*  
898    Why is the sun bright?

*Documents* 
7568  Star, large celestial body composed of gravitationally contained hot gases emitting
electromagnetic radiation, especially light, as a result of nuclear reactions inside the star.
The sun is a star.
7567  The sun has a magnitude of -26.7, inasmuch as it is about 10 billion times as bright
as Sirius in the earth's sky. 

output: 

qID     dID     score
898     7568    0.13   (not relevant)
898     7567    1      (relevant)


In this example, Lucene judges document 7567 to be relevant to the query (since it contains
all the query terms); however, "bright" here is relative to Sirius (what we need is to match
"sun bright").
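
To make the gap concrete, here is a small sketch that measures how far apart "sun" and "bright" are in the query and in the two documents. It is plain Java, not Lucene code. ProximitySketch, tokenize and minDistance are hypothetical names, and the tokenizer is a rough assumption (lower-case, split on non-alphanumerics) that a real analyzer would not reproduce exactly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy illustration only: plain Java, no Lucene classes involved. */
class ProximitySketch {

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /**
     * Smallest difference in token position between any occurrence of a and
     * any occurrence of b, in either order. Returns -1 if one term is missing.
     */
    static int minDistance(List<String> tokens, String a, String b) {
        int lastA = -1;
        int lastB = -1;
        int best = -1;
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.equals(a)) lastA = i;
            if (t.equals(b)) lastB = i;
            if (lastA >= 0 && lastB >= 0) {
                int d = Math.abs(lastA - lastB);
                if (best < 0 || d < best) best = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String query = "Why is the sun bright?";
        String doc7567 = "The sun has a magnitude of -26.7, inasmuch as it is about "
                + "10 billion times as bright as Sirius in the earth's sky.";
        String doc7568Tail = "The sun is a star.";
        System.out.println(minDistance(tokenize(query), "sun", "bright"));
        System.out.println(minDistance(tokenize(doc7567), "sun", "bright"));
        System.out.println(minDistance(tokenize(doc7568Tail), "sun", "bright"));
    }
}
```

With this toy tokenizer the two terms are adjacent in the query (distance 1) but 16 positions apart in document 7567, and "bright" does not occur at all in the last sentence of 7568. A phrase or proximity constraint on "sun bright" would therefore treat 7567 very differently from a plain term match; whether that actually improves the evaluation is something only a test on the corpus can show.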




Best 
Nada


-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Wed 2/18/2009 3:24 PM
To: java-user@lucene.apache.org
Subject: Re: Phrase indexing and searching with Lucene
 
I'm still not clear why the built-in phrase query syntax won't work. If I
index the following terms (erick, erickson, thinks, small, thoughts)
in a single field, then searching for "erick erickson" (as a phrase query,
i.e. with double quotes when sent through a query parser or constructing
a PhraseQuery yourself) will generate a hit but "erick thinks" won't
generate a hit (unless you specify slop).
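
As a rough illustration of the behaviour described above, the following sketch checks a phrase against a list of tokens by position. It is plain Java and not how Lucene implements PhraseQuery: the names PhraseSlopSketch and matches are invented, and "slop" here only counts extra tokens between in-order phrase terms, which is a simplification (Lucene's slop also tolerates some reordering).

```java
import java.util.Arrays;
import java.util.List;

/** Toy illustration only: plain Java, no Lucene classes involved. */
class PhraseSlopSketch {

    /**
     * True if the phrase terms occur in order in tokens with at most
     * `slop` extra tokens in total between them. slop = 0 means the
     * terms must be strictly adjacent.
     */
    static boolean matches(List<String> tokens, List<String> phrase, int slop) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (int start = 0; start < tokens.size(); start++) {
            if (!tokens.get(start).equals(phrase.get(0))) {
                continue;
            }
            int pos = start;
            boolean found = true;
            // Greedily take the nearest later occurrence of each following term.
            for (int k = 1; k < phrase.size() && found; k++) {
                int next = -1;
                for (int j = pos + 1; j < tokens.size(); j++) {
                    if (tokens.get(j).equals(phrase.get(k))) {
                        next = j;
                        break;
                    }
                }
                if (next < 0) {
                    found = false;
                } else {
                    pos = next;
                }
            }
            // Extra tokens = covered span minus the span of an exact match.
            if (found && (pos - start) - (phrase.size() - 1) <= slop) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> field = Arrays.asList("erick", "erickson", "thinks", "small", "thoughts");
        System.out.println(matches(field, Arrays.asList("erick", "erickson"), 0)); // true
        System.out.println(matches(field, Arrays.asList("erick", "thinks"), 0));   // false
        System.out.println(matches(field, Arrays.asList("erick", "thinks"), 1));   // true
    }
}
```

The point is only that the index keeps single terms plus their positions, and the phrase check happens at search time.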

"thinks small thoughts" would also generate a hit

If you're saying that you only want to match on *all* the tokens, i.e.
the only way to get a hit on the above would be to search for
"erick erickson thinks small thoughts", then you can create a
field that's NOT_ANALYZED (UN_TOKENIZED before Lucene 2.4). If you
do this, though, beware that you have to do things like lower-case
terms yourself when indexing.

I have no idea what IndexTermGenerator is or what it does, but I'm
assuming that it just generates single words.

Some examples of what you put in your index and what searches
you expect to return results for your example AND searches you do
NOT want to hit that document would be a great help.

As far as searching for both, constructing a BooleanQuery with regular
TermQuerys and PhraseQuerys would work if you're constructing
your queries programmatically, or just using a Lucene query
like +termfield:word +phrasefield:"erick erickson thinks" would
work. Or, if you just require that the phrase exists you could do
it all in one field like
+field:word +field:"erick erickson thinks"
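
In plain terms, the two + clauses mean that a document only hits when both the term and the phrase are present. A toy sketch of that logic in plain Java follows; it does not use BooleanQuery or any other Lucene class, the names RequiredClausesSketch and matches are made up, and the phrase check is exact adjacency only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy illustration only: plain Java, no Lucene classes involved. */
class RequiredClausesSketch {

    /**
     * True only if the document contains the required term AND contains
     * the required phrase as a run of adjacent tokens.
     */
    static boolean matches(List<String> docTokens, String requiredTerm, List<String> requiredPhrase) {
        boolean hasTerm = docTokens.contains(requiredTerm);
        boolean hasPhrase = Collections.indexOfSubList(docTokens, requiredPhrase) >= 0;
        return hasTerm && hasPhrase;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("erick", "erickson", "thinks", "small", "thoughts");
        List<String> phrase = Arrays.asList("erick", "erickson", "thinks");
        System.out.println(matches(doc, "small", phrase)); // term and phrase both present
        System.out.println(matches(doc, "big", phrase));   // phrase present, term missing
    }
}
```

The first call reports a hit and the second does not, since a required clause that fails rules the document out regardless of the other clause.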



Best
Erick


On Wed, Feb 18, 2009 at 8:42 AM, Nada Mimouni <
mimouni@tk.informatik.tu-darmstadt.de> wrote:

>
>
> Thank you Erick.
>
> I need first to index phrases, the built-in phrase processing (with double
> quotes) comes in the search step.
> Is there any difference between :
>            1) start by indexing phrases and then make a phrase search
>            2) index terms and then search for phrases
>
>
> To make things clearer:
>
> What I am doing now:
>  - In the indexing step:  I am using "IndexTermGenerator" to generate term
> based indexes, one index for all queries I have and another one for
> documents (term means single word).
>  - In the search step : Lucene matches terms in queries index with terms in
> documents index.
>
> What I need to do:
>  - Index phrases ("multi" words) in addition to terms (single words)
>  - Search for both : phrases and terms
>
>
> Is there any idea on how to proceed?
>
> Regards
> Nada
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Wed 2/18/2009 2:10 PM
> To: java-user@lucene.apache.org
> Subject: Re: Phrase indexing and searching with Lucene
>
> Have you tried the built-in phrase processing with double quotes? e.g.
> "this is a phrase"?
>
> See the Term section at
> http://lucene.apache.org/java/2_4_0/queryparsersyntax.html
>
> Best
> Erick
>
> On Wed, Feb 18, 2009 at 5:57 AM, Nada Mimouni <
> mimouni@tk.informatik.tu-darmstadt.de> wrote:
>
> >
> >
> > Hello everybody,
> >
> > I use Lucene to index and search into text documents.
> > At present, I just index and search for single words. I want to extend this
> > to phrases (or nGrams).
> >
> > Could anyone please give me details on how to index phrases and then make a
> > phrase search?
> >
> > Thank you very much in advance for your help.
> >
> > Nada Mimouni
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
>
>
>
>


