lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Carl Austin" <Carl.aus...@detica.com>
Subject RE: Is there a way for me to handle a multiword synonym correctly?
Date Fri, 07 Aug 2009 14:50:08 GMT
I may be over simplifying here but in this case don't you just need to
use an analyzer that breaks the word "SAP.EM.FIN.AM" on full stops and
throws them out, so that it is indexed as terms "SAP" "EM" "FIN" "AM".
This is the same as it will index "SAP EM FIN AM" as long as you break
on whitespace too. I.E SimpleAnalyzer (runs of letter characters are
tokens)

Then the query for "SAP EM FIN AM" will match both.

Carl


-----Original Message-----
From: Donna L Gresh [mailto:gresh@us.ibm.com] 
Sent: 07 August 2009 15:35
To: java-user@lucene.apache.org
Subject: Is there a way for me to handle a multiword synonym correctly?

I saw some discussion on the board but I'm not sure I've got quite the
same problem. As an example, I have a query that might be a technical
skill:

SAP EM FIN AM

I would like that to match a document that has *either* SAP.EM.FIN.AM or
"SAP EM FIN AM" (in that order and all together, not spread out through
the document).

The approach I had tried was at index time if I saw SAP.EM.FIN.AM I
would consider "SAP EM FIN AM" a synonym for it, using the Lucene in
Action example. Luke shows me that I have two terms in the index for
this
document: SAP.EM.FIN.AM and "SAP EM FIN AM" (one term). Thus it appears
differently in the index than if it had been organically found as just
the string of tokens, in which case there would be separate terms for
SAP, EM, and so on. 

At query time if I look for "SAP EM FIN AM" it is formed as a phrase
query with a slop of 0 which does *not* match the one term version "SAP
EM FIN AM". (For that matter a simple boolean query doesn't find it
either) Luke confirms the fact that the phrase query does not find my
synonym term. The query "SAP EM FIN AM" finds *only* documents that
originally had those separated tokens in them.

Is there a way to handle this situation such that at index time I can
turn SAP.EM.FIN.AM into something that will be found with a query for
"SAP EM FIN AM"?

Thanks for any pointers

Donna 



This message should be regarded as confidential. If you have received this email in error
please notify the sender and destroy it immediately.
Statements of intent shall only become binding when confirmed in hard copy by an authorised
signatory.  The contents of this email may relate to dealings with other companies within
the Detica Limited group of companies.

Detica Limited is registered in England under No: 1337451.

Registered offices: Surrey Research Park, Guildford, Surrey, GU2 7YP, England.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message