lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Donna L Gresh <gr...@us.ibm.com>
Subject RE: Is there a way for me to handle a multiword synonym correctly?
Date Fri, 07 Aug 2009 16:15:18 GMT
I have to think about this a bit, but that may work. I just have to make 
sure no "undesirable" side effects occur. I certainly want to be able to 
search for a phrase and not have it match all the individual bits, but 
that should already work using the mechanism I already have in place.

Donna 


"Carl Austin" <Carl.austin@detica.com> wrote on 08/07/2009 10:50:08 AM:

> [image removed] 
> 
> RE: Is there a way for me to handle a multiword synonym correctly?
> 
> Carl Austin 
> 
> to:
> 
> java-user
> 
> 08/07/2009 10:50 AM
> 
> Please respond to java-user
> 
> I may be over simplifying here but in this case don't you just need to
> use an analyzer that breaks the word "SAP.EM.FIN.AM" on full stops and
> throws them out, so that it is indexed as terms "SAP" "EM" "FIN" "AM".
> This is the same as it will index "SAP EM FIN AM" as long as you break
> on whitespace too. I.E SimpleAnalyzer (runs of letter characters are
> tokens)
> 
> Then the query for "SAP EM FIN AM" will match both.
> 
> Carl
> 
> 
> -----Original Message-----
> From: Donna L Gresh [mailto:gresh@us.ibm.com] 
> Sent: 07 August 2009 15:35
> To: java-user@lucene.apache.org
> Subject: Is there a way for me to handle a multiword synonym correctly?
> 
> I saw some discussion on the board but I'm not sure I've got quite the
> same problem. As an example, I have a query that might be a technical
> skill:
> 
> SAP EM FIN AM
> 
> I would like that to match a document that has *either* SAP.EM.FIN.AM or
> "SAP EM FIN AM" (in that order and all together, not spread out through
> the document).
> 
> The approach I had tried was at index time if I saw SAP.EM.FIN.AM I
> would consider "SAP EM FIN AM" a synonym for it, using the Lucene in
> Action example. Luke shows me that I have two terms in the index for
> this
> document: SAP.EM.FIN.AM and "SAP EM FIN AM" (one term). Thus it appears
> differently in the index than if it had been organically found as just
> the string of tokens, in which case there would be separate terms for
> SAP, EM, and so on. 
> 
> At query time if I look for "SAP EM FIN AM" it is formed as a phrase
> query with a slop of 0 which does *not* match the one term version "SAP
> EM FIN AM". (For that matter a simple boolean query doesn't find it
> either) Luke confirms the fact that the phrase query does not find my
> synonym term. The query "SAP EM FIN AM" finds *only* documents that
> originally had those separated tokens in them.
> 
> Is there a way to handle this situation such that at index time I can
> turn SAP.EM.FIN.AM into something that will be found with a query for
> "SAP EM FIN AM"?
> 
> Thanks for any pointers
> 
> Donna 
> 
> 
> 
> This message should be regarded as confidential. If you have 
> received this email in error please notify the sender and destroy it
> immediately.
> Statements of intent shall only become binding when confirmed in 
> hard copy by an authorised signatory.  The contents of this email 
> may relate to dealings with other companies within the Detica 
> Limited group of companies.
> 
> Detica Limited is registered in England under No: 1337451.
> 
> Registered offices: Surrey Research Park, Guildford, Surrey, GU2 7YP, 
England.
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message