lucene-java-user mailing list archives

From Steven A Rowe <sar...@syr.edu>
Subject RE: Lucene query with long strings
Date Tue, 23 Mar 2010 22:05:53 GMT
Hi Aaron,

Your "false positives" comments point to a mismatch between what you're currently asking Lucene
for (any document matching any one of the terms in the query) and what you want (only fully
"correct" matches).

You need to identify the terms of the query that MUST match and mark them as required
(QueryParser treats a "+" prefix on a term as meaning that the term is required).

If your queries come from sources that don't reliably match the indexed values, you may need
to use synonyms to map between e.g. "California" and "CA", and then require that at least
one of the synonyms matches (e.g. "+(California CA)").
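
As a rough illustration of the two points above, here is a minimal sketch that only assembles the query string; it does not call Lucene itself. The class name AffiliationQuery, its methods, and the tiny synonym table are hypothetical and made up for this example. The resulting string is meant to be handed to QueryParser as usual.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: builds a QueryParser-style query string in which some
// terms are required ("+term") and synonym groups are required as a unit
// ("+(a b)", meaning at least one member of the group must match).
class AffiliationQuery {

    // Illustrative synonym table (an assumption); a real one would come from your own data.
    static final Map<String, List<String>> SYNONYMS = Map.of(
            "california", List.of("California", "CA"),
            "ca", List.of("California", "CA"));

    // Multi-word terms are wrapped in double quotes so they are parsed as a phrase.
    static String quote(String term) {
        return term.contains(" ") ? "\"" + term + "\"" : term;
    }

    // "+term" when no synonyms are known, otherwise "+(variant1 variant2 ...)".
    static String requiredClause(String term) {
        List<String> variants = SYNONYMS.get(term.toLowerCase(Locale.ROOT));
        if (variants == null) {
            return "+" + quote(term);
        }
        List<String> quoted = new ArrayList<>();
        for (String v : variants) {
            quoted.add(quote(v));
        }
        return "+(" + String.join(" ", quoted) + ")";
    }

    // Required clauses first, then optional terms that only influence scoring.
    static String build(List<String> required, List<String> optional) {
        List<String> clauses = new ArrayList<>();
        for (String term : required) {
            clauses.add(requiredClause(term));
        }
        for (String term : optional) {
            clauses.add(quote(term));
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        // Prints: +Stanford +(California CA) School Medicine
        System.out.println(build(
                List.of("Stanford", "California"),
                List.of("School", "Medicine")));
    }
}
```

Note that this sketch does not escape QueryParser special characters (such as "&" or ":") in the input terms; real input would need that before parsing.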

Steve

On 03/23/2010 at 5:08 PM, Aaron Schon wrote:
> Hi all, I have been playing with Lucene for a while now, but I am stuck on a
> perplexing issue.
> 
> I have an index, with a field "Affiliation", some example values are:
> 
> - "Stanford University School of Medicine, Palo Alto, CA USA"
> - "Institute of Neurobiology, School of Medicine, Stanford University,
>   Palo Alto, CA"
> - "School of Medicine, Harvard University, Boston MA"
> - "Brigham & Women's, Harvard University School of Medicine, Boston, MA"
> - "Harvard University, Cambridge MA"
> 
> and so on... (the bottom-line being the affiliations are written in
> multiple ways with no apparent consistency)
> 
> When I query the index on the affiliation field using, say, "School of
> Medicine, Stanford University, Palo Alto, CA" (with QueryParser) to
> find all Stanford-related documents, I get a lot of false positives,
> presumably because of the presence of "School of Medicine" etc.
> (Note: I cannot use a phrase query because of the variability in the way
> the affiliation is written.)
> 
> I have tried the following:
> 
> 1. Used a SpanNearQuery, splitting the search phrase on whitespace
> (here I get no results!)
> 2. Tried boosting (using ^), splitting on the commas and giving
> the last parts, such as "Palo Alto CA", a much higher boost than the
> initial phrases. Here I still get lots of false positives.
> 
> Any suggestions on how to approach this? Is SpanNear the way to go? Any
> other ideas on why I get 0 results?
> 
> Thanks in advance for helping a newbie.
> 
> AS



