lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shashi Kant <sk...@sloan.mit.edu>
Subject Re: Lucene query with long strings
Date Wed, 24 Mar 2010 13:20:27 GMT
Add the common terms such as "University", "School", "Medicine",
"Institute" etc. to stopwords list, so you are left with Stanford,
"Palo Alto" etc.

Then use Ahmet's suggestion of using a booleanquery
.setMinimumNumberShouldMatch() to (say) 75% of the query string
length.

Finally, if you wish to be very precise, you can loop through the hits
collector and use a string comparison algorithm like Jaro-Winkler,
Levenstein etc. for a second-level filter.



On Tue, Mar 23, 2010 at 5:08 PM, Aaron Schon <aaron_schon@yahoo.com> wrote:
> hi all, I have been playing with Lucene for a while now, but stuck on a perplexing issue.
>
> I have an index, with a field "Affiliation", some example values are:
>
> - "Stanford University School of Medicine, Palo Alto, CA USA",
> - "Institute of Neurobiology, School of Medicine, Stanford University, Palo Alto, CA",
> - "School of Medicine, Harvard University, Boston MA",
> - "Brigham & Women's, Harvard University School of Medicine, Boston, MA"
> - "Harvard University, Cambridge MA"
>
> and so on... (the bottom-line being the affiliations are written in multiple ways with
no apparent consistency)
>
> I query the index on  the affiliation field using say "School of Medicine, Stanford
University, Palo Alto, CA" (with QueryParser) to find all Stanford related documents, I get
a lot of false +ves, presumably because of the presence of School of Medicine etc. etc. (note:
I cannot use Phrase query because of variability in the way affiliation is constructed)
>
> I have tried the following:
>
> 1. Use a SpanNearQuery by splitting the search phrase with a whitespace (here I get no
results!)
> 2. Tried boosting (using ^) by splitting with the comma and boosting the last parts such
as "Palo Alto CA" with a much higher boost than the initial phrases. Here I still get lots
of false +ves.
>
> Any suggestions on how to approach this? Is SpanNear the way to go? Any other ideas on
why I get 0 results?
>
> Thanks in advance for helping a newbie.
>
> AS
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message