lucene-java-user mailing list archives

From George Kelvin <george.kelvin...@gmail.com>
Subject Re: Questions about FuzzyQuery in Lucene 4.x
Date Tue, 29 Jan 2013 19:43:03 GMT
Hi Jack,

The problematic query is "scar"+"wads".

There are several (more than 10) documents in the data with the content
"star wars", so I think that query should be able to find all these
documents.
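
As a sanity check on that expectation, the edit distances can be verified with a small standalone sketch. It uses a textbook dynamic-programming Levenshtein distance, not Lucene's automaton-based matching, and the class and method names (`EditDistanceCheck`, `distance`) are made up for illustration.

```java
// Standalone check of the edit distances involved. This is a textbook
// Levenshtein implementation, NOT the code Lucene uses internally.
class EditDistanceCheck {

    // Insertions, deletions and substitutions each cost 1.
    // The comparison is case-sensitive.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int subst = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int del = prev[j] + 1;
                int ins = curr[j - 1] + 1;
                curr[j] = Math.min(subst, Math.min(del, ins));
            }
            int[] tmp = prev; // reuse the two rows instead of reallocating
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println("scar -> star: " + distance("scar", "star"));
        System.out.println("wads -> wars: " + distance("wads", "wars"));
        System.out.println("scar -> scor: " + distance("scar", "scor"));
    }
}
```

Each pair differs by a single substitution, so with ed = 1 both "star" and "wars" are valid matches for the query terms, and "scor" is an equally close match for "scar".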

I tried to put together a minimal test case, but I couldn't shrink the data
very far while still showing the failure. The smallest data set that shows it
so far is around 2 million documents.

However, I found a suspicious document with content "scor". If I remove it
from the 2 million documents data, that query can find all the "star wars"
documents. If I add it back, then the query can't find any.

I also tried reducing the data further to 1 million documents and adding the
"scor" document back, but in that case the query still finds all the
"star wars" documents.

Is it possible that Lucene somehow fails to find all the valid terms within
the edit distance?
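
One possible explanation, stated as an assumption rather than a confirmed diagnosis: FuzzyQuery in 4.x does not necessarily search every term within the edit distance. As far as I understand, it rewrites to a bounded number of best-matching terms (the maxExpansions constructor argument, 50 by default). If more terms than that lie within one edit of "scar", an extra equally close term such as "scor" could push "star" out of the kept set. The sketch below is a toy model of that effect in plain Java. The names (`BoundedExpansionSketch`, `keepTop`, `sameScore`), the limit of 3, the candidate terms, and the tie-breaking by term order are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a bounded term expansion. It does not call Lucene; it only
// shows how a cap on kept terms can drop a valid match.
class BoundedExpansionSketch {

    // A candidate index term and its similarity to the query term.
    static final class Candidate {
        final String term;
        final double score;
        Candidate(String term, double score) {
            this.term = term;
            this.score = score;
        }
    }

    // Builds candidates that all share one score, as terms that are each one
    // substitution away from a four-letter query term would (assumed 0.75).
    static List<Candidate> sameScore(String... terms) {
        List<Candidate> out = new ArrayList<>();
        for (String t : terms) {
            out.add(new Candidate(t, 0.75));
        }
        return out;
    }

    // Keeps at most maxExpansions candidates: higher score first, and on a
    // tie the term that sorts earlier wins (an assumption of this model).
    static List<String> keepTop(List<Candidate> candidates, int maxExpansions) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort((x, y) -> {
            int byScore = Double.compare(y.score, x.score);
            return byScore != 0 ? byScore : x.term.compareTo(y.term);
        });
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < Math.min(maxExpansions, sorted.size()); i++) {
            kept.add(sorted.get(i).term);
        }
        return kept;
    }

    public static void main(String[] args) {
        int limit = 3; // stand-in for maxExpansions, kept tiny for the demo
        System.out.println(keepTop(sameScore("scab", "scan", "star"), limit));
        System.out.println(keepTop(sameScore("scab", "scan", "scor", "star"), limit));
    }
}
```

In this model "star" is kept while only three candidates exist and is dropped once "scor" is added. If the real cause is similar, passing a larger maxExpansions through the five-argument constructor, FuzzyQuery(Term, int maxEdits, int prefixLength, int maxExpansions, boolean transpositions), would be one thing to try.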

Thanks!

George


On Tue, Jan 29, 2013 at 10:02 AM, Jack Krupansky <jack@basetechnology.com> wrote:

> I also noticed that you have "MUST" for your full string of fuzzy terms -
> that means every one of them must appear in an indexed document to be
> matched. Is it possible that maybe even one term was not in the same
> indexed document?
>
> Try to provide a complete example that shows the input data and the query
> - all the literals. In other words, construct a minimal test case that
> shows the failure.
>
>
> -- Jack Krupansky
>
> -----Original Message----- From: George Kelvin
> Sent: Tuesday, January 29, 2013 12:28 PM
>
> To: java-user@lucene.apache.org
> Subject: Re: Questions about FuzzyQuery in Lucene 4.x
>
> Hi Jack,
>
> ed is set to 1 here and I have lowercased all the data and queries.
>
> Regarding the indexed data factor you mentioned, can you elaborate more?
>
> Thanks!
>
> George
>
>
> On Tue, Jan 29, 2013 at 9:10 AM, Jack Krupansky <jack@basetechnology.com> wrote:
>
>> That depends on the value of "ed", and the indexed data.
>>
>> Another factor to take into consideration is that a case change ("Star"
>> vs. "star") also counts as an edit.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: George Kelvin
>> Sent: Tuesday, January 29, 2013 11:49 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: Questions about FuzzyQuery in Lucene 4.x
>>
>>
>> Hi Jack,
>>
>> Thanks for your reply!
>>
>> I don't think I passed the prefixLength parameter in.
>>
>> Here is the code I used to build the FuzzyQuery:
>>
>>            String[] words = str.split("\\+");
>>            BooleanQuery query = new BooleanQuery();
>>
>>            for (int i=0; i<words.length; i++)
>>            {
>>                Term t = new Term(field, words[i]);
>>                FuzzyQuery fq = new FuzzyQuery(t, ed);
>>                query.add(fq, BooleanClause.Occur.MUST);
>>            }
>>
>>            int k = 10;
>>            TopDocs results = searcher.search(query, k);
>>
>> Does it look right to you?
>>
>> Thanks!
>>
>> George
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>
