From: Erik Hatcher <erik@ehatchersolutions.com>
Subject: Re: Extending the similarity class
Date: Sat, 23 Jul 2005 08:21:59 -0400
To: java-dev@lucene.apache.org


On Jul 23, 2005, at 4:45 AM, Ahmed El-dawy wrote:

>> Only terms returned from the Analyzer are considered, so if a stop
>> word is removed it does not count for tf or idf.
>>
> But I need to compare according to non indexed words also. By the way,
> Google does this.

Please provide an example or reference to support this claim.

Perhaps Google is doing something like what Nutch does by default: a
bi-gram technique that joins each common term with the term that
follows it and indexes the joined token overlapping the original one
(a position increment of zero).  This keeps searches fast when stop
words need to be considered, and still lets queries skip the stop
words when they are not part of a phrase query.
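
A minimal sketch of that idea in plain Java, not Nutch's actual code.
The class name, the tiny COMMON list, and the "-" separator are all
assumptions made for illustration.  Each common term is emitted as
usual, and a joined token is added on top of it with a position
increment of 0:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CommonGramsSketch {
    // Hypothetical list of "common" terms; a real list would be much longer.
    static final Set<String> COMMON =
        new HashSet<>(Arrays.asList("the", "a", "of", "to", "by"));

    /** One emitted token: its text and its position increment. */
    static final class Token {
        final String text;
        final int positionIncrement;

        Token(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return text + "(+" + positionIncrement + ")";
        }
    }

    /** Emits every word, plus a bi-gram for each common word and its successor. */
    static List<Token> analyze(List<String> words) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            out.add(new Token(word, 1));
            if (COMMON.contains(word) && i + 1 < words.size()) {
                // Increment 0 places the bi-gram at the same position as the common word.
                out.add(new Token(word + "-" + words.get(i + 1), 0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [the(+1), the-quick(+0), quick(+1), fox(+1)]
        System.out.println(analyze(Arrays.asList("the", "quick", "fox")));
    }
}
```

With tokens like these in the index, a phrase query containing "the"
can presumably be answered from the much rarer "the-quick" token
instead of the very long posting list for "the".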

>> This will happen automatically with PhraseQuery with a slop factor.
>> The closer the words, the better the score.  However, with a pure
>> boolean query, proximity is not considered at all (nor should it
>> be).  You can use a large slop factor for phrases such as "quick
>> fox"~100 and see how the scores work then.
>>
> This means that all words must be in the result. This is not always
> the case in my application. If I am searching for "quick brown fox",
> "quick fox" is an acceptable result.

In the case of single-term queries boolean OR'd together, Similarity's
coord factor boosts results that match more of the clauses.  It does
not take the proximity of the words into consideration.
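
To make the coord factor concrete, here is a small stand-alone sketch.
It assumes the default implementation is simply the fraction of query
clauses a document matches (overlap / maxOverlap), which is my
recollection of DefaultSimilarity; the class name is made up and the
code does not call Lucene at all:

```java
public class CoordSketch {
    /**
     * Fraction of the query's clauses that a document matched.
     * overlap    = number of clauses matched by the document
     * maxOverlap = total number of clauses in the query
     */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        // Query: quick OR brown OR fox (three clauses).
        // A document containing only "quick" and "fox" matches two of them.
        System.out.println(coord(2, 3)); // roughly 0.67
        System.out.println(coord(3, 3)); // full match, factor of 1
    }
}
```

Nothing in this factor depends on where the terms occur, so "quick fox"
and "quick ... (many words) ... fox" receive the same coord value.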

> I just need to know whether I need to resort the search results
> according to my criteria, or there are some methods to override which
> will bring results already sorted.

It seems like you're asking for a type of Query that does not
currently exist: one that does a boolean OR but scores based on
proximity of the matching terms.  Without looking it up, perhaps
SpanOrQuery already does this sort of thing, though I don't think so.
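
If re-sorting the hits yourself turns out to be necessary, one
possible proximity measure is sketched below.  This is purely
illustrative and not an existing Lucene Query or Similarity hook: the
class, the method names, and the 1 / (1 + extra) formula are all
assumptions.  It finds the shortest window of tokens covering every
query term that occurs in the document and penalizes the extra tokens
inside that window:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ProximitySketch {
    /** Query terms that actually occur in the document. */
    static Set<String> present(List<String> doc, Set<String> query) {
        Set<String> found = new HashSet<>(doc);
        found.retainAll(query);
        return found;
    }

    /** Length of the shortest token window covering all present query terms (0 if none). */
    static int minWindow(List<String> doc, Set<String> query) {
        Set<String> wanted = present(doc, query);
        if (wanted.isEmpty()) {
            return 0;
        }
        Map<String, Integer> counts = new HashMap<>();
        int covered = 0, best = Integer.MAX_VALUE, left = 0;
        for (int right = 0; right < doc.size(); right++) {
            String t = doc.get(right);
            if (wanted.contains(t) && counts.merge(t, 1, Integer::sum) == 1) {
                covered++;
            }
            // Shrink from the left while the window still covers every wanted term.
            while (covered == wanted.size()) {
                best = Math.min(best, right - left + 1);
                String l = doc.get(left++);
                if (wanted.contains(l) && counts.merge(l, -1, Integer::sum) == 0) {
                    covered--;
                }
            }
        }
        return best;
    }

    /** Hypothetical proximity factor: 1.0 when the matched terms are adjacent. */
    static float proximity(List<String> doc, Set<String> query) {
        int matched = present(doc, query).size();
        if (matched == 0) {
            return 0f;
        }
        int extra = minWindow(doc, query) - matched;
        return 1f / (1 + extra);
    }

    public static void main(String[] args) {
        Set<String> query = new HashSet<>(Arrays.asList("quick", "fox"));
        System.out.println(proximity(Arrays.asList("quick", "fox"), query));          // 1.0
        System.out.println(proximity(Arrays.asList("quick", "brown", "fox"), query)); // 0.5
    }
}
```

Note that this sketch ignores term order, so "fox quick" scores the
same as "quick fox"; an order penalty would have to be added
separately.  The factor could be multiplied into the Lucene score of
each hit before re-sorting.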

     Erik


>
>
> On 7/22/05, Erik Hatcher <erik@ehatchersolutions.com> wrote:
>
>>
>> On Jul 22, 2005, at 9:59 AM, Ahmed El-dawy wrote:
>>
>>
>>> Hello,
>>>   I am using Lucene to search plain text, but the order of the
>>> search results does not meet my needs. First, I want to know how
>>> the similarity works. Then, I need to extend it.
>>>
>>
>> Use IndexSearcher.explain() to see how each individual hit is scored
>> against a Query - this will be the clearest way to see why things
>> score the way they do.
>>
>>
>>>   First, does the similarity class work on analyzed text or original
>>> search text? To be precise, does it count the stop words as found
>>> terms or not?
>>>
>>
>> Only terms returned from the Analyzer are considered, so if a stop
>> word is removed it does not count for tf or idf.
>>
>>
>>>   Second, I want to add a factor for how close together the query
>>> terms are found in the text. For example, when I search for "quick
>>> fox", "fox quick" and "quick brown fox" should rank lower than
>>> "quick fox".
>>>
>>
>> This will happen automatically with PhraseQuery with a slop factor.
>> The closer the words, the better the score.  However, with a pure
>> boolean query, proximity is not considered at all (nor should it
>> be).  You can use a large slop factor for phrases such as "quick
>> fox"~100 and see how the scores work then.
>>
>>     Erik
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>>
>
>
> -- 
> Regards,
> Ahmed Saad
>

