lucene-dev mailing list archives

From Ahmed El-dawy <aseld...@gmail.com>
Subject Re: Extending the similarity class
Date Sat, 23 Jul 2005 14:33:47 GMT
> Please provide an example or reference to support this claim.
I do not know how Google works internally, but its search results
change when I add stop words to the query. Try searching for "lucene"
and then for "a lucene": the results differ, or at least their order
does. This is easy to try and takes less than a minute.

> It seems like you're asking for a different type of Query than
> currently exists that can do a boolean OR but score based on
> proximity of the matching terms.   Without looking it up, perhaps
> SpanOrQuery already does this sort of thing - though I don't think so.
I will look at SpanOrQuery anyway, but I will no doubt still need to
re-sort the search results, at least to match the query terms as they
were before analysis. The problem is not only stop words: the analyzer
I use also modifies some terms by stripping prefixes and suffixes and
changing some letters. Can I do this by extending some Lucene classes,
or will I have to do it outside Lucene after the search results are
returned?
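
To make the "outside Lucene" option concrete, here is a minimal sketch in plain Java with no Lucene calls. It assumes the raw text of each hit is available as a String. The class name ProximityRerank and the scoring rule (number of distinct raw query terms found, plus 1 / size of the smallest window containing them) are illustrative assumptions, not anything Lucene provides. Word order is ignored, so "fox quick" would tie with "quick fox".

```java
import java.util.*;

public class ProximityRerank {
    // Splits raw (unanalyzed) text into lowercase word tokens; stop words are kept.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Hypothetical score: matched term count + 1 / (smallest window holding all matched terms).
    static double score(String rawQuery, String rawText) {
        Set<String> query = new HashSet<>(tokens(rawQuery));
        List<String> doc = tokens(rawText);
        Set<String> present = new HashSet<>(doc);
        present.retainAll(query);            // query terms that occur in the text
        int matched = present.size();
        if (matched == 0) return 0.0;
        int best = Integer.MAX_VALUE;
        for (int i = 0; i < doc.size(); i++) {
            if (!present.contains(doc.get(i))) continue;
            Set<String> seen = new HashSet<>();
            for (int j = i; j < doc.size(); j++) {
                if (present.contains(doc.get(j))) seen.add(doc.get(j));
                if (seen.size() == matched) {
                    best = Math.min(best, j - i + 1);
                    break;
                }
            }
        }
        return matched + 1.0 / best;
    }

    // Returns a copy of the hits sorted by descending score.
    static List<String> rerank(String rawQuery, List<String> hits) {
        List<String> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((String h) -> score(rawQuery, h)).reversed());
        return sorted;
    }
}
```

Because the tokens come from the raw text rather than from the analyzer, stop words and unstemmed forms take part in the comparison.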


On 7/23/05, Erik Hatcher <erik@ehatchersolutions.com> wrote:
> 
> On Jul 23, 2005, at 4:45 AM, Ahmed El-dawy wrote:
> 
> >> Only terms returned from the Analyzer are considered, so if a stop
> >> word is removed it does not count for tf or idf.
> >>
> > But I also need to compare against non-indexed words. By the way,
> > Google does this.
> 
> Please provide an example or reference to support this claim.
> 
> Perhaps Google is doing something like what Nutch does by default
> with a bi-gram technique of joining terms that begin with a common
> term with the successive term and overlapping it position-increment-
> wise.  This technique allows searches to be fast when stop words need
> to be considered, but also optimized to avoid searching by stop words
> when it is not a phrase query.
> 
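
One possible reading of the bi-gram idea described above, as a plain Java sketch of the index-time side only: every word is emitted normally, and a word from a "common" set is additionally joined with its successor and emitted at the same position (position increment 0). The class name, the "-" separator and the Token type are assumptions for illustration. This is not Nutch's code and makes no claim about what Google does.

```java
import java.util.*;

public class CommonBigrams {
    // A term plus its position increment relative to the previous token.
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    // Emits each word, and for common words also a joined bigram that overlaps the same position.
    static List<Token> withBigrams(List<String> words, Set<String> common) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String w = words.get(i);
            out.add(new Token(w, 1));
            if (common.contains(w) && i + 1 < words.size()) {
                out.add(new Token(w + "-" + words.get(i + 1), 0));
            }
        }
        return out;
    }
}
```

With such tokens in the index, a phrase containing a common word could be looked up through the rarer joined term, while non-phrase queries could skip the common word entirely.
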
> >> This will happen automatically with PhraseQuery with a slop factor.
> >> The closer the words, the better the score.  However, with a pure
> >> boolean query, proximity is not considered at all (nor should it
> >> be).  You can use a large slop factor for phrases such as "quick
> >> fox"~100 and see how the scores work then.
> >>
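
To illustrate "the closer the words, the better the score": the sketch below mirrors the 1 / (distance + 1) weighting that, to my understanding, Lucene's default Similarity applies to sloppy phrase matches. It is plain Java, not a call into Lucene, and the distances used in the usage note are illustrative assumptions.

```java
public class SloppyFreqSketch {
    // Weight of one sloppy phrase match: exact matches (distance 0) count fully,
    // and the weight shrinks as the terms drift apart.
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }
}
```

Under this weighting an exact "quick fox" match contributes 1.0, while a match that needs one position of slop (as "quick brown fox" would) contributes 0.5.
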
> > This means that all words must be in the result. This is not always
> > the case in my application. If I am searching for "quick brown fox",
> > "quick fox" is an acceptable result.
> 
> In the case of single term queries boolean OR'd together, Similarity's
> coord factor boosts results that have more clauses overlapped.  This
> does not take proximity of the words into consideration.
> 
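
A rough sketch of the coord idea in plain Java, assuming the default overlap / maxOverlap definition and a boolean score that sums the matching clauses' scores before scaling. The class and the booleanScore helper are hypothetical names for illustration. Term positions never enter the calculation, which is why proximity has no effect here.

```java
public class CoordSketch {
    // Fraction of the query's clauses that the document matched.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // Sums the scores of the clauses that matched, then scales by coord.
    static float booleanScore(float[] matchedClauseScores, int totalClauses) {
        float sum = 0f;
        for (float s : matchedClauseScores) {
            sum += s;
        }
        return sum * coord(matchedClauseScores.length, totalClauses);
    }
}
```

For the query quick OR brown OR fox, a document containing only "quick" and "fox" gets a coord of 2/3 wherever those two words occur in the text.
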
> > I just need to know whether I need to resort the search results
> > according to my criteria, or there are some methods to override which
> > will bring results already sorted.
> 
> It seems like you're asking for a different type of Query than
> currently exists that can do a boolean OR but score based on
> proximity of the matching terms.   Without looking it up, perhaps
> SpanOrQuery already does this sort of thing - though I don't think so.
> 
>     Erik
> 
> 
> >
> >
> > On 7/22/05, Erik Hatcher <erik@ehatchersolutions.com> wrote:
> >
> >>
> >> On Jul 22, 2005, at 9:59 AM, Ahmed El-dawy wrote:
> >>
> >>
> >>> Hello,
> >>>   I am using lucene to search plain text, but the order of the
> >>> search
> >>> results is not satisfying to my needs. First, I want to know how the
> >>> similarity works. Then, I need to extend it.
> >>>
> >>
> >> Use IndexSearcher.explain() to see how each individual hit is scored
> >> against a Query - this will be the clearest way to see why things
> >> score the way they do.
> >>
> >>
> >>>   First, does the similarity class work on analyzed text or original
> >>> search text? To be precise, does it count the stop words as found
> >>> terms or not?
> >>>
> >>
> >> Only terms returned from the Analyzer are considered, so if a stop
> >> word is removed it does not count for tf or idf.
> >>
> >>
> >>>   Second, I want to add a factor of how relative are the terms of
> >>> the
> >>> query found in text. For example, when I search for "quick fox",
> >>> "fox
> >>> quick" and "quick brown fox" will be less ranked than "quick fox".
> >>>
> >>
> >> This will happen automatically with PhraseQuery with a slop factor.
> >> The closer the words, the better the score.  However, with a pure
> >> boolean query, proximity is not considered at all (nor should it
> >> be).  You can use a large slop factor for phrases such as "quick
> >> fox"~100 and see how the scores work then.
> >>
> >>     Erik
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>
> >>
> >>
> >
> >
> > --
> > Regards,
> > Ahmed Saad
> >
> >
> 
> 
> 
> 


-- 
Regards,
Ahmed Saad


