lucene-dev mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: search quality - assessment & improvements
Date Mon, 25 Jun 2007 12:24:41 GMT
Just to throw in a few things:

First off, this is great!

As I am sure you are aware: https://issues.apache.org/jira/browse/LUCENE-836

On Jun 25, 2007, at 3:15 AM, Doron Cohen wrote:

>
> hi, this could probably be split into two threads, but for context
> let's start it in a single discussion;
>
> Recently I was looking at the search quality of Lucene - Recall and
> Precision, focused on P@1,5,10,20 and, mainly, MAP.
>
> -- Part 1 --
>
> I found that quality can be enhanced by modifying the doc length
> normalization, and by changing the tf() computation to also consider
> the average tf() in a single document.
>
> For the first change, the logic is that Lucene's default length
> normalization punishes long documents too much. I found contrib's
> sweet-spot-similarity helpful here, but not enough. A better
> doc-length normalization method is one that considers collection
> statistics - e.g. average doc length. The tricky problem with such an
> approach is that you don't know the average length at indexing time,
> and it changes as the index evolves. The static nature of norms
> computation (and API) in Lucene is, while efficient, an obstacle for
> global computations. Another issue here is that applications often
> split documents into fields for reasons that are not "pure IR" - for
> instance, a content field and a title field, just to be able to boost
> the title by (say) 3 - but in fact there is no "IR'ish" difference
> between finding the searched text in the title field or in the body
> field; they really serve/answer the same information need. For that
> matter, I believe that using a single document length when searching
> all these fields is more "accurate".

Further complicated by apps that duplicate fields for things like
case-sensitive search, etc.  This is where having more field semantics
would be useful, a la Solr or some other mechanism.
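
To make the doc-length idea concrete, here is roughly the kind of
pivoted-normalization override I picture.  This is only a sketch - the
slope and the average length are invented knobs here, and as you say the
real obstacle is that the average is not known at index time and drifts
as the index grows:

  // Sketch only; assumes the 2.x-era lengthNorm(String, int) hook.
  import org.apache.lucene.search.DefaultSimilarity;

  public class PivotedLengthNormSimilarity extends DefaultSimilarity {

    private final float avgDocLength; // would have to come from collection stats
    private final float slope;        // 0 = ignore length entirely

    public PivotedLengthNormSimilarity(float avgDocLength, float slope) {
      this.avgDocLength = avgDocLength;
      this.slope = slope;
    }

    public float lengthNorm(String fieldName, int numTokens) {
      // Pivoted normalization: long docs are punished less harshly than 1/sqrt(len).
      float pivot = (1.0f - slope) + slope * (numTokens / avgDocLength);
      return 1.0f / pivot;
    }
  }

Passing in the whole-document token count instead of the per-field one
would give the single "document length" you describe, but the current
hook only ever sees one field at a time.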

Also, are you making these judgements based on TREC?


>
> The logic of the second change: assume two documents, doc1 containing
> 10 "A"'s, 10 "B"'s, and 10 "Z"'s, and doc2 containing "A" to "T" and
> 10 "Z"'s. Both doc1 and doc2 are of length 30. Searching for "Z", in
> both doc1 and doc2 tf("Z")=10. So, currently, doc1 and doc2 score the
> same for "Z", but the "truth" is that "Z" is much more
> representative/important in doc2 than it is in doc1, because its
> frequency in doc2 is 10 times that of all the other words in that
> doc, while in doc1 it is the same as that of the other words. If you
> agree about the potential improvement here, again, a tricky problem
> is that the current Similarity API does not even allow this info (the
> average term frequency in the specific document) to be considered,
> because Similarity.tf(int/float freq) takes only the frequency param.
> One way to open the way for such a computation is to add an "int
> docid" param to the Similarity class, but then the implementation of
> that class becomes IndexReader aware.
>
> Both modifications above have, in addition to API implications, also
> performance implications - mainly search performance - and I would
> like to get some feedback on what people think about going in this
> direction... first the "if", only then the "how"...

Perhaps revisiting Flexible Indexing is the way to go.  The trick
will be in how to write an API that supports the current way, but
also allows us to add new methods for these kinds of things.
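
To make the tf() idea concrete too, the docid-aware method could look
something like the sketch below.  None of this is the current API -
averageTermFreq() is an invented helper, and it is exactly the
IndexReader-aware part that does not fit today's Similarity:

  // Hypothetical signature - today's Similarity.tf(float freq) cannot see the doc.
  public float tf(float freq, int docId) {
    // tokens(doc) / uniqueTerms(doc); invented helper, needs IndexReader access.
    float avgTf = averageTermFreq(docId);
    // Scale the raw frequency by how far it sits above the doc's average.
    return (float) Math.sqrt(freq / avgTf);
  }

On your example: doc1 has avgTf = 30/3 = 10 while doc2 has avgTf =
30/21, about 1.43, so with tf("Z") = 10 in both, the freq/avgTf ratio
is 1.0 for doc1 versus roughly 7.0 for doc2 - "Z" now scores
noticeably higher in doc2, which is the behaviour you are arguing for.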

>
> -- Part 2 --
>
> It is very important that we be able to assess the search quality in
> a repeatable manner - so that anyone can repeat the quality tests,
> and maybe find ways to improve them. (This would also make it
> possible to verify the "improvement claims" above...) This capability
> seems like a natural part of the benchmark package. I started to look
> at extending the benchmark package with a search quality module that
> would open an index (or first create one), run a set of queries
> (similar to the performance benchmark), and compute and report the
> set of known statistics mentioned above and more. Such a module
> depends on input data - documents, queries, and judgements. And
> that's my second question. We don't have to invent this data - TREC
> has it already, and it grows wider every year as there are more
> judgements. So, theoretically, we could use TREC data. One problem
> here is that TREC data must be purchased. Not sure that this is a
> problem - it is OK if we provide the mechanism to use this data for
> those who have it (Universities, for one). The other problem is that
> it is not clear to me what one can legally say about a certain
> system's results on TREC data. I would like the Search Quality Web
> page of Lucene to say something like: "MAP of XYZ for Track Z of TREC
> 2004", and then a certain submitted patch to say "I improved to
> 1.09*XYZ". But would that be legal? I just re-read their "Agreement
> Concerning Dissemination of TREC Results" -
> http://trec.nist.gov/act_part/forms/noads.html - and I am not feeling
> smarter about this.

IANAL and I didn't read the link, but I think people publish their  
MAP scores, etc. all the time on TREC data.  I think it implies that  
you obtained the data through legal means.
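
On the quality module itself, the per-query statistics are small to
compute once the judgements are loaded - something along these lines
(rough sketch, names invented; MAP is then just the mean of
averagePrecision() over the query set):

  import java.util.List;
  import java.util.Set;

  public class QualityStats {

    // Fraction of the top k results that are judged relevant.
    public static double precisionAt(int k, List<String> results, Set<String> relevant) {
      int hits = 0;
      for (int i = 0; i < Math.min(k, results.size()); i++) {
        if (relevant.contains(results.get(i))) hits++;
      }
      return (double) hits / k;
    }

    // Average precision for one query's ranked results.
    public static double averagePrecision(List<String> results, Set<String> relevant) {
      if (relevant.isEmpty()) return 0.0;
      int hits = 0;
      double sum = 0.0;
      for (int i = 0; i < results.size(); i++) {
        if (relevant.contains(results.get(i))) {
          hits++;
          sum += (double) hits / (i + 1);  // precision at this recall point
        }
      }
      return sum / relevant.size();
    }
  }

The interesting work is all in the input plumbing - parsing TREC topics
and qrels and mapping them onto queries and stored doc ids.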

I agree about providing the mechanism to work with TREC.  I also have  
had a couple of other thoughts/opinions/alternatives (my own,  
personal opinion):

1. Create our own judgements on Wikipedia or the Reuters collection.   
This is no doubt hard and would require a fair number of volunteers  
and could/would compete at some level with TREC.  One advantage is  
the whole process would be open, whereas the TREC process is not.  It  
would be slow to develop, too, but could be highly useful to the  
whole IR community.  Perhaps we could make a case at SIGIR or  
something like that for the need for a truly open process.  Perhaps  
we could post on SIGIR list or something to gauge interest.  I don't  
really know if that is the proper place or not.  I have just recently  
subscribed to the mailing list, so I don't have a feel for the  
postings on that list.  Perhaps a new project?  Lucene Relevance,  
OpenTREC, FreeTREC?  Seriously, Nutch could use relevance judgments
for the "web track", Solr could use them for several tracks, and so
could Lucene J.  And I am sure there are a lot of other OS search
engines that would benefit.

2.  Petition NIST to make TREC data available to open source search  
projects.  Perhaps someone acting as an official part of ASF could  
submit a letter (I am willing to do so, I guess, given help drafting  
it) after it goes through legal, etc.  I'm thinking of something  
similar to what has been going on with the Open Letter to Sun  
concerning the Java implementation.  Perhaps simply asking would be  
enough to start a dialog on how it could be done.  We may have to  
come up w/ safeguards on downloads or something, I don't know.  I  
would bet the real issue with data is that it is copyrighted and we  
are paying to license it.  Perhaps we should start lobbying TREC to  
use non-copyrighted information.  Maybe if we got enough open source  
search libraries interested we could make some noise!  Maybe we could  
all go protest outside of the TREC conference!  Ha, ha, ha!  We would  
need a catchy chant, though.  And if anyone thinks I am serious about  
this last part, I am not.


Cheers,
Grant

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

