lucene-dev mailing list archives

From Doron Cohen <>
Subject Re: search quality - assessment & improvements
Date Mon, 25 Jun 2007 18:19:43 GMT
Hey Grant, thanks for your comments!

Grant Ingersoll wrote:

> As I am sure you are aware:
> LUCENE-836

I remembered you mentioning setting up our own doc/query judgment system
but forgot it was in LUCENE-836, thanks for the reminder.

> On Jun 25, 2007, at 3:15 AM, Doron Cohen wrote:

> > I found out that quality can be enhanced by modifying the doc length
> > normalization, and by changing the tf() computation to also
> > consider the
> > average tf() in a single document.
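(For anyone following along, here is a rough, self-contained sketch of the two
ideas above: a pivoted length normalization and a tf() damped by the document's
own average term frequency. The formula details and names here are my
illustration of the general technique, not the actual changes I measured:)

```java
// Illustrative sketch only -- not the actual patch. Shows the shape of
// two scoring tweaks: pivoted length normalization and average-tf damping.
public class SimilaritySketch {

    // Standard Lucene-style length norm: 1/sqrt(numTerms).
    static float defaultLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Pivoted variant: blend the document's length with the collection's
    // average length, so long documents are penalized less steeply.
    // slope = 1.0 reduces to the default behavior.
    static float pivotedLengthNorm(int numTerms, float avgNumTerms, float slope) {
        return (float) (1.0 / Math.sqrt((1 - slope) * avgNumTerms + slope * numTerms));
    }

    // tf damped by the document's average term frequency, so a term only
    // scores highly when it is frequent *relative to* this document.
    static float avgTfDampedTf(float freq, float avgTf) {
        return (float) (Math.sqrt(freq) / Math.sqrt(avgTf));
    }
}
```

(In Lucene itself this would be wired in by subclassing Similarity and
overriding its lengthNorm() and tf() hooks.)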

> Further complicated by apps that duplicate fields for things like
> case-sensitive search, etc.  This is where having more field
> semantics would be useful, à la Solr or some other mechanism.

I agree, there's no single magic solution for all apps and situations.

> Also, are you making these judgements based on TREC?

Yes, that's right. And again, TREC data does not reflect all there is in
the world, but I believe we can improve against that measure.

> > their "Agreement Concerning Dissemination of TREC Results" -
> > - and I am not feeling
> > smarter about this.

> IANAL and I didn't read the link, but I think people publish their
> MAP scores, etc. all the time on TREC data.  I think it implies that
> you obtained the data through legal means.

So you're saying that if person "X" got the TREC data legally, we can have
in our (say) benchmarks page, something like:
  (*) Person "X" reports the following TREC measures...
And anyone discussing his TREC results with Lucene on Lucene's mailing
lists does so under the tacit assumption that he obtained the TREC data
legally. Sounds practical to me, at least to start with.
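(As an aside, for anyone unfamiliar with the measure being published: MAP is
the mean over queries of average precision, which can be computed from a
ranked result list plus relevance judgments roughly as below. This is a
generic sketch, not tied to any TREC tooling:)

```java
// Generic sketch of average precision, the per-query building block of MAP.
// Nothing here is TREC-specific; the names are illustrative.
public class AveragePrecision {

    /**
     * relevantAtRank[i] is true if the i-th ranked result is relevant;
     * totalRelevant is the number of relevant docs judged for the query
     * (including any the engine failed to retrieve at all).
     */
    public static float averagePrecision(boolean[] relevantAtRank, int totalRelevant) {
        if (totalRelevant == 0) return 0f;
        float sum = 0f;
        int hits = 0;
        for (int i = 0; i < relevantAtRank.length; i++) {
            if (relevantAtRank[i]) {
                hits++;
                sum += (float) hits / (i + 1);  // precision at this rank
            }
        }
        return sum / totalRelevant;
    }
}
```

MAP is then just the arithmetic mean of this value over all queries in the run.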

> I agree about providing the mechanism to work with TREC.

Great, I will continue this in LUCENE-836.

> I also have
> had a couple of other thoughts/opinions/alternatives (my own,
> personal opinion):
> 1. Create our own judgements on Wikipedia or the Reuters collection.
> This is no doubt hard and would require a fair number of volunteers
> and could/would compete at some level with TREC.  One advantage is
> the whole process would be open, whereas the TREC process is not.  It
> would be slow to develop, too, but could be highly useful to the
> whole IR community.  Perhaps we could make a case at SIGIR or
> something like that for the need for a truly open process.  Perhaps
> we could post on SIGIR list or something to gauge interest.  I don't
> really know if that is the proper place or not.  I have just recently
> subscribed to the mailing list, so I don't have a feel for the
> postings on that list.  Perhaps a new project?  Lucene Relevance,
> OpenTREC, FreeTREC?  Seriously, Nutch could use relevance judgments
> for the "web track" and Solr could use it for several tracks, and
> Lucene J. as well.  And I am sure there are a lot of other OS search
> engines that would benefit.

This sounds very nice to me, but it would be a great deal of effort. I think
we should go down this path only as a last resort.

> 2.  Petition NIST to make TREC data available to open source search
> projects.  Perhaps someone acting as an official part of ASF could
> submit a letter (I am willing to do so, I guess, given help drafting
> it) after it goes through legal, etc.  I'm thinking of something
> similar to what has been going on with the Open Letter to Sun
> concerning the Java implementation.  Perhaps simply asking would be
> enough to start a dialog on how it could be done.  We may have to
> come up w/ safeguards on downloads or something, I don't know.  I
> would bet the real issue with data is that it is copyrighted and we
> are paying to license it.  Perhaps we should start lobbying TREC to
> use non-copyrighted information.

This would be my preference, allowing us to build on existing (and evolving)
TREC data (docs/queries/assessments).

