lucene-openrelevance-user mailing list archives

From Mark Bennett <>
Subject Lots to talk about! (was Re: Any Java/ASF qrels code yet?)
Date Wed, 23 Dec 2009 19:37:18 GMT
Hello Simon and Robert,

Robert, yes, I do have a private corpus and truth table.  At this point I
can't share it, though I'll ask my client at some point.

I did find some code in JIRA (your patches); I'm including the links here
"for the record":

Top languages for testing are English, Japanese, French and German.

I'm excited to have others to talk to!  I have some general comments /

1: Although the qrels data format was originally binary yes/no, apparently
there were more flexible dialects used in later years, that allowed for some
weighting.  Was there a particular dialect that y'all were considering?
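
To make the binary vs. weighted question concrete, here is a minimal sketch
of a reader for the four-column TREC-style qrels line (topic, iteration,
document number, relevance).  The sample lines and the function name
parse_qrels are made up for illustration; the only thing that changes between
a binary and a graded dialect in this sketch is the range of the last column.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse TREC-style qrels lines: topic, iteration, docno, relevance.

    Returns {topic: {docno: relevance}} with relevance kept as an int,
    so both binary (0/1) and graded (0, 1, 2, ...) files are accepted.
    """
    judgments = defaultdict(dict)
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        topic, _iteration, docno, rel = line.split()
        judgments[topic][docno] = int(rel)
    return dict(judgments)


# Hypothetical sample data, not taken from any real collection.
sample = [
    "301 0 DOC-A 1",
    "301 0 DOC-B 0",
    "302 0 DOC-A 2",  # a graded judgment
]
qrels = parse_qrels(sample)
```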

2: CAN WE use the TREC qrels format(s)?

I believe TREC has various restrictions on the use of test results, source
content and evaluation code (annoying since TREC is supposed to foster
research and NIST is paid for by US tax dollars, but that's another whole
rant)  But do we think the file format is "open" or "closed" ?

3: I do favor an optional "analog" scale.  Do you agree?

Our assertions are on an A-F scale, I can elaborate if you're interested.  A
floating point scale is more precise perhaps, but we have human graders, and
explaining letter grades that approximate academic rankings was less confusing,
plus we were already using numbers in two other aspects of the grading form.
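
As a rough illustration of how letter grades could coexist with a numeric or
binary scale, here is a small sketch.  The A=4 ... F=0 mapping and the "C or
better counts as relevant" cutoff are assumptions made for this example, not
the scale we actually use.

```python
# Hypothetical mapping from letter grades to numeric gains.
GRADE_TO_GAIN = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def to_gain(grade):
    """Convert a letter grade to a numeric gain value."""
    return GRADE_TO_GAIN[grade.strip().upper()]


def to_binary(grade, passing="C"):
    """Collapse a letter grade to binary relevance (1 = relevant)."""
    return 1 if to_gain(grade) >= to_gain(passing) else 0
```

Graders keep working in letters, and the tooling decides at evaluation time
whether to use the graded or the collapsed value.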

4: Generally do you guys favor a simple file format (one line per record),
or an XML format?

TREC was born in the early 90's I guess, so is record oriented, and probably
more efficient.  We have our tests in an XML format, which though more
verbose, affords a lot more flexibility including comments and optionally
self-contained content.  It also sidesteps most encoding problems, since XML
defaults to UTF-8 and declares any other encoding explicitly.  I've found that
"text files" from other countries tend to come in numerous encodings.
And Excel, which is often used for CSV and other delimited files, sadly does
NOT do UTF-8 in CSV files.
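
For comparison with the one-line-per-record style, here is a sketch of what
reading an XML judgments file could look like.  The element and attribute
names (judgments, search, doc, grade) are invented for this example and are
not our actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the encoding is declared in the file itself.
XML_DOC = """<?xml version="1.0" encoding="UTF-8"?>
<judgments>
  <!-- comments are allowed, unlike in a bare record format -->
  <search id="s1" text="caf\u00e9 hours">
    <doc id="d1" grade="A"/>
    <doc id="d2" grade="F"/>
  </search>
</judgments>
"""


def load_judgments(xml_bytes):
    """Return {search_id: {"text": query_text, "grades": {doc_id: grade}}}."""
    root = ET.fromstring(xml_bytes)
    result = {}
    for search in root.findall("search"):
        grades = {d.get("id"): d.get("grade") for d in search.findall("doc")}
        result[search.get("id")] = {"text": search.get("text"), "grades": grades}
    return result


# Parse from bytes so the parser honors the declared encoding.
judgments = load_judgments(XML_DOC.encode("utf-8"))
```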

5: How much do you value interoperability with Excel?

It's VERY handy for non-techies, and the .xlsx format is a set of zipped XML
files, so perhaps acceptably "open".  I would not propose .xlsx as the
standard format, but it'd be nice to inter-operate with it.  We'd need some
type of template.

6: "quiescent" vs. "dynamic" strategies

Content: During in-house testing it's sometimes been hard to maintain a
static set of content.  You can have a separate system, but I suspect in
some scenarios it won't be feasible to lock down the content.  See item 10
below.  My suggestion is to mix this into the thinking.  Some researchers
wouldn't accept the variables it adds, but for others if it's a choice
between imperfect checks and no checks at all, they'll take the imperfect.

Grades / Evaluations: It's VERY hard to get folks to grade an MxN matrix.  I
had a matrix of just 57 x 25 (> 1,400 spots) and, trust me, it's hard to do
in one sitting.  It'd be nice to handle spotty (partial) evaluations.
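
One possible way to cope with a partly filled matrix is to score only the
documents that were actually graded and report how much of the result list
that covers.  This is a sketch under that assumption; the function name and
the choice to ignore ungraded documents (rather than count them as
irrelevant) are mine, and other treatments are equally defensible.

```python
def judged_precision_at_k(ranked, grades, k=10):
    """Precision over the judged documents in the top k.

    ranked: list of doc ids in result order.
    grades: {doc_id: bool} for graded docs only; ungraded docs are absent.
    Returns (precision, coverage); precision is None if nothing was judged.
    """
    top = ranked[:k]
    judged = [d for d in top if d in grades]
    if not judged:
        return None, 0.0
    relevant = sum(1 for d in judged if grades[d])
    return relevant / len(judged), len(judged) / len(top)


# Hypothetical data: only d1 and d3 were graded.
precision, coverage = judged_precision_at_k(
    ["d1", "d2", "d3", "d4"], {"d1": True, "d3": False}, k=4
)
```

A low coverage value is a warning that the precision figure rests on very few
grades.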

7: fuzzy evaluations vs. "unit testing"

Given the variabilities (covered in other points), it'd be nice to come up
with fuzzier assertions.

* "Doc1 is more relevant to Search1 than Doc2"
* "I'd like to see at least 3 of these docs in the top 10 matches for this
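
The two kinds of assertion above could be checked mechanically against a
ranked result list.  A minimal sketch, with made-up document ids and function
names; treating a missing document as "ranked last" is an assumption.

```python
def ranks_above(ranked, better, worse):
    """True if `better` appears before `worse` in the ranked list.

    A document missing from the list is treated as ranked last.
    """
    pos = {doc: i for i, doc in enumerate(ranked)}
    last = len(ranked)
    return pos.get(better, last) < pos.get(worse, last)


def at_least_n_in_top_k(ranked, wanted, n, k):
    """True if at least n of the wanted docs appear in the top k results."""
    return len(set(ranked[:k]) & set(wanted)) >= n


# Hypothetical result list for one search.
results = ["d7", "d1", "d9", "d2", "d4"]
ok_order = ranks_above(results, "d1", "d2")
ok_top = at_least_n_in_top_k(results, {"d1", "d2", "d3"}, n=2, k=4)
```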

8: URLs as keys (optional, handy in some contexts)
Various technical issues here, just wanted to bring it up.

9: Ideas for an "(e)valuation console" / "crowd sourcing"

There are several ways to present searches and answers to users in a
somewhat reasonable way, to make it a bit easier / fun for them to make
assertions.  Lots of ways to go here, but we'd need some UI resources.

10: "academic" vs. "real-world" focus, can't we serve both!?

Some areas of search R&D aren't applicable to real world / commercial
usage.  TREC is a perfect example of this.  Some open source licenses also
prevent commercial participation.  And I can imagine some testing standards
that, while very well thought out and thorough, might be impractical to
actually use.

I really think we can serve both groups, and will get better results for our

11: Task appropriateness

There are different tasks that folks might want to use Relevancy Testing
Tools for:
* Engine A vs. Engine B
* Configuration A vs. Configuration B (same engine)
* "normal variable" vs. "acceptable" vs. "unacceptable"

We should keep these different use cases in mind.

12: Clusters and Relevancy grading:  Do you agree with the following

If you manually cluster documents by subject, then presumably using one of
those documents (perhaps a shorter one) as the search, you'd expect it to
generally find the other documents in that cluster, presumably at a higher
relevance than documents from other clusters.

*If* this were true, it suggests some automated testing methods.
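
One automated check along these lines: use each document as a search and
measure how many of its cluster-mates come back near the top.  This is only a
sketch; the `search` callable, the cluster contents and the canned results are
placeholders for whatever engine and corpus are under test.

```python
def cluster_recall_at_k(search, clusters, k=10):
    """For each seed doc, the fraction of its cluster-mates found in the top k.

    search:   callable taking a doc id and returning a ranked list of doc ids.
    clusters: {cluster_name: [doc_id, ...]} from manual clustering.
    Returns {(cluster_name, seed_doc): score in [0, 1]}.
    """
    scores = {}
    for name, doc_ids in clusters.items():
        for seed in doc_ids:
            mates = set(doc_ids) - {seed}
            if not mates:
                continue  # single-doc cluster, nothing to find
            # Drop the seed itself from the results before cutting at k.
            hits = [d for d in search(seed) if d != seed][:k]
            scores[(name, seed)] = len(mates & set(hits)) / min(len(mates), k)
    return scores


# Hypothetical clusters and canned search results standing in for an engine.
clusters = {"windows": ["w1", "w2", "w3"], "office": ["o1", "o2"]}
fake_results = {
    "w1": ["w1", "w2", "o1", "w3"],
    "w2": ["w2", "w1", "w3", "o1"],
    "w3": ["w3", "o2", "o1", "w1"],
    "o1": ["o1", "o2", "w1"],
    "o2": ["o2", "w2", "w3", "o1"],
}
scores = cluster_recall_at_k(lambda doc: fake_results.get(doc, []), clusters, k=2)
```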

I'll say that I don't think it's entirely true, but I think it's one
technique to keep in mind, for some use cases.  This by itself is probably a
long discussion.

13: Problems with measurements...

Just listing some of the stuff I've been worried about:
* Individual opinion drift (me before coffee on test # 10 vs. me after 8
cups of coffee on test # 500; if I go back to test 10, would I grade it the
* Tester variance (how closely would 2 coworkers grade the same search
against the same docset)
* Language drift - if I translate both the questions and searches into
French, then have a French speaker evaluate the results, how close should I
expect them to be?
* Ordering drift - I've seen this myself, your mind and habits can change as
you go through many tests, also sorted vs. unsorted data
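
For the tester variance item, one standard way to put a number on it is
Cohen's kappa, which compares the observed agreement between two graders with
the agreement expected by chance.  A minimal sketch, with made-up grades; I'm
not claiming this is the right statistic for every case above.

```python
from collections import Counter


def cohen_kappa(grades_a, grades_b):
    """Cohen's kappa for two graders' labels on the same items."""
    if len(grades_a) != len(grades_b) or not grades_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(grades_a)
    # Fraction of items where the two graders gave the same label.
    observed = sum(a == b for a, b in zip(grades_a, grades_b)) / n
    count_a, count_b = Counter(grades_a), Counter(grades_b)
    # Chance agreement from each grader's own label frequencies.
    expected = sum(count_a[g] * count_b[g] for g in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both graders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical grades from two coworkers on the same four search/doc pairs.
kappa = cohen_kappa(["A", "B", "A", "F"], ["A", "B", "B", "F"])
```

The same function could be pointed at one person's grades from two different
sittings to get a rough handle on individual opinion drift.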

And if using clustering as part of your testing:
* Cluster drifts - cluster started out as "Windows", but as docs are added
becomes more about "Windows installation and drivers", etc
* Cluster spans - a small cluster might be about Microsoft Office
applications, but one particular document is about ease of use of PowerPoint
2007 vs. another about problems installing Office on a Mac; but the other 3 docs
in the cluster gradually span this gamut
* Cluster split / merge - Windows cluster is now about "Windows installation"
vs. "Windows applications"

14: Long tail / variability / Poisson issues

Sample A is 1,000 test docs and 100 searches from a particular web site.
Sample B is another 1,000 test docs and 100 searches (non-overlapping) from
the same web site, over the same period of time
Sample C is 10,000 test docs and 1,000 searches (also not overlapping, and
from the same time-frame / site)

You'll see very common themes in all sets of docs and searches, that will
clearly overlap.

However, the long tail quickly gets into 1 item samples, and you'll find the
2 tails do not overlap.  This has something to do with testing... but is
probably a long subject for another day.

And since sample C is 10x samples A and B, what variance can be explained
simply due to that fact?  For example, is the tail 10x longer, or maybe
sqrt(10) longer?  etc.
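
To look at the head vs. tail overlap between two samples, something like the
following could be a starting point.  The "seen at least twice" cutoff for the
head and the use of Jaccard overlap are arbitrary choices for this sketch, and
the sample searches are invented.

```python
from collections import Counter


def head_and_tail(queries, min_count=2):
    """Split distinct queries into a head (seen >= min_count times) and a tail."""
    counts = Counter(queries)
    head = {q for q, c in counts.items() if c >= min_count}
    tail = set(counts) - head
    return head, tail


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical search logs for two non-overlapping samples from one site.
sample_a = ["install", "install", "drivers", "drivers", "mac fonts"]
sample_b = ["install", "install", "drivers", "drivers", "pptx crash"]
head_a, tail_a = head_and_tail(sample_a)
head_b, tail_b = head_and_tail(sample_b)
head_overlap = jaccard(head_a, head_b)
tail_overlap = jaccard(tail_a, tail_b)
```

Running the same split on samples of different sizes would give at least an
empirical answer to how the tail length scales.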

15: Participation in ORP

As some of you know we're active in, plus our newsletter, plus
we work with Lucid on webinars sometimes.  So there's a bunch of ways we
could publicize this group, when we're ready.

Any of you Bay Area?

And should we "take on TREC" ?

Mark Bennett / New Idea Engineering, Inc. /
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

On Tue, Dec 22, 2009 at 3:45 PM, Robert Muir <> wrote:

> Hello,
> I wasn't sure what you were asking, do you already have your own
> corpus/queries/judgements and are looking for something similar to
> trec_eval?
> If so, the lucene-java benchmark 'quality' package can already take queries
> and judgements files, and run the experiment against a lucene index, and
> finally produce "similar" summary output.
> On Tue, Dec 22, 2009 at 5:01 PM, Mark Bennett <> wrote:
>> It's a small world!  I downloaded the TREC C code today, but noticed the
>> non-commercial use copyright.
>> I went to go find some open source qrels code, and found you guys already
>> hard at work.
>> So.... where's the code?  :-)
>> --
>> Mark Bennett / New Idea Engineering, Inc. /
>> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
> --
> Robert Muir
