lucene-openrelevance-user mailing list archives

From Robert Muir <>
Subject Re: Lots to talk about! (was Re: Any Java/ASF qrels code yet?)
Date Wed, 23 Dec 2009 20:22:08 GMT
Hello, I have inserted some quick comments below (not to all your questions,
only what I know and can quickly answer). Glad to see your excitement about
this project!

On Wed, Dec 23, 2009 at 2:37 PM, Mark Bennett <> wrote:

> Hello Simon and Robert,
> Robert, yes, I do have a private corpus and truth table.  At this point I
> can't share it, though I'll ask my client at some point.
> I did find some code in JIRA, your patches, including the links here "for
> the record":
> Top languages for testing are English, Japanese, French and German.

The ones we have here are unfortunately not those languages, but are instead
chosen because you can download them off the internet... so as far as
languages go, I personally like to see variety, but I think "we will take
whatever we can get" is the way it is right now :)

I think it's probably hard to get people interested without English
available, even if we had really good collections in a bunch of other
languages.

> I'm excited to have others to talk to!  I have some general comments /
> questions:
> 1: Although the qrels data format was originally binary yes/no, apparently
> there were more flexible dialects used in later years, that allowed for some
> weighting.  Was there a particular dialect that y'all were considering?

Whatever we can get :) What we have currently is just a quick hack to take
some existing collections and reformat them to the way that lucene's
benchmark package likes them. The idea is to have something very basic that
actually works for now.

In the future we should improve this.

> 2: CAN WE use the TREC qrels format(s)?
> I believe TREC has various restrictions on the use of test results, source
> content and evaluation code (annoying since TREC is supposed to foster
> research and NIST is paid for by US tax dollars, but that's another whole
> rant)  But do we think the file format is "open" or "closed" ?

What is TREC Qrel format? The tab-delimited text file format? I don't think
TREC can claim any copyright on tab-delimited text, or we have more serious
problems on our hands. Right now we are just putting it into the format that
lucene benchmark's package works with already.

Others have brought up using better formats for this stuff like XML, which
would be nice, of course but then we have to have something to actually do
stuff with this data (either modify lucene benchmark's quality pkg or make
our own that possibly works with solr, lucy, and other projects too)
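
As a rough illustration of how little is involved in the plain-text format, here is a minimal Python sketch that reads qrels-style lines. It assumes the commonly used four-column layout (topic, iteration, document id, relevance) separated by tabs or spaces; the function name and the sample ids are made up for the example, and this is not the lucene benchmark code.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse qrels-style lines into {topic: {doc_id: relevance}}.

    Assumes four whitespace-separated columns:
    topic, iteration (ignored here), document id, relevance.
    """
    judgments = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        topic, _iteration, doc_id, relevance = line.split()
        judgments[topic][doc_id] = int(relevance)
    return judgments


# Hypothetical sample data, for illustration only.
sample = [
    "301\t0\tDOC-A\t1",
    "301\t0\tDOC-B\t0",
    "302 0 DOC-A 2",
]
qrels = parse_qrels(sample)
print(dict(qrels))
```

Because the relevance column is read as an integer, the same reader copes with binary and weighted judgements.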

> 3: I do favor an optional "analog" scale.  Do you agree?
> Our assertions are on an A-F scale, I can elaborate if you're interested.
> A floating point scale is more precise perhaps, but we have human graders,
> and explaining letter grades that approximate academic rankings was less
> confusing, plus we were already using numbers in two other aspects of the
> grading form.

I think we should look at being able to use both binary and analog
assessments? Make use of whatever we have, whatever we can get, the best we
can.
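
One way to keep both kinds of assessment usable is to store the graded value and derive a binary one on demand. The Python sketch below assumes a hypothetical A-F to 4-0 mapping and a hypothetical cut-off at "C"; neither comes from this thread or from any existing tool.

```python
# Hypothetical mapping from letter grades to numeric gain (an assumption).
GRADE_TO_GAIN = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def to_gain(grade):
    """Convert a letter grade to a numeric ("analog") relevance value."""
    return GRADE_TO_GAIN[grade.upper()]


def to_binary(grade, threshold="C"):
    """Collapse a letter grade to relevant / not relevant.

    A grade counts as relevant when it is at least as good as `threshold`.
    """
    return to_gain(grade) >= to_gain(threshold)


for grade in "ABCDF":
    print(grade, to_gain(grade), to_binary(grade))
```

Tools that only understand yes/no judgements can use `to_binary`, while anything that supports weighting can use `to_gain` directly.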

> 4: Generally do you guys favor a simple file format (one line per record),
> or an XML format?
> TREC was born in the early 90's I guess, so is record oriented, and
> probably more efficient.  We have our tests in an XML format, which though
> more verbose, affords a lot more flexibility including comments and
> optionally self-contained content.  It also sidesteps encoding as XML is
> UTF-8.  I've found that "text files" from other countries tend to be
> numerous encodings.  And Excel, which is often used for CSV and other
> delimited files, sadly does NOT do UTF-8 in CSV files.

The way it works now, we reformat it to whatever we need. I think even if
the current 'output' is some tab-delimited thing, XML is actually better as
input even with the minimal stuff we have now because it makes this
reformatting job easier (the parsing is simple).
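
To show what "the parsing is simple" could look like, here is a small Python sketch that turns an XML judgements file into tab-delimited lines. The element and attribute names (`judgments`, `query`, `doc`, `id`, `grade`) are invented for this example and are not an agreed format; the constant "0" fills the iteration column of the usual qrels layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout, for illustration only.
xml_text = """<judgments>
  <query id="q1">
    <doc id="DOC-A" grade="1"/>
    <doc id="DOC-B" grade="0"/>
  </query>
</judgments>"""


def xml_to_qrels_lines(text):
    """Convert the hypothetical XML above to 'topic<TAB>0<TAB>doc<TAB>grade' lines."""
    root = ET.fromstring(text)
    lines = []
    for query in root.iter("query"):
        for doc in query.iter("doc"):
            fields = [query.get("id"), "0", doc.get("id"), doc.get("grade")]
            lines.append("\t".join(fields))
    return lines


for line in xml_to_qrels_lines(xml_text):
    print(line)
```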

> 5: How highly do you value interoperability with Excel?
> It's VERY handy for non-techies, and the xlsX format is a set of zipped XML
> files, so perhaps acceptably "open".  I would not propose .xlsx as the
> standard format, but it'd be nice to inter-operate with it.  We'd need some
> type of template.

Can you elaborate? Do you mean produce some excel reports or something at
the end of the day? What would be handy to be in excel?

> 6: "quiescent" vs. "dynamic" strategies
> Content: During in house testing it's sometimes been hard to maintain a
> static set of content.  You can have a separate system, but I suspect in
> some scenarios it won't be feasible to lock down the content.  See item 10
> below.  My suggestion is to mix this into the thinking.  Some researchers
> wouldn't accept the variables it adds, but for others if it's a choice
> between imperfect checks and no checks at all, they'll take the imperfect.
> Grades / Evaluations: It's VERY hard to get folks to grade an MxN matrix.
> I had a matrix of just 57 x 25 (> 1,400 spots) and, trust me, it's hard to
> do in one sitting.  It'd be nice to handle spotty valuations.

For now, what we have is minimal and only works with static content: the
corpus is static and the benchmark pkg creates an index and runs the eval. I
think dynamic content is interesting but more complex to do... can you
elaborate on more ideas here?

> 7: fuzzy evaluations vs. "unit testing"
> Given the variabilities (covered in other points), it'd be nice to come up
> with fuzzier assertions.
> Examples:
> * "Doc1 is more relevant to Search1 than Doc2"
> * "I'd like to see at least 3 of these docs in the top 10 matches for this
> search"

I am not really sure how these would work for large-scale search, I worry
that it wouldn't be relevance testing but very specific tuning. In practice
this is the kind of thing where you would just manually fudge these things
anyway... or am I reading you wrong?
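
For concreteness, the two example assertions from item 7 could be checked against a ranked result list roughly as follows. This is only a Python sketch: the function names, the document ids, and the choice to treat a missing document as ranked last are all assumptions made for illustration.

```python
def ranks_higher(ranking, doc_a, doc_b):
    """True if doc_a appears before doc_b in the ranked list.

    A document missing from the list is treated as ranked last (an assumption).
    """
    def position(doc):
        return ranking.index(doc) if doc in ranking else len(ranking)

    return position(doc_a) < position(doc_b)


def at_least_k_in_top_n(ranking, wanted, k, n):
    """True if at least k of the wanted documents are in the top n results."""
    return len(set(ranking[:n]) & set(wanted)) >= k


# Hypothetical ranked results for one search.
ranking = ["d3", "d1", "d7", "d2"]
print(ranks_higher(ranking, "d1", "d2"))
print(at_least_k_in_top_n(ranking, {"d1", "d2", "d9"}, k=2, n=4))
```

Checks like these say nothing about overall relevance quality; they only confirm specific expectations for specific searches, which is the tuning concern raised above.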

> 8: URLs as keys (optional, handy in some contexts)
> Various technical issues here, just wanted to bring it up.
> 9: Ideas for an "(e)valuation console" / "crowd sourcing"
> There are several ways to present searches and answers to users in a
> somewhat reasonable way, to make it a bit easier / fun for them to make
> assertions.  Lots of ways to go here, but we'd need some UI resources.

I've been thinking about this some lately too, at least we could start with
apache lucene-related mail archives or something. I know this would be a
weird collection with very specific bias (all the code and everything) but
it would still allow us to start creating some kind of framework to
crowdsource judgements...

> 10: "academic" vs. "real-world" focus, can't we serve both!?
> Some areas of search R&D aren't applicable to real world / commercial
> usage.  TREC is a perfect example of this.  Some open source licenses also
> prevent commercial participation.  And I can imagine some testing standards
> that, while very well thought out and thorough, might be impractical to
> actually use.
> I really think we can serve both groups, and will get better results for
> our efforts.

I think for starters, our primary focus should be to support improvements of
apache lucene-related projects. Then we can expand later to other things
(ultimately as an open source project it would be best if it had wide use, I
think).

> 11: Task appropriateness
> There are different tasks that folks might want to use Relevancy Testing
> Tools for:
> * Engine A vs. Engine B
> * Configuration A vs. Configuration B (same engine)
> * "normal variable" vs. "acceptable" vs. "unacceptable"
> Etc.
> We should keep these different use cases in mind.

Yeah this is one thing where we need to do more work. Currently we only
output reformatted things to use a single engine (lucene benchmark package,
although you could write code for something else to use this stuff too). As
far as changing configuration, right now again you are stuck with whatever
configuration params can be changed inside lucene benchmark pkg... at index
time this is whatever you can do in a .alg file, for example.

> 12: Clusters and Relevancy grading:  Do you agree with the following
> assertion:
> If you manually cluster documents by subject, then presumably using one of
> those documents (perhaps a shorter one), you'd expect it to generally find
> the other documents in that cluster, presumably at a higher relevance than
> other documents from other clusters.
> *If* this were true, it suggests some automated testing methods.
> I'll say that I don't think it's entirely true, but I think it's one
> technique to keep in mind, for some use cases.  This by itself is probably a
> long discussion.

> 13: Problems with measurements...
> Just listing some of the stuff I've been worried about:
> * Individual opinion drift (me before coffee on test # 10 vs. me after 8
> cups of coffee on test # 500; if I go back to test 10, would I grade it the
> same)
> * Tester variance (how closely would 2 coworkers grade the same search
> against the same docset)

more queries, more judges is the solution to this in my opinion.
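
With more than one judge, tester variance can also be measured rather than guessed at. One standard statistic for two judges is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The Python sketch below is a plain implementation of that formula; the two judgement lists are made-up data, and nothing here is part of ORP.

```python
from collections import Counter


def cohens_kappa(judge_a, judge_b):
    """Cohen's kappa for two judges labelling the same items in the same order."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge_a)
    # Fraction of items on which the two judges gave the same label.
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Agreement expected by chance, from each judge's label frequencies.
    counts_a = Counter(judge_a)
    counts_b = Counter(judge_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical binary judgements from two coworkers on the same 8 documents.
judge_1 = [1, 1, 0, 0, 1, 0, 1, 1]
judge_2 = [1, 0, 0, 0, 1, 0, 1, 1]
print(cohens_kappa(judge_1, judge_2))
```

A value near 1 means the judges agree far more often than chance would predict; a value near 0 means their agreement is about what chance alone would give.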

> * Language drift - if I translate both the questions and searches into
> French, then have a French speaker evaluate the results, how close should I
> expect them to be?

what do you mean, where the corpus is already multilingual (i.e. parallel
text) ?

> * Ordering drift - I've seen this myself, your mind and habits can change as
> you go through many tests, also sorted vs. unsorted data
> And if using clustering as part of your testing:
> * Cluster drifts - cluster started out as "Windows", but as docs are added
> becomes more about "Windows installation and drivers", etc
> * Cluster spans - a small cluster might be about Microsoft Office
> applications, but one particular document is about ease of use of Powerpoint
> 2007 vs. another about problems installing Office on a Mac; but other 3 docs
> in the cluster gradually span this gamut
> * Cluster split / merge - Windows cluster is now about "Windows installation"
> vs. "Windows applications"
> 14: Long tail / variability / Poisson issues
> Sample A is 1,000 test docs and 100 searches from a particular web site.
> Sample B is another 1,000 test docs and 100 searches (non-overlapping) from
> the same web site, over the same period of time
> Sample C is 10,000 test docs and 1,000 searches (also not overlapping, and
> from the same time-frame / site)
> You'll see very common themes in all sets of docs and searches, that will
> clearly overlap.
> However, the long tail quickly gets into 1 item samples, and you'll find
> the 2 tails do not overlap.  This has something to do with testing... but is
> probably a long subject for another day.
> And since sample C is 10x samples A and B, what variance can be explained
> simply due to that fact?  For example, is the tail 10x longer, or maybe
> sqrt(10) longer?  etc.
> 15: Participation in ORP
> As some of you know we're active in, plus our newsletter,
> plus we work with Lucid on webinars sometimes.  So there's a bunch of ways
> we could publicize this group, when we're ready.
> Any of you Bay Area?

right now, but usually not :) As far as participation goes, if you have
things to contribute feel free to jump on in at openrelevance-dev, we could
use the help!! We have barely even started so feel free to wreak havoc on
whatever is there already, etc etc. It seems like you have lots of good
ideas.

> And should we "take on TREC" ?

I think what we are doing is different. It's not about taking anyone on, it's
about having something open available, even if right now it's not yet as
good, it's better than nothing.

> --
> Mark Bennett / New Idea Engineering, Inc. /
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
> On Tue, Dec 22, 2009 at 3:45 PM, Robert Muir <> wrote:
>> Hello,
>> I wasn't sure what you were asking, do you already have your own
>> corpus/queries/judgements and are looking for something similar to
>> trec_eval?
>> If so, the lucene-java benchmark 'quality' package can already take
>> queries and judgements files, and run the experiment against a lucene index,
>> and finally produce "similar" summary output.
>> On Tue, Dec 22, 2009 at 5:01 PM, Mark Bennett <> wrote:
>>> It's a small world!  I downloaded the TREC C code today, but noticed the
>>> non-commercial use copyright.
>>> I went to go find some open source qrels code, and found you guys already
>>> hard at work.
>>> So.... where's the code?  :-)
>>> --
>>> Mark Bennett / New Idea Engineering, Inc. /
>>> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>> --
>> Robert Muir

Robert Muir
