From: Grant Ingersoll
Subject: Re: Lots to talk about! (was Re: Any Java/ASF qrels code yet?)
Date: Wed, 23 Dec 2009 15:55:16 -0500
To: openrelevance-user@lucene.apache.org

On Dec 23, 2009, at 2:37 PM, Mark Bennett wrote:

Hello Simon and Robert,

Robert, yes, I do have a private corpus and truth table.  At this point I can't share it, though I'll ask my client at some point.

I did find some code in JIRA (your patches), including the links here "for the record":
https://issues.apache.org/jira/browse/ORP-1
https://issues.apache.org/jira/browse/ORP-2

Top languages for testing are English, Japanese, French and German.

I'm excited to have others to talk to!  I have some general comments / questions:

1: Although the qrels data format was originally binary yes/no, apparently there were more flexible dialects used in later years that allowed for some weighting.  Was there a particular dialect that y'all were considering?

I think it would be nice to have both binary and something like: relevant, somewhat relevant, not relevant, embarrassing; or a scale of 1-5 or 1-10, depending on how hardcore you want to be.


2: CAN WE use the TREC qrels format(s)?

I believe TREC has various restrictions on the use of test results, source content and evaluation code (annoying since TREC is supposed to foster research and NIST is paid for by US tax dollars, but that's another whole rant).  But do we think the file format is "open" or "closed"?

We should be able to use the format.  I think the only thing closed about TREC is the need to pay a small sum for the collection, but that isn't NIST's fault.
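
For concreteness: a TREC-style qrels file is just four whitespace-separated fields per line (topic, iteration, docno, judgment), and the graded dialects simply allow judgment values beyond 0/1.  A minimal Java parsing sketch (the class and field names are mine, not from the ORP patches):

import java.util.ArrayList;
import java.util.List;

// One TREC-style qrels line: "topic iter docno rel", e.g. "51 0 FT911-1234 2".
// Binary dialects use rel 0/1; graded dialects just use larger values.
record Qrel(String topic, String docno, int rel) {}

class QrelsParser {
    static List<Qrel> parse(List<String> lines) {
        List<Qrel> out = new ArrayList<>();
        for (String line : lines) {
            String[] f = line.trim().split("\\s+");
            if (f.length != 4) continue;       // skip blank/malformed lines
            // f[1] is the "iteration" field, historically always 0 and ignored
            out.add(new Qrel(f[0], f[2], Integer.parseInt(f[3])));
        }
        return out;
    }
}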


3: I do favor an optional "analog" scale.  Do you agree?

Our assertions are on an A-F scale; I can elaborate if you're interested.  A floating point scale is perhaps more precise, but we have human graders, and explaining letter grades that approximate academic rankings was less confusing; plus we were already using numbers in two other aspects of the grading form.
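
For what it's worth, an A-F scale folds neatly into a graded qrels value; a hypothetical mapping (not our actual rubric):

class GradeScale {
    // Hypothetical A-F to graded-relevance mapping; the numeric values are
    // illustrative, not a standard.
    static int gradeToRel(char grade) {
        return switch (Character.toUpperCase(grade)) {
            case 'A' -> 4;   // highly relevant
            case 'B' -> 3;
            case 'C' -> 2;
            case 'D' -> 1;   // marginal
            default  -> 0;   // F / not relevant
        };
    }
}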

4: Generally do you guys favor a simple file format (one line per record), or an XML format?

TREC was born in the early 90's I guess, so it is record-oriented and probably more efficient.  We have our tests in an XML format which, though more verbose, affords a lot more flexibility, including comments and optionally self-contained content.  It also sidesteps encoding issues, as XML defaults to UTF-8.  I've found that "text files" from other countries tend to come in numerous encodings.  And Excel, which is often used for CSV and other delimited files, sadly does NOT do UTF-8 in CSV files.

Pretty wide open at this point
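
To make the XML option concrete, a single judgment record might look something like this (a purely hypothetical layout, not a proposed schema):

<!-- hypothetical judgment record; element and attribute names are illustrative -->
<judgment topic="51" docno="FT911-1234" rel="2">
  <!-- comments and grader notes are what a line-oriented format can't carry -->
  <note>Grader wavered between 2 and 3; the doc is long.</note>
</judgment>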


5: How much do you value interoperability with Excel?

It's VERY handy for non-techies, and the .xlsx format is a set of zipped XML files, so it's perhaps acceptably "open".  I would not propose .xlsx as the standard format, but it'd be nice to interoperate with it.  We'd need some type of template.

That would be great.
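
One well-known workaround for the Excel/UTF-8 complaint above: Excel will read a UTF-8 CSV correctly if the file starts with a byte-order mark.  A Java sketch (the BOM trick is real; everything else is simplified, e.g. no quoting of embedded commas):

import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class CsvForExcel {
    // Excel only auto-detects UTF-8 in .csv files that begin with a BOM (EF BB BF).
    static void write(Path path, List<String[]> rows) throws IOException {
        try (Writer w = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
            w.write('\uFEFF');                  // the UTF-8 byte-order mark
            for (String[] row : rows) {
                w.write(String.join(",", row)); // naive: assumes no commas in fields
                w.write("\r\n");                // Excel prefers CRLF line endings
            }
        }
    }
}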


6: "quiescent" vs. "dynamic" strategies

Content: During in-house testing it's sometimes been hard to maintain a static set of content.  You can have a separate system, but I suspect in some scenarios it won't be feasible to lock down the content.  See item 10 below.  My suggestion is to mix this into the thinking.  Some researchers wouldn't accept the variables it adds, but for others, if it's a choice between imperfect checks and no checks at all, they'll take the imperfect.

Grades / Evaluations: It's VERY hard to get folks to grade an MxN matrix.  I had a matrix of just 57 x 25 (> 1,400 spots) and, trust me, it's hard to do in one sitting.  It'd be nice to handle spotty evaluations.

I think it's pretty important to be able to reproduce experiments across users/machines/etc. which means the content needs to be versioned.  This is the one big issue I have w/ simply pointing at other data sets.  Ultimately, we will need our own collection that we can version.
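
On versioning: even before we have our own collection, a cheap way to pin down exactly which content a run used is a checksum manifest; two sites can diff manifests to confirm byte-identical content.  A sketch (assuming the collection is a directory of plain files):

import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.List;
import java.util.stream.Stream;

class CollectionManifest {
    // Print "sha256  relative-path" for every file in the collection, in a
    // stable order, so manifests from different machines can be diffed.
    static void print(Path dir) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        List<Path> files;
        try (Stream<Path> s = Files.walk(dir)) {
            files = s.filter(Files::isRegularFile).sorted().toList();
        }
        for (Path p : files) {
            md.reset();
            String hex = HexFormat.of().formatHex(md.digest(Files.readAllBytes(p)));
            System.out.println(hex + "  " + dir.relativize(p));
        }
    }
}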


7: fuzzy evaluations vs. "unit testing"

Given the variabilities (covered in other points), it'd be nice to come up with fuzzier assertions.

Examples:
* "Doc1 is more relevant to Search1 than Doc2"
* "I'd like to see at least 3 of these docs in the top 10 matches for this search"

Nice to have, but likely further down the road.  However, the door is wide open at this point, so scratch that itch!
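
Both of those example assertions are mechanical once you have an engine's ranked result list; a Java sketch (class and method names are mine):

import java.util.List;
import java.util.Set;

class FuzzyAsserts {
    // "Doc1 is more relevant to Search1 than Doc2": doc1 must rank above doc2
    // in the results list (a missing doc counts as ranking below everything).
    static boolean ranksAbove(List<String> results, String doc1, String doc2) {
        int i1 = results.indexOf(doc1), i2 = results.indexOf(doc2);
        return i1 >= 0 && (i2 < 0 || i1 < i2);
    }

    // "I'd like to see at least minHits of these docs in the top k matches."
    static boolean atLeastInTopK(List<String> results, Set<String> docs,
                                 int k, int minHits) {
        return results.stream().limit(k).filter(docs::contains).count() >= minHits;
    }
}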



8: URLs as keys (optional, handy in some contexts)
Various technical issues here, just wanted to bring it up.

9: Ideas for an "(e)valuation console" / "crowd sourcing"

There are several ways to present searches and answers to users in a somewhat reasonable way, to make it a bit easier / fun for them to make assertions.  Lots of ways to go here, but we'd need some UI resources.

Yep, this has been kicked around and would be quite nice.


10: "academic" vs. "real-world" focus, can't we serve both!?

Some areas of search R&D aren't applicable to real world / commercial usage.  TREC is a perfect example of this.  Some open source licenses also prevent commercial participation.  And I can imagine some testing standards that, while very well thought out and thorough, might be impractical to actually use.

It's an open source project at Apache, so anybody who is fine w/ the ASL can participate, meaning both academics and commercial companies.  Frankly, I don't care much about p@1000, but p@5 and p@10 are quite interesting, so I tend to be more real-world focused, but a healthy cross-fertilization will be great.
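
For anyone following along, the metric itself is tiny: p@k is just the fraction of the top k results that were judged relevant.  A sketch (treating "graded value above some cutoff" as relevant is an assumption, not a rule):

import java.util.List;
import java.util.Set;

class Metrics {
    // Precision at k: fraction of the top-k ranked doc ids that are in the
    // relevant set.  3 relevant docs in the top 5 gives p@5 = 3/5 = 0.6.
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        long hits = ranked.stream().limit(k).filter(relevant::contains).count();
        return (double) hits / k;
    }
}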


I really think we can serve both groups, and will get better results for our efforts.

11: Task appropriateness

There are different tasks that folks might want to use Relevancy Testing Tools for:
* Engine A vs. Engine B
* Configuration A vs. Configuration B (same engine)
* "normal variable" vs. "acceptable" vs. "unacceptable"
Etc.

We should keep these different use cases in mind.

Indeed, not to preclude others, but I know I'm focused on how to use it for Lucene/Mahout etc.  In other words, the latter two in the list.  If other vendors want to participate, that is great too.  All are welcome.  Still, it's pretty hard to really do Engine A vs. Engine B tests in a fair way.



12: Clusters and Relevancy grading:  Do you agree with the following assertion:

If you manually cluster documents by subject, then using one of those documents as a query (perhaps a shorter one), you'd expect it to generally find the other documents in that cluster, presumably at a higher relevance than documents from other clusters.

*If* this were true, it suggests some automated testing methods.

I'll say that I don't think it's entirely true, but I think it's one technique to keep in mind, for some use cases.  This by itself is probably a long discussion.

Maybe.  Many clustering algorithms calculate distance much the same way that the engine scores, so it may just be a case of self-fulfilling prophecy.
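
If anyone wants to test item 12 empirically, the check is mechanical: query with each cluster member and see how many of its siblings rank above the first outsider.  A wholly hypothetical harness:

import java.util.List;
import java.util.Set;

class ClusterProbe {
    // For one probe document, count how many cluster siblings appear in the
    // ranked results before the first document from outside the cluster.
    static int siblingsBeforeOutsider(List<String> ranked, Set<String> cluster,
                                      String probe) {
        int siblings = 0;
        for (String doc : ranked) {
            if (doc.equals(probe)) continue;   // skip the query doc itself
            if (!cluster.contains(doc)) break; // first outsider ends the run
            siblings++;
        }
        return siblings;
    }
}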


13: Problems with measurements...

Just listing some of the stuff I've been worried about:
* Individual opinion drift (me before coffee on test #10 vs. me after 8 cups of coffee on test #500; if I go back to test #10, would I grade it the same?)
* Tester variance (how closely would 2 coworkers grade the same search against the same docset?  see the kappa sketch below)
* Language drift - if I translate both the questions and searches into French, then have a French speaker evaluate the results, how close should I expect them to be?
* Ordering drift - I've seen this myself; your mind and habits can change as you go through many tests, also sorted vs. unsorted data

And if using clustering as part of your testing:
* Cluster drift - a cluster started out as "Windows", but as docs are added it becomes more about "Windows installation and drivers", etc.
* Cluster spans - a small cluster might be about Microsoft Office applications, but one particular document is about ease of use of PowerPoint 2007 vs. another about problems installing Office on a Mac; the other 3 docs in the cluster gradually span this gamut
* Cluster split / merge - the Windows cluster is now about "Windows installation" vs. "Windows applications"

Reproducibility is paramount.  One of the biggest issues w/ these types of evaluations is the problem of managing the output of the tests and keeping track of them.  I imagine we'll develop tools for that, too.
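
On the tester-variance bullet above: Cohen's kappa is the standard way to put a number on how much two graders agree beyond chance (1.0 = perfect, 0 = chance-level).  A sketch, assuming each grader's labels are single chars like the A-F grades:

import java.util.HashMap;
import java.util.Map;

class Agreement {
    // Cohen's kappa for two graders who labeled the same n items.
    // a[i] and b[i] are the two graders' labels for item i.
    static double cohensKappa(char[] a, char[] b) {
        int n = a.length;
        Map<Character, Integer> ma = new HashMap<>(), mb = new HashMap<>();
        int agree = 0;
        for (int i = 0; i < n; i++) {
            if (a[i] == b[i]) agree++;
            ma.merge(a[i], 1, Integer::sum);
            mb.merge(b[i], 1, Integer::sum);
        }
        double po = (double) agree / n;   // observed agreement
        double pe = 0.0;                  // agreement expected by chance
        for (Map.Entry<Character, Integer> e : ma.entrySet()) {
            pe += (e.getValue() / (double) n)
                * (mb.getOrDefault(e.getKey(), 0) / (double) n);
        }
        return (po - pe) / (1 - pe);      // degenerate if pe == 1 (one label only)
    }
}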



14: Long tail / variability / Poisson issues

Sample A is 1,000 test docs and 100 searches from a particular web site.
Sample B is another 1,000 test docs and 100 searches (non-overlapping) from the same web site, over the same period of time.
Sample C is 10,000 test docs and 1,000 searches (also not overlapping, and from the same time-frame / site).

You'll see very common themes in all sets of docs and searches, and those will clearly overlap.

However, the long tail quickly gets into 1-item samples, and you'll find the 2 tails do not overlap.  This has something to do with testing... but is probably a long subject for another day.

And since sample C is 10x samples A and B, what variance can be explained simply due to that fact?  For example, is the tail 10x longer, or maybe sqrt(10) longer?  etc.
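
On the 10x question: if the items follow a heavy-tailed (Zipf/Pareto-style) distribution, distinct-item counts grow sub-linearly, and for a tail like p(rank) ~ rank^-2 the growth works out to roughly sqrt(n), i.e. the 10x sample should show about sqrt(10) ~ 3.2x the distinct items, not 10x.  A quick simulation sketch (the particular tail shape is an assumption):

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

class TailGrowth {
    // Draw n items from a crude heavy-tailed source (rank ~ 1/u gives a
    // p(rank) ~ rank^-2 tail) and count the distinct items seen.
    static int distinct(int n, long seed) {
        Random rnd = new Random(seed);
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            double u = rnd.nextDouble();
            int rank = (int) Math.min(1_000_000, Math.ceil(1.0 / (u + 1e-12)));
            seen.add(rank);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Expect the second count to be ~sqrt(10) ~ 3.2x the first, not 10x.
        System.out.println("distinct after  1,000 draws: " + distinct(1_000, 42));
        System.out.println("distinct after 10,000 draws: " + distinct(10_000, 42));
    }
}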

15: Participation in ORP

As some of you know, we're active in SearchDev.org, plus our newsletter, plus we work with Lucid on webinars sometimes.  So there are a bunch of ways we could publicize this group, when we're ready.

Sure, the more the merrier.  Getting the word out is important, as is setting expectations on what they will find once they arrive.


Any of you Bay Area?

Sometimes


And should we "take on TREC"?

See other response.


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
