Date: Mon, 11 May 2009 16:01:02 -0400
Subject: Re: Open Relevance Project?
From: Michael McCandless
To: general@lucene.apache.org

I'd love to see a resource like this (it's high time!), and I'll try to help when/where I can, starting with some initial comments/questions:

I think it's actually quite a challenge to do well. E.g. it's easy to make a corpus that's too easy because it's highly diverse (and thus most search engines have no trouble pulling back relevant results). Instead, I think the content set should be well & tightly scoped to a certain topic, and not necessarily that large (i.e. we don't need a huge number of documents).

It would help if that scoping is towards content that many people find "of interest", so we get "accurate" judgements from as wide an audience as possible. E.g. how about appropriately licensed coverage of the 2009 H1N1 outbreak? Or... the 2008 US presidential election? Or... research on leukemia (but I fear such content is not typically licensed appropriately, nor will it have wide interest).

What does "using Nutch to crawl Creative Commons" actually mean? Can I browse the content that's being crawled?

Also, to help us build up the relevance judgements, I think we should build a basic custom app for collecting queries as well as annotating them. I should be able to go to that page and run my own queries, which are collected. Then, I should be able to browse previously collected queries, click on them, and add my own judgement. The site should try to offer up queries that are "in need" of judgements. It should run the search and let me step through the results, marking those that are relevant; but that would bias the judgements toward whichever search engine produced the results; maybe under the hood we rotate through search engines each time?

Do we have anyone involved who's built similar corpora before? Or has anyone read papers on how prior corpora were designed/created?

Mike

On Mon, May 11, 2009 at 12:07 PM, Grant Ingersoll wrote:
> A few of us who are interested in an Open Relevance assessment project (a la
> TREC) have started to put some thoughts down on "paper" over at
> http://wiki.apache.org/lucene-java/OpenRelevance
>
> Thus, if you'd like to somehow participate (TBD what that actually means
> just yet) in developing a set of open collections, queries and assessments
> for relevance testing, let's discuss here and on that Wiki page.
>
> The basic gist of it is, we'd like to crawl Creative Commons and/or other
> free content, redistribute it along with queries and judgments, thus fueling
> the testing capabilities to further improve Lucene's search quality as well
> as, of course, providing the means for a completely open assessment process
> whereby anyone can participate without having to fork up money to license 20
> year old copyrighted news articles that are of no other value whatsoever
> other than testing.
>
> At this point, we're open to a lot of ideas. Once we solidify a bit, then
> we'd like to make it an official Lucene subproject and get our own resources
> as well as figure out how to crawl and host the content using ASF
> infrastructure (without making the ASF infra. team upset!)
>
> Cheers,
> Grant
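
To make the judgement-collection idea from the reply a bit more concrete, here is a minimal sketch of how such an app might track queries, surface the ones "in need" of judgements, and rotate through search engines. Nothing here comes from the thread: the class name `JudgmentPool`, its methods, and the engine names are all hypothetical, and "in need" is assumed to mean simply "has the fewest judgements so far".

```python
from collections import defaultdict
from itertools import cycle


class JudgmentPool:
    """Hypothetical in-memory store for queries and relevance judgements."""

    def __init__(self, engine_names):
        # Round-robin over engines so that no single engine's ranking
        # decides which documents assessors get to see (an assumption
        # about how "rotate through search engines" could work).
        self._engines = cycle(engine_names)
        # query -> list of (doc_id, relevant, assessor) tuples
        self._judgments = defaultdict(list)

    def record_query(self, query):
        """Remember a query a visitor ran, even before it has judgements."""
        self._judgments[query]  # touching the key creates an empty list

    def add_judgment(self, query, doc_id, relevant, assessor):
        """Store one assessor's relevant / not-relevant call for a document."""
        self._judgments[query].append((doc_id, bool(relevant), assessor))

    def queries_in_need(self, limit=5):
        """Return the queries with the fewest judgements, neediest first."""
        ranked = sorted(self._judgments, key=lambda q: len(self._judgments[q]))
        return ranked[:limit]

    def next_engine(self):
        """Pick the engine that serves the next assessment session."""
        return next(self._engines)


if __name__ == "__main__":
    pool = JudgmentPool(["lucene", "engine_b"])
    pool.record_query("h1n1 symptoms")
    pool.record_query("h1n1 vaccine")
    pool.add_judgment("h1n1 symptoms", "doc1", True, "assessor_1")
    print(pool.queries_in_need(1))
    print(pool.next_engine(), pool.next_engine(), pool.next_engine())
```

A real deployment would need persistent storage and some way to reconcile conflicting judgements from different assessors; the sketch only shows the bookkeeping.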