From: Grant Ingersoll
To: general@lucene.apache.org
Subject: Re: Open Relevance Project?
Date: Wed, 13 May 2009 11:56:57 -0400

So, I suppose the next steps are to formalize this project a little
more. I'll call a vote on a separate thread to add it as a Lucene sub.
I figured I would contact infrastructure to see what they think. Was
also thinking that maybe we should talk with iBiblio or some other
content repository to see if they can help overcome the bandwidth
problem.

-Grant

On May 11, 2009, at 6:12 PM, Grant Ingersoll wrote:

>
> On May 11, 2009, at 4:01 PM, Michael McCandless wrote:
>
>> I'd love to see a resource like this (it's high time!), and I'll try
>> to help when/where I can, starting with some initial
>> comments/questions:
>>
>> I think it's actually quite a challenge to do well. EG it's easy to
>> make a corpus that's too easy because it's highly diverse (and thus
>> most search engines have no trouble pulling back relevant results).
>> Instead, I think the content set should be well & tightly scoped to a
>> certain topic, and not necessarily that large (ie we don't need a
>> huge number of documents). It would help if that scoping is towards
>> content that many people find "of interest" so we get "accurate"
>> judgements by as wide an audience as possible.
>
> I think we will want a generic one, and then focused ones, but we
> should start with generic at first.
>
>>
>> EG how about coverage of the 2009 H1N1 outbreak (that's licensed
>> appropriately)? Or... the 2008 US presidential election? Or...
>> research on Leukemia (but I fear such content is not typically
>> licensed appropriately, nor will it have wide interest).
>>
>> What does "using Nutch to crawl Creative Commons" actually mean? Can
>> I browse the content that's being crawled?
>
> Nutch has a CC plugin that allows it to filter out non-CC content,
> AIUI.
>
>>
>> Also, to help us build up the relevance judgements, I think we should
>> build a basic custom app for collecting queries as well as annotating
>> them. I should be able to go to that page and run my own queries,
>> which are collected. Then, I should be able to browse previously
>> collected queries, click on them, and add my own judgement. The site
>> should try to offer up queries that are "in need" of judgements. It
>> should run the search and let me step through the results, marking
>> those that are relevant; but we would then bias the results to that
>> search engine; maybe under the hood we rotate through search engines
>> each time?
>>
>> Do we have anyone involved who's built similar corpora before? Or has
>> anyone read papers on how prior corpora were designed/created?
>
> This is all good, but here I'm thinking simpler, at least at first.
> I don't know that we need to be writing apps, although feel free,
> since it is O/S after all. :-) I was wondering if we couldn't
> handle this wiki style (how is still not clear) whereby we simply
> have pages that contain the queries and judgments and over time the
> wisdom of the crowds will work to maintain standards, fill in gaps,
> etc. Maybe, in regards to judgments, we allow people to vote for
> them, which over time will yield an appropriate result (but is
> subject to early issues). Not sure what all that means just yet,
> but the wiki approach allows us to get going with minimal resources
> while still delivering value. Hmm, now it's starting to sound like
> an app... ;-)
>
> As opposed to TREC style stuff, I don't think we need the top 1000
> (although it could work). Just the top ten or twenty. Sometimes,
> it can even be useful to just rate a whole page of results at once,
> even at the cost of granularity.
> Basically, what I'm proposing we do is carry out a pragmatic
> relevance test out in the open, just as people should do in house.
> I think this fits with Lucene's model of operation quite well: be
> practical by focusing on real data and real feedback as opposed to
> obsessing over theory. (Not that you were suggesting otherwise, I'm
> just stating it)
>
> I need to find the reference, but I recall the last edition of SIGIR
> having a discussion on crowdsourcing relevance judgments.
>
> -Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search