lucene-openrelevance-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Re. Lots to talk about, ORP
Date Thu, 24 Dec 2009 11:46:22 GMT

On Dec 23, 2009, at 7:27 PM, Robert Muir wrote:

> Hey Mark, just some quick replies below.
> 
> On Wed, Dec 23, 2009 at 5:49 PM, Mark Bennett <mbennett@ideaeng.com> wrote:
> 
> Are you guys on board with this?  There were comments like "First and foremost, this
> project is a way for Lucene to talk about relevance in a standard way..." and "I think for
> starters, our primary focus should be to support improvements of apache lucene-related projects.
> Then we can expand later..."
> 
> 
> I should reword this, as Grant said... scratch your own itch. If you want to help support
> another search engine, fantastic! I did some very minimal work so lucene-java could run relevance
> tests, so that was my itch. But please don't let this discourage you from supporting search
> engine XYZ.

+1.  This is how open source works.  The overall goal of the project is to be able to judge
relevance for a search engine in an open way.  I personally won't be building any tools for
things other than Lucene/Solr/Mahout (yes, I think we can use these same corpora for machine
learning too!), but there's no reason others can't.  We'll just need to properly structure
things in the SVN for the various code points.

> 
> If we push that too hard, we'll scare away folks from other communities.  I agree that
> people should each scratch their worst itch; I think it's in part a question of positioning.
> Solr and Nutch are very heavily associated with Lucene, which is understandable.  But virtually
> every client we work with has multiple engines, so we have a bit of a different itch, I guess.
> 
> We welcome any patches to support these additional search engines... I mean, we can't
> even run tests against things like Solr yet... (which would also be cool)

I hope to have some time on that, but others should jump in too.

> 
> 3: Multiple languages are good, even though some of the early content has been selected
> more because it was available.  English might be a strategic language to get covered early.
> I'd really like to see a parallel set of test documents and searches in multiple languages;
> that's what my client is having to build.
> 
> +1.  It would be great to consider using parallel text to make it easier to support many
> languages, although it might require us to have some different search domains (but I think
> this is ok?).

We could do CLIR (cross-language information retrieval).  I love it.  Going back to my roots!


> 
> We have limited resources here, so I think this kind of thing is interesting. English
> has been done time and time again, and while I still think it's important, what if we can build
> a multilingual relevance corpus with only small additional effort? Yes, I realize this kind
> of approach probably wouldn't be as accurate as building individual collections for each language,
> but it's probably very close.
> 
> Can we consider something like http://www.statmt.org/europarl/ ???
> More open parallel corpora available here: http://urd.let.rug.nl/tiedeman/OPUS/
> 
> I mentioned europarl, even though it has fewer languages than, say, http://langtech.jrc.it/JRC-Acquis.html,
> because the note at the bottom is especially interesting: "We are not aware of any copyright
> restrictions of the material."
> 
> If there is no problem with this, I'd like to help. Supporting more languages is my itch.
> -- 
> Robert Muir
> rcmuir@gmail.com

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search

