lucene-openrelevance-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Re. Lots to talk about, ORP
Date Thu, 24 Dec 2009 00:27:20 GMT
Hey Mark, just some quick replies below.

On Wed, Dec 23, 2009 at 5:49 PM, Mark Bennett <mbennett@ideaeng.com> wrote:

>
> Are you guys on board with this?  There were comments like "First and
> foremost, this project is a way for Lucene to talk about relevance in a
> standard way..." and "I think for starters, our primary focus should be to
> support improvements of apache lucene-related projects. Then we can expand
> later... "
>
>
I should reword this: as Grant said, scratch your own itch. If you want
to help support another search engine, fantastic! I did some very minimal
work so lucene-java could run relevance tests, so that was my itch. But
please don't let this discourage you from supporting search engine XYZ.

>
> If we push that too hard, we'll scare away folks from other communities.  I
> agree that people should each scratch their worst itch, I think it's in part
> a question of positioning.  Solr and Nutch are very heavily associated with
> Lucene, which is understandable.  But virtually every client we work with
> has multiple engines, so we have a bit of a different itch I guess.
>

We welcome any patches to support these additional search engines. I mean,
we can't even run tests against things like Solr yet (which would also
be cool).

>
> 3: Multiple languages are good, even though some of the early content has
> been selected more because it was available.  English might be a strategic
> language to get covered early.  I'd really like to see a parallel set of
> test documents and searches in multiple languages; that's what my client is
> having to build.
>

+1.  It would be great to consider using parallel text to make it easier to
support many languages, although it might require us to have some different
search domains (but I think this is ok?).

We have limited resources here, so I think this kind of thing is
interesting. English has been done time and time again, and while I still
think it's important, what if we can build a multilingual relevance corpus
with only small additional effort? Yes, I realize this kind of approach
probably wouldn't be as accurate as building individual collections for each
language, but it's probably very close.

Can we consider something like http://www.statmt.org/europarl/ ?
More open parallel corpora are available here:
http://urd.let.rug.nl/tiedeman/OPUS/

I mentioned Europarl, even though it has fewer languages than, say,
http://langtech.jrc.it/JRC-Acquis.html, because the note at the bottom is
especially interesting: "We are not aware of any copyright restrictions of
the material."

If there is no problem with this, I'd like to help. Supporting more
languages is my itch.
-- 
Robert Muir
rcmuir@gmail.com
