lucene-openrelevance-user mailing list archives

From Robert Muir <>
Subject Re: Re. Lots to talk about, ORP
Date Thu, 24 Dec 2009 00:27:20 GMT
Hey Mark, just some quick replies below.

On Wed, Dec 23, 2009 at 5:49 PM, Mark Bennett <> wrote:

> Are you guys on board with this?  There were comments like "First and
> foremost, this project is a way for Lucene to talk about relevance in a
> standard way..." and "I think for starters, our primary focus should be to
> support improvements of apache lucene-related projects. Then we can expand
> later... "
I should reword this: as Grant said, scratch your own itch. If you want
to help support another search engine, fantastic! I did some very minimal
work so lucene-java could run relevance tests, so that was my itch. But
please don't let this discourage you from supporting search engine XYZ.

> If we push that too hard, we'll scare away folks from other communities.  I
> agree that people should each scratch their worst itch, I think it's in part
> a question of positioning.  Solr and Nutch are very heavily associated with
> Lucene, which is understandable.  But virtually every client we work with
> has multiple engines, so we have a bit of a different itch I guess.

We welcome any patches to support these additional search engines. I mean,
we can't even run tests against things like Solr yet (which would also
be cool).

> 3: Multiple languages are good, even though some of the early content has
> been selected more because it was available.  English might be a strategic
> language to get covered early.  I'd really like to see a parallel set of
> test documents and searches in multiple languages; that's what my client is
> having to build.

+1.  It would be great to consider using parallel text to make it easier to
support many languages, although it might require us to have some different
search domains (but I think this is ok?).

We have limited resources here, so I think this kind of thing is
interesting. English has been done time and time again, and while I still
think it's important, what if we can build a multilingual relevance corpus
with only small additional effort? Yes, I realize this kind of approach
probably wouldn't be as accurate as building individual collections for each
language, but it's probably very close.

Can we consider something like ???
More open parallel corpora available here:

I mentioned europarl, even though it has fewer languages than say,  because especially interesting is
the note at the bottom: We are not aware of any copyright restrictions of
the material.

If there is no problem with this, I'd like to help. Supporting more
languages is my itch.
Robert Muir
