lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erick Erickson <erickerick...@gmail.com>
Subject Re: Federated relevance ranking
Date Thu, 02 Jun 2011 23:05:49 GMT
As you've found out, raw scores certainly aren't comparable across
different indexes
#unless# the documents are fairly distributed. You're talking large
indexes here,
so if the documents are balanced across all your indexes, the results should be
pretty comparable. This pre-supposes that the indexes share a common schema
and that the distributions of terms are "close enough to identical" to be truly
comparable. And it supposes that your indexes are similar in
character. It wouldn't
work if one of your indexes had, say, meta-data from videos and another had
scholarly journal articles.

Otherwise, there's work going on in Solr that might help, although I
don't know when
that'll be available.

Other than that, I don't know what to suggest. It's not an easy
problem or Solr/Lucene
would already have solved it.. siiiggggh.

Best
Erick

On Thu, Jun 2, 2011 at 3:51 PM, Clint Gilbert
<clint_gilbert@hms.harvard.edu> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi everyone,
>
> I searched the list archives, but couldn't find a question that closely
> matches mine.
>
> The project I'm working on is designed to allow searching a distributed
> collection of data repositories.  Currently, we index each repository to
> build a central Lucene index.  This works ok, but for practical (the
> central index is getting very large) and architectural (decentralization
> is a design goal) reasons, we'd like to distribute the index.
>
> In the past, we had basic federation system in place: when a user
> submitted a query, the query was broadcast to each data repository,
> which had its own independent Lucene index.  Results from each repo were
> aggregated in reverse order.
>
> The problem was, of course, that since each index was constructed
> independently of all the others, and documents are distributed in the
> repos unevenly, it was impossible to rank the results from all the
> indices in a meaningful way.  We basically punted and interleaved
> results, which didn't gave a bad user experience, hence the temporary
> switch to a central index.
>
> So, what options exist for searching distributed collections of Lucene
> indices and ranking results meaningfully?
>
> Katta seems promising, but I don't know enough about it yet.  It also
> seems to want to open its own ports for RPC.  I'd prefer something that
> could tunnel over HTTP to minimize firewall drama.  (We will have 10s
> and then 100s of data repos running in separate locations.)
>
> We're also considering a home-grown scheme involving normalizing the
> denominators of all the index components in all our indices, based on
> the sums of counts obtained from all the indices.  This feels like
> re-inventing the wheel, and it's not clear to me yet that the low-level
> manipulation of indices that we'd need to do is even possible.
>
> Any suggestions for distributing indices while ranking results well are
> very welcome!
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.10 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAk3n6bsACgkQ5IyIbnMUeTsOFACeM2lsWKXguf8XYUFdDbYtmzc1
> Qd8Anjx670zjQ7KYjnxXVQXuR+CBjxCs
> =Jnkt
> -----END PGP SIGNATURE-----
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message