lucene-java-user mailing list archives

From Clint Gilbert <clint_gilb...@hms.harvard.edu>
Subject Federated relevance ranking
Date Thu, 02 Jun 2011 19:51:24 GMT

Hi everyone,

I searched the list archives, but couldn't find a question that closely
matches mine.

The project I'm working on is designed to allow searching a distributed
collection of data repositories.  Currently, we index each repository to
build a central Lucene index.  This works ok, but for practical (the
central index is getting very large) and architectural (decentralization
is a design goal) reasons, we'd like to distribute the index.

In the past, we had a basic federation system in place: when a user
submitted a query, the query was broadcast to each data repository,
which had its own independent Lucene index.  Results from each repo were
aggregated in reverse order.

The problem was, of course, that since each index was constructed
independently of all the others, and documents are distributed in the
repos unevenly, it was impossible to rank the results from all the
indices in a meaningful way.  We basically punted and interleaved
results, which gave a bad user experience, hence the temporary
switch to a central index.

So, what options exist for searching distributed collections of Lucene
indices and ranking results meaningfully?

Katta seems promising, but I don't know enough about it yet.  It also
seems to want to open its own ports for RPC.  I'd prefer something that
could tunnel over HTTP to minimize firewall drama.  (We will have 10s
and then 100s of data repos running in separate locations.)

We're also considering a home-grown scheme involving normalizing the
denominators of all the index components in all our indices, based on
the sums of counts obtained from all the indices.  This feels like
re-inventing the wheel, and it's not clear to me yet that the low-level
manipulation of indices that we'd need to do is even possible.
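
To make the home-grown idea concrete, here is a minimal sketch of the
statistics side only. It assumes each repo can report its document count
and per-term document frequencies, and it assumes the classic
idf = 1 + ln(numDocs / (docFreq + 1)) formula, which as far as I know is
what Lucene's default similarity uses. The RepoStats class, the method
names, and the numbers are all hypothetical; this does not touch any
Lucene API and does not show how to feed the global values back into
scoring.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class GlobalIdfSketch {

    /** Hypothetical per-repo report: document count plus per-term document frequencies. */
    static final class RepoStats {
        final int numDocs;
        final Map<String, Integer> docFreq;

        RepoStats(int numDocs, Map<String, Integer> docFreq) {
            this.numDocs = numDocs;
            this.docFreq = docFreq;
        }
    }

    /** Sum of document counts over all repos. */
    static int totalDocs(List<RepoStats> repos) {
        int total = 0;
        for (RepoStats r : repos) {
            total += r.numDocs;
        }
        return total;
    }

    /** Sum of one term's document frequency over all repos (0 where the term is absent). */
    static int totalDocFreq(List<RepoStats> repos, String term) {
        int total = 0;
        for (RepoStats r : repos) {
            Integer df = r.docFreq.get(term);
            if (df != null) {
                total += df;
            }
        }
        return total;
    }

    /** Classic TF-IDF style idf; assumed here, not read from any index. */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        // Made-up numbers: the same term is rare in repo A and common in repo B.
        List<RepoStats> repos = new ArrayList<RepoStats>();
        repos.add(new RepoStats(1000, Collections.singletonMap("kinase", 9)));
        repos.add(new RepoStats(100, Collections.singletonMap("kinase", 49)));

        // Local idf values differ, so scores from the two repos are not comparable.
        for (RepoStats r : repos) {
            System.out.println("local idf:  " + idf(r.docFreq.get("kinase"), r.numDocs));
        }

        // One idf computed from summed counts, to be shared by every repo.
        int n = totalDocs(repos);
        int df = totalDocFreq(repos, "kinase");
        System.out.println("global idf: " + idf(df, n));
    }
}
```

If every repo scored with the shared value instead of its local one,
scores for the same query should at least be on the same scale, and the
aggregator could merge results by score rather than interleaving them.
Each repo would still need a way to accept the global statistics at
query time, which is the part I am unsure about.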

Any suggestions for distributing indices while ranking results well are
very welcome!