lucene-dev mailing list archives

From Otis Gospodnetic <>
Subject Re: Gdata - Indexing feeds and entries
Date Thu, 20 Jul 2006 05:06:27 GMT
Hi Simon,

I'm not sure if I already replied to this or not.
Here are some thoughts.
Distributed indexing:
- you could take the Solr approach and have a Master indexing server that periodically takes
snapshots and tells Slave servers "hey, come get the new stuff".  The problem is that the
Master is the single point of failure.
- you could take a similar replication approach with DRBD (or some such)
- you could accept new entries in one place but delegate the indexing to multiple instances
of the GData server in parallel
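
The first option above (a Master that takes periodic snapshots which Slaves then fetch) can be sketched roughly as follows. This is only a toy illustration in plain Python, not Solr's actual replication mechanism: the `Master` and `Slave` classes, their method names, and the dict standing in for a Lucene index are all hypothetical.

```python
import copy


class Master:
    """Accepts writes and publishes frozen, versioned snapshots (hypothetical)."""

    def __init__(self):
        self._live = {}      # entry id -> text; stand-in for the index being written
        self._snapshot = {}  # last published snapshot
        self.version = 0     # version number of the last published snapshot

    def add(self, entry_id, text):
        self._live[entry_id] = text

    def delete(self, entry_id):
        self._live.pop(entry_id, None)

    def take_snapshot(self):
        # Freeze the current state so slaves never see a half-written index.
        self._snapshot = copy.deepcopy(self._live)
        self.version += 1
        return self.version

    def fetch(self):
        # Hand out a copy so a slave cannot modify the master's snapshot.
        return self.version, copy.deepcopy(self._snapshot)


class Slave:
    """Serves searches from the last snapshot it pulled (hypothetical)."""

    def __init__(self, master):
        self._master = master
        self._index = {}
        self.version = 0

    def pull(self):
        # Fetch only if the master has published something newer.
        if self._master.version == self.version:
            return False
        self.version, self._index = self._master.fetch()
        return True

    def search(self, term):
        term = term.lower()
        return sorted(
            entry_id
            for entry_id, text in self._index.items()
            if term in text.lower().split()
        )
```

Note that in this sketch a slave keeps answering from its old snapshot until `pull()` succeeds, so searches lag behind writes by up to one snapshot interval, and all writes still depend on the single master.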

As for searching, you could simply partition the traffic instead of partitioning the index.
 Not the same thing clearly, but it's probably simpler to do (throw a load balancer/proxy
in front of the search servers).  If you want to partition the index, you could simply employ
some logic that specifies the maximum size of the index.  Until that limit is reached you
index to the current index.  Once the limit is reached you start a new index, possibly on
a new server if that is available, or you start a new index and migrate the closed index elsewhere.
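
A minimal sketch of the size-capped rollover described above, under some loud assumptions: the limit is counted in documents rather than bytes, every partition lives in the same process instead of on separate servers, and a small in-memory inverted index stands in for a real Lucene index. The names `Partition`, `PartitionedIndex`, and `max_docs` are made up for illustration.

```python
from collections import defaultdict


class Partition:
    """One size-capped index; a toy stand-in for a Lucene index."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.doc_count = 0

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        self.doc_count += 1

    def search(self, term):
        # .get() avoids creating empty posting lists for unknown terms.
        return self.postings.get(term.lower(), set())


class PartitionedIndex:
    """Writes go to the newest partition; a new one is started at the limit."""

    def __init__(self, max_docs):
        if max_docs < 1:
            raise ValueError("max_docs must be at least 1")
        self.max_docs = max_docs
        self.partitions = [Partition()]

    def add(self, doc_id, text):
        current = self.partitions[-1]
        if current.doc_count >= self.max_docs:
            # Limit reached: close the current index and start a new one.
            current = Partition()
            self.partitions.append(current)
        current.add(doc_id, text)

    def search(self, term):
        # A query has to consult every partition and merge the hits.
        hits = set()
        for partition in self.partitions:
            hits |= partition.search(term)
        return sorted(hits)
```

With `PartitionedIndex(max_docs=2)`, adding five entries leaves three partitions, and `search` still has to visit all of them. That fan-out on every query is the cost of partitioning the index rather than the traffic.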

I imagine Yonik, Doug, and others will have other ideas, too.


----- Original Message ----
From: Simon Willnauer <>
Sent: Saturday, July 15, 2006 10:37:11 AM
Subject: Gdata - Indexing feeds and entries

Hi there,

It has been quiet about Gdata for the last 2 weeks, but all the exams
are done and uni finished yesterday, so the next round can start.
OK, what needs to be done: the gdata protocol describes a kind of
query language for querying feeds with full-text search in defined xml
elements and / or custom elements. For that purpose the stored,
updated and deleted entries have to be reflected in the search
component to be available for searching. The indexer component of the
server has to be notified about modification events to keep the index
up to date.
I'm not yet sure how the fields / elements of the xml will be
configured, but I guess I will look for some ideas in solr or nutch and
discuss that later.
My first and main problem is pretty well known on this mailing list.
I found lots of questions and suggestions via google, but these
discussions took place quite a while ago. I was wondering if there are
any new insights about distributed searching / indexing. The server
should be able to run in clusters / server farms, so indexed data must
be available on each server / machine. I thought about this for a
while and all my ideas seem to be problematic in a certain way.
I found this thread on the mailing list

which gives a lot of information about the problem I'm facing.

It would be great if some of you experienced guys could give me
information about your experience / solution to this problem. If you
see any possibility to provide such a mechanism as a generic solution,
we could separate this out as a new contrib project after SoC has
finished, i.e. detach it from gdata.

Thanks in advance for your help ;)

