lucene-dev mailing list archives

From "Simon Willnauer"
Subject GData - Server, Indexing entries
Date Wed, 19 Jul 2006 21:52:22 GMT
Hello everyone,

Well, the last mail about distributed indexing / searching did not
receive many answers, maybe because the topic is very tough. Anyway,
I'll try to kick off the indexing / searching milestone with another
The GData server has to index all incoming entries on inserts or
updates and mark already indexed entries as deleted on delete
requests. The incoming data will be XML in the first place. How and
which XML elements are supposed to be indexed will be defined in the
server configuration. I guess it would be quite handy to configure
which elements to index using XPath expressions. That's fairly
generic, and most developers and admins are more or less familiar
with XPath. The analyzer etc. will also come from the configuration
file.
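
To make the XPath idea concrete, here is a minimal sketch using only the JDK's javax.xml.xpath API. The field names, the XPath expressions and the FIELD_CONFIG map are assumptions for illustration; in the server they would come from the configuration file, whose format is not decided yet.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathFieldExtractor {

    // Hypothetical configuration: index field name -> XPath expression.
    // local-name() is used so the sketch needs no namespace setup.
    static final Map<String, String> FIELD_CONFIG = new LinkedHashMap<>();
    static {
        FIELD_CONFIG.put("title", "/*[local-name()='entry']/*[local-name()='title']");
        FIELD_CONFIG.put("content", "/*[local-name()='entry']/*[local-name()='content']");
    }

    // Returns field name -> text value for one incoming entry.
    static Map<String, String> extract(String entryXml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : FIELD_CONFIG.entrySet()) {
            // evaluate(String, InputSource) parses the source and returns the
            // string value of the first matching node ("" if nothing matches).
            String value = xpath.evaluate(e.getValue(),
                    new InputSource(new StringReader(entryXml)));
            fields.put(e.getKey(), value.trim());
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        String entry = "<entry xmlns='http://www.w3.org/2005/Atom'>"
                + "<title>Hello GData</title>"
                + "<content type='text'>Some body text</content></entry>";
        System.out.println(extract(entry));
    }
}
```

The sketch re-parses the entry once per configured field, which keeps it short; a real implementation would more likely parse the entry once into a DOM and evaluate all expressions against it.
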
The next step is to retrieve the data from within the elements.
Elements have three types of content relevant for indexing: plain
text, HTML and XHTML (binary content might be tough to index :)
I have to remove the tags from the HTML and XHTML content. I'm aware
that there are several APIs around for doing that, but it would be
quite helpful to have some recommendations.
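
One option that needs no extra library is the HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator). The sketch below only collects the text nodes and is meant as one possible approach, not a recommendation over the other APIs; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class HtmlTextExtractor {

    // Returns the text content of an HTML fragment with all tags removed.
    static String stripTags(String html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text between tags; a space is
                // appended so adjacent runs do not fuse into one token.
                text.append(data).append(' ');
            }
        };
        // Third argument: ignore any charset declared inside the document,
        // since the input is already a decoded String.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        // Collapse the whitespace introduced above.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
    }
}
```

Because a space is added after every text run, a word split by inline markup (for example "wor<b>ld</b>") ends up as two tokens. That is usually acceptable for indexing, but it is a limitation of this sketch.
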

GData defines a kind of query "language" to query a specific feed
via GET parameters and / or defined endings of the query string.
I do have some experience with building parsers (not JavaCC but yacc /
Gentle), so I'll try to parse the so-called "GData Query" and translate
it into a Lucene query string. Using JavaCC I can create a quite fast
and nice way to build Lucene queries from incoming "GData Queries".
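
As a rough illustration of the translation step, the sketch below turns a few GData query parts into a Lucene query string with plain string handling instead of a JavaCC grammar. It assumes the "q" and "author" parameters and the "/-/" category path marker from the GData protocol; the index field names (content, author, category) are hypothetical, and quoted phrases inside "q" are not handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GDataQueryTranslator {

    // Escapes backslash and double quote so a value can sit inside a
    // quoted Lucene phrase.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // path:   request path, e.g. "/feeds/blog/-/lucene"
    // params: decoded GET parameters
    static String translate(String path, Map<String, String> params) {
        List<String> clauses = new ArrayList<>();

        // Everything after "/-/" is treated as a list of required categories.
        int marker = path.indexOf("/-/");
        if (marker >= 0) {
            for (String category : path.substring(marker + 3).split("/")) {
                if (!category.isEmpty()) {
                    clauses.add("+category:\"" + escape(category) + "\"");
                }
            }
        }

        // Full-text terms: each term is required, a leading '-' excludes it.
        String q = params.get("q");
        if (q != null) {
            for (String token : q.trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                boolean excluded = token.startsWith("-") && token.length() > 1;
                String term = excluded ? token.substring(1) : token;
                clauses.add((excluded ? "-" : "+") + "content:\"" + escape(term) + "\"");
            }
        }

        String author = params.get("author");
        if (author != null && !author.isEmpty()) {
            clauses.add("+author:\"" + escape(author) + "\"");
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "index -draft");
        params.put("author", "Simon");
        System.out.println(translate("/feeds/blog/-/lucene", params));
    }
}
```

Quoting every value keeps Lucene's special characters from being interpreted as operators, at the cost of disabling wildcards. A generated parser would be the better fit once the full query syntax has to be covered.
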

I do have lots of ideas to extend the search capabilities described in
the GData protocol, but I guess I will skip that after SoC has

I just wanna ask you guys to let me know if you have some ideas about all that.
Every comment will be highly appreciated!!!!

regards Simon
