lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tatu Saloranta <>
Subject Re: SearchBlox J2EE Search Component Version 1.1 released
Date Thu, 04 Dec 2003 00:27:18 GMT
On Tuesday 02 December 2003 09:51, Tun Lin wrote:
> Anyone knows a search engine that supports xml formats?

There's no way to generally "support xml formats", as xml is just a 
meta-language. However, building specific search engines using Lucene core it 
should be reasonably straight-forward to implement more accurate 
xml-structure-aware tokenization for specific xml applications like DocBook 
or other domain-specific apps.
So, if any search engine advertises "indexing xml content", one better read 
the fine print to learn what they really claim.

It might be interesting to create a Lucene plug-in that, given a specification 
of how sub trees under specific elements, would tokenize and index content 
into separate fields. Plus implementation shouldn't be very difficult -- just 
use standard XML parser (SAX, DOM) -- and then match xpaths, feed that to 
analyzer and then add to index. This could also be used for HTML 
(pre-filtering with JTidy or similar first to get to xml-compliant HTML).
I wouldn't be surprised if someone on list has already done this?

-+ Tatu +-

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message