Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Content-Type: text/plain;
  charset="iso-8859-1"
From: Tatu Saloranta <tatu@hypermall.net>
Reply-To: tatu@hypermall.net
Organization: Linux-users missalie
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Subject: Re: SearchBlox J2EE Search Component Version 1.1 released
Date: Wed, 3 Dec 2003 17:27:18 -0700
User-Agent: KMail/1.4.3
References: <200312021651.hB2GpHQQ006075@smtp24.singnet.com.sg>
In-Reply-To: <200312021651.hB2GpHQQ006075@smtp24.singnet.com.sg>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Message-Id: <200312031727.18501.tatu@hypermall.net>

On Tuesday 02 December 2003 09:51, Tun Lin wrote:
> Anyone knows a search engine that supports xml formats?

There's no way to generally "support xml formats", as xml is just a 
meta-language. However, building specific search engines using Lucene core it 
should be reasonably straight-forward to implement more accurate 
xml-structure-aware tokenization for specific xml applications like DocBook 
or other domain-specific apps.
So, if any search engine advertises "indexing xml content", one better read 
the fine print to learn what they really claim.

It might be interesting to create a Lucene plug-in that, given a specification 
of how sub trees under specific elements, would tokenize and index content 
into separate fields. Plus implementation shouldn't be very difficult -- just 
use standard XML parser (SAX, DOM) -- and then match xpaths, feed that to 
analyzer and then add to index. This could also be used for HTML 
(pre-filtering with JTidy or similar first to get to xml-compliant HTML).
I wouldn't be surprised if someone on list has already done this?

-+ Tatu +-


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org