From: Tatu Saloranta
To: "Lucene Users List"
Subject: Re: SearchBlox J2EE Search Component Version 1.1 released
Date: Wed, 3 Dec 2003 17:27:18 -0700

On Tuesday 02 December 2003 09:51, Tun Lin wrote:
> Anyone knows a search engine that supports xml formats?

There's no way to generally "support xml formats", as xml is just a meta-language.
However, when building a specific search engine on the Lucene core, it
should be reasonably straightforward to implement more accurate,
XML-structure-aware tokenization for specific XML applications like
DocBook or other domain-specific apps. So, if any search engine
advertises "indexing XML content", one had better read the fine print to
learn what it really claims.

It might be interesting to create a Lucene plug-in that, given a
specification of how sub-trees under specific elements should be handled,
would tokenize and index content into separate fields. Plus, the
implementation shouldn't be very difficult: just use a standard XML
parser (SAX, DOM), match XPaths, feed the matched content to an analyzer,
and then add it to the index. This could also be used for HTML
(pre-filtering with JTidy or similar first to get XML-compliant HTML).

I wouldn't be surprised if someone on the list has already done this.

-+ Tatu +-
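
A minimal sketch of the XPath-to-field idea described above, in Python
rather than against the Lucene API. Everything named here is an
assumption for illustration: the FIELD_PATHS mapping, the tokenize()
stand-in for an analyzer, and the sample document are all hypothetical,
and the path expressions use only the limited XPath subset that
xml.etree.ElementTree supports. The result is a plain dictionary of
field names to tokens; a real version would add those to an index
instead.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical specification: index field name -> path expression.
FIELD_PATHS = {
    "title": "./info/title",
    "section_title": ".//section/title",
    "body": ".//para",
}


def tokenize(text):
    """Stand-in for an analyzer: lowercase, split on non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def extract_fields(xml_text, field_paths=FIELD_PATHS):
    """Return {field name: list of tokens} for one XML document."""
    root = ET.fromstring(xml_text)
    fields = {}
    for field, path in field_paths.items():
        tokens = []
        for element in root.findall(path):
            # itertext() yields all text in the sub-tree under the match,
            # including text inside inline child elements.
            tokens.extend(tokenize(" ".join(element.itertext())))
        fields[field] = tokens
    return fields


# Hypothetical DocBook-like sample document.
SAMPLE = """<book>
  <info><title>Searching XML</title></info>
  <section>
    <title>Tokenizing</title>
    <para>Split text into <emphasis>terms</emphasis>.</para>
  </section>
</book>"""

if __name__ == "__main__":
    for name, toks in extract_fields(SAMPLE).items():
        print(name, toks)
```

Keeping the mapping as data rather than code is what would let one
plug-in serve several XML applications: only the specification changes
per document type.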