lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joshua O'Madadhain" <>
Subject Re: Tags Screwing up Searches
Date Tue, 22 Oct 2002 00:57:19 GMT
On Mon, 21 Oct 2002, Terry Steichen wrote:

> Thanks for the comments - you might have something there.  What I do
> is clean up the HTML with JTidy and then parse it into a DOM.  Then I
> use selected parts to create a new DOM which I write out as an XML
> file.  I then use Lucene to index the XML files.  Upon retrieval, I
> once again parse the XML, format it and render it to a browser.
> The conversion from brackets to entities is necessary in order for the
> browser (which will subsequently view it) to render it properly.
> But maybe, in the indexing process, I could convert it back again (to
> brackets), but I'm not sure what to do with it then - in other words,
> how to bring an HTML parser into the picture.  If you have ideas on
> this, I'd very much appreciate hearing them.

Perhaps there is some reason for the conversion to XML that I'm not
understanding (and this isn't really within my area of expertise).  

But if your purpose is to index HTML files and then display them later in
response to a search, why not just use JTidy and then index the HTML
instead (skipping the DOM and XML stages entirely), and then return the
(cleaned-up) HTML later when asked for?  The basis of any 'semantic' tags
that you might be putting in the XML (perhaps to define Lucene fields)
must be there in the HTML anyway, so I'm not sure what the DOM and XML
representations get you.


Joshua O'Madadhain Per
  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.

To unsubscribe, e-mail:   <>
For additional commands, e-mail: <>

View raw message