lucene-java-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: clean up html before indexing or add tags to ignore list
Date Thu, 13 May 2004 10:46:37 GMT
Clean up seems cleaner.  Just extract the textual information from HTML
using NekoHTML or JTidy or HTMLParser (.sf.net) or some such.
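
As a rough illustration of the "extract the text first" approach, here is a
minimal sketch. Note that it does not use NekoHTML, JTidy or HTMLParser: to stay
dependency-free it swaps in the HTML parser that ships with the JDK (Swing's
ParserDelegator). The class name HtmlTextExtractor and the method extractText
are made up for this example; a dedicated library will likely cope better with
badly broken markup.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not part of Lucene or of any library named above).
final class HtmlTextExtractor {

    // Returns the text content of the given HTML with the markup removed.
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();

        // The callback receives one handleText() call per run of text
        // found between tags; tags themselves are simply ignored here.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append(' ');
            }
        };

        try {
            // 'true' tells the parser to ignore any charset declared in the page,
            // since we are already reading decoded characters.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // Collapse runs of whitespace so the result is tidy for indexing.
        return text.toString().replaceAll("\\s+", " ").trim();
    }
}
```

The returned string is what you would hand to Lucene as the content of a
tokenized field, so the analyzer never sees any tags at all.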

You can also get fancy and preserve the 'structural' information (e.g.
H1 text is more important than H2, which is more important than BODY,
which is more important than DIV, etc.) and combine it with field
boosting at index time.
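
One way the structural idea might look, again only as a sketch under
assumptions: the text is split into per-element buckets ("h1", "h2", "body")
using the JDK's built-in Swing HTML parser, and each bucket is paired with a
weight. The class StructuredHtmlFields, the bucket names and the weights
3.0 / 2.0 / 1.0 are all invented for illustration. In Lucene you would then add
each bucket as its own field and apply the weight as that field's boost (for
example via Field.setBoost(float) in versions that provide it).

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: buckets page text by the heading it appears in.
final class StructuredHtmlFields {

    // Assumed weights, chosen only for illustration; tune them for your data.
    static float boostFor(String field) {
        if ("h1".equals(field)) return 3.0f;
        if ("h2".equals(field)) return 2.0f;
        return 1.0f; // everything else counts as plain body text
    }

    // Maps an H1/H2 tag to its bucket name, or null for any other tag.
    private static String bucketFor(HTML.Tag tag) {
        if (tag == HTML.Tag.H1) return "h1";
        if (tag == HTML.Tag.H2) return "h2";
        return null;
    }

    // Splits the page into bucket name -> text found inside that element.
    static Map<String, String> split(String html) {
        final Map<String, StringBuilder> buffers = new LinkedHashMap<>();
        final Deque<String> open = new ArrayDeque<>(); // currently open headings

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                String bucket = bucketFor(tag);
                if (bucket != null) open.push(bucket);
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (bucketFor(tag) != null && !open.isEmpty()) open.pop();
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Text outside any tracked heading goes to the "body" bucket.
                String bucket = open.isEmpty() ? "body" : open.peek();
                StringBuilder sb = buffers.computeIfAbsent(bucket, k -> new StringBuilder());
                if (sb.length() > 0) sb.append(' ');
                sb.append(new String(data).trim());
            }
        };

        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : buffers.entrySet()) {
            fields.put(e.getKey(), e.getValue().toString());
        }
        return fields;
    }
}
```

At index time you would loop over the map returned by split(), create one
field per bucket, and set its boost from boostFor(). Keeping the weights in one
small method makes it easy to experiment with different values.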

Otis

--- Sebastian Ho <sebastianh@bii.a-star.edu.sg> wrote:
> Hi
> 
> This is a typical web crawler, indexing and search application
> development. I have written my crawler and am planning to add Lucene
> next.
> One question comes to mind: in terms of performance, do I clean up
> the HTML, removing all tags before indexing, or do I add all tags to
> the ignore list during the indexing/search stage?
> 
> Which is better?
> 
> Thanks
> 
> Sebastian Ho
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 



