lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andy Goodell <good...@gmail.com>
Subject Re: clean up html before indexing or add tags to ignore list
Date Thu, 13 May 2004 16:05:29 GMT
If you are running linux, i recommend before indexing with lucene, you
use the program lynx with the option -dump which dumps the formatted
text without the tags, and runs really really fast in most cases.

- andy g

On Thu, 13 May 2004 03:46:37 -0700 (PDT), Otis Gospodnetic
<otis_gospodnetic@yahoo.com> wrote:
> 
> Clean up seems cleaner.  Just extract the textual information from HTML
> using NekoHTML or JTidy or HTMLParser (.sf.net) or some such.
> 
> You can also get fancy and preserve the 'structural' information (e.g.
> H1 text is more important that H2, which is more important than BODY,
> which is more important that DIV, etc.) and combine it with field
> boosting at index time.
> 
> Otis
> 
> 
> 
> --- Sebastian Ho <sebastianh@bii.a-star.edu.sg> wrote:
> > Hi
> >
> > This is a typical web crawler, indexing and search application
> > development. I have wrote my crawler and planning to add lucene in
> > next.
> > One questions pop to my mind, in terms of performance, do i clean up
> > the
> > html removing all tags before indexing, or i add all tags into the
> > ignore list during indexing/search stage.
> >
> > Which is better?
> >
> > Thanks
> >
> > Sebastian Ho
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
>

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message