lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: HTML text extraction
Date Wed, 21 Jun 2006 07:37:27 GMT

if you just want something to extract the text from HTML, without trying
to extract structure (ie: you don't care about title vs h1 vs bold vs meta
keywords) then the HTMLStripReader (or
HTMLStripWhitespaceTokenizerFactory) Yonik wrote for Solr might be
usefull.  It wasn't intended to deal with full HTML documents (hence it
doesn't have any mechanism for infering strucutre) but it was intended to
do the best job possible when deling with dirty data that might be plain
text, or it might be a chunk of HTML, or it might be mostly plain text
with a little bit of html sprinkled in.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message