lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Trivial Update of "LuceneFAQ" by MartinJericho
Date Thu, 09 Apr 2009 23:27:27 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by MartinJericho:
http://wiki.apache.org/jakarta-lucene/LuceneFAQ

The comment on the change is:
Updated link to Jericho HTML Parser TextExtractor javadoc

------------------------------------------------------------------------------
  
  The author of [http://furl.net FURL] recommends [http://www.tagsoup.info TagSoup].
  
- [http://jerichohtml.sourceforge.net/ Jericho HTML Parser] provides a simple [http://jerichohtml.sourceforge.net/doc/api/au/id/jericho/lib/html/TextExtractor.html TextExtractor] class that converts any segment of an HTML document into a string of space-separated words, optionally including the values from title, alt, label, and summary attributes.  The parser is also very tolerant of badly formatted HTML and can also handle server-based source tags such as JSP, ASP, PHP etc.
+ [http://jerichohtml.sourceforge.net/ Jericho HTML Parser] provides a simple [http://jericho.htmlparser.net/docs/javadoc/index.html?net/htmlparser/jericho/TextExtractor.html TextExtractor] class that converts any segment of an HTML document into a string of space-separated words, optionally including the values from title, alt, label, and summary attributes.  The parser is also very tolerant of badly formatted HTML and can also handle server-based source tags such as JSP, ASP, PHP etc.
  
  
  ==== How can I index XML documents? ====
