From: Apache Wiki
To: java-commits@lucene.apache.org
Reply-To: java-dev@lucene.apache.org
Date: Thu, 09 Apr 2009 23:27:27 -0000
Message-ID: <20090409232727.15077.84943@aurora.apache.org>
Subject: [Lucene-java Wiki] Trivial Update of "LuceneFAQ" by MartinJericho

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by MartinJericho:
http://wiki.apache.org/jakarta-lucene/LuceneFAQ

The comment on the change is:
Updated link to Jericho HTML Parser TextExtractor javadoc

------------------------------------------------------------------------------
  The author of [http://furl.net FURL] recommends [http://www.tagsoup.info TagSoup].

- [http://jerichohtml.sourceforge.net/ Jericho HTML Parser] provides a simple [http://jerichohtml.sourceforge.net/doc/api/au/id/jericho/lib/html/TextExtractor.html TextExtractor] class that converts any segment of an HTML document into a string of space-separated words, optionally including the values from title, alt, label, and summary attributes. The parser is also very tolerant of badly formatted HTML and can also handle server-based source tags such as JSP, ASP, PHP etc.
+ [http://jerichohtml.sourceforge.net/ Jericho HTML Parser] provides a simple [http://jericho.htmlparser.net/docs/javadoc/index.html?net/htmlparser/jericho/TextExtractor.html TextExtractor] class that converts any segment of an HTML document into a string of space-separated words, optionally including the values from title, alt, label, and summary attributes. The parser is also very tolerant of badly formatted HTML and can also handle server-based source tags such as JSP, ASP, PHP etc.

  ==== How can I index XML documents? ====