Message-ID: <111835.48056.qm@web32903.mail.mud.yahoo.com>
Date: Mon, 26 Apr 2010 16:15:56 -0700 (PDT)
From: Justin
Subject: Re: HTMLStripReader, HTMLStripCharFilter
To: java-user@lucene.apache.org

Thanks for the update! I appreciate the hard work. Perhaps someone can help me with the use of HTMLStripCharFilter...
I get an exception (3.1-dev) similar to the one reported here (2.9):

https://issues.apache.org/jira/browse/LUCENE-1695

with the following code:

    Analyzer htmlStripAnalyzer = new ReusableAnalyzerBase() {
        @Override
        protected TokenStreamComponents createComponents(
                final String fieldName, final Reader reader) {
            return new TokenStreamComponents(
                new StandardTokenizer(Version.LUCENE_30,
                    new HTMLStripCharFilter(CharReader.get(reader))));
        }
    };

    String content = reader.document(id, fieldSelector).get(field);
    TokenStream ts = htmlStripAnalyzer.tokenStream(field, new StringReader(content));
    String best = highlighter.getBestFragments(ts, content,
        DEFAULT_EXCERPT_FRAGS, DEFAULT_EXCERPT_SEPARATOR);
    OffsetAttribute off = ts.addAttribute(OffsetAttribute.class);
    ts.reset();
    ts.incrementToken();

The last call throws:

    java.io.IOException: Stream closed
        at java.io.StringReader.ensureOpen(StringReader.java:39)
        at java.io.StringReader.read(StringReader.java:73)
        at org.apache.lucene.analysis.CharReader.read(CharReader.java:54)
        at java.io.Reader.read(Reader.java:104)
        at org.apache.solr.analysis.HTMLStripCharFilter.next(HTMLStripCharFilter.java:92)
        at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:690)
        at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:748)
        at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:453)
        at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:639)
        at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:167)

Looking at the source, I wonder if Tokenizer should override reset():

    public void reset() throws IOException {
        if (input != null)
            input.reset(); // would reset CharReader, StringReader
    }

----- Original Message ----
From: Robert Muir
To: java-user@lucene.apache.org
Sent: Sat, April 24, 2010 9:03:02 AM
Subject: Re: HTMLStripReader, HTMLStripCharFilter

On Fri, Apr 23, 2010 at 4:48 PM, Justin wrote:

> Just out of curiosity, why does LUCENE-1377 have a minor priority?
>
> https://issues.apache.org/jira/browse/LUCENE-1377
>
> Don't people index, filter, search HTML, perhaps more than any other
> format?

Rest assured we are working on this... but it unfortunately won't happen overnight.

First of all, the development of Lucene and Solr was merged, so there is now one team working on this stuff. This way, both Solr and Lucene developers can maintain it.

There is now the practical issue of combining all Lucene and Solr analyzers (not just the two components listed on that issue) into one package that can then be used by both Lucene and Solr users:

https://issues.apache.org/jira/browse/LUCENE-2413

--
Robert Muir
rcmuir@gmail.com
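
A note on the "Stream closed" trace in the message at the top: one possible reading, not confirmed anywhere in this thread, is that the highlighter call consumes the token stream and closes the underlying StringReader before ts.reset() and ts.incrementToken() run. The sketch below uses only java.io, no Lucene classes, to show what a closed StringReader does. The class and method names (ClosedReaderDemo, readAfterClose, resetAfterClose, readFresh) are made up for illustration. If that reading is right, an input.reset() inside Tokenizer.reset() would fail the same way, and the stream would need a fresh Reader instead.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class; it only exercises java.io.StringReader.
class ClosedReaderDemo {

    // Closes the reader (standing in for a consumer that closes its input
    // when it is done), then reads. Returns the exception message, or null
    // if no exception was thrown.
    static String readAfterClose(String content) {
        Reader reader = new StringReader(content);
        try {
            reader.close();
            reader.read();
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    // Same as above, but calls reset() on the closed reader instead of read().
    static String resetAfterClose(String content) {
        Reader reader = new StringReader(content);
        try {
            reader.close();
            reader.reset();
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    // Reads the first character through a brand-new reader over the same text.
    static int readFresh(String content) {
        try (Reader reader = new StringReader(content)) {
            return reader.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String content = "<b>hi</b>";
        System.out.println(readAfterClose(content));   // prints "Stream closed"
        System.out.println(resetAfterClose(content));  // prints "Stream closed"
        System.out.println((char) readFresh(content)); // prints "<"
    }
}
```

Under that assumption, building a second TokenStream from a new StringReader(content) after the highlighter call, rather than reusing ts, would avoid reading from the closed reader.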