From: "Uwe Schindler"
Reply-To: java-user@lucene.apache.org
Subject: RE: HTMLStripReader, HTMLStripCharFilter
Date: Tue, 27 Apr 2010 07:00:31 +0200
Message-ID: <027101cae5c6$8ff9f990$afedecb0$@de>
In-Reply-To: <111835.48056.qm@web32903.mail.mud.yahoo.com>
References: <43985.50280.qm@web32907.mail.mud.yahoo.com> <111835.48056.qm@web32903.mail.mud.yahoo.com>

To reset this token stream you have to wrap it with a CachingTokenFilter.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Justin [mailto:crynax@yahoo.com]
> Sent: Tuesday, April 27, 2010 1:16 AM
> To: java-user@lucene.apache.org
> Subject: Re: HTMLStripReader, HTMLStripCharFilter
>
> Thanks for the update! I appreciate the hard work.
>
> Perhaps someone can help me with the use of HTMLStripCharFilter...
>
> I get an exception (3.1-dev) similar to the one reported here (2.9):
>
> https://issues.apache.org/jira/browse/LUCENE-1695
>
> With the following code:
>
> Analyzer htmlStripAnalyzer = new ReusableAnalyzerBase() {
>     @Override
>     protected TokenStreamComponents createComponents(
>             final String fieldName, final Reader reader) {
>         return new TokenStreamComponents(new StandardTokenizer(Version.LUCENE_30,
>             new HTMLStripCharFilter(CharReader.get(reader))));
>     }
> };
> String content = reader.document(id, fieldSelector).get(field);
> TokenStream ts = htmlStripAnalyzer.tokenStream(field, new StringReader(content));
> String best = highlighter.getBestFragments(ts, content,
>     DEFAULT_EXCERPT_FRAGS, DEFAULT_EXCERPT_SEPARATOR);
> OffsetAttribute off = ts.addAttribute(OffsetAttribute.class);
> ts.reset();
> ts.incrementToken();
>
> java.io.IOException: Stream closed
>     at java.io.StringReader.ensureOpen(StringReader.java:39)
>     at java.io.StringReader.read(StringReader.java:73)
>     at org.apache.lucene.analysis.CharReader.read(CharReader.java:54)
>     at java.io.Reader.read(Reader.java:104)
>     at org.apache.solr.analysis.HTMLStripCharFilter.next(HTMLStripCharFilter.java:92)
>     at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:690)
>     at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:748)
>     at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:453)
>     at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:639)
>     at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:167)
>
> Looking at the source, I wonder if Tokenizer should override reset():
>
> public void reset() throws IOException {
>     if (input != null) input.reset(); // would reset CharReader, StringReader
> }
>
> ----- Original Message ----
> From: Robert Muir
> To: java-user@lucene.apache.org
> Sent: Sat, April 24, 2010 9:03:02 AM
> Subject: Re: HTMLStripReader, HTMLStripCharFilter
>
> On Fri, Apr 23, 2010 at 4:48 PM, Justin wrote:
>
> > Just out of curiosity, why does LUCENE-1377 have a minor priority?
> >
> > https://issues.apache.org/jira/browse/LUCENE-1377
> >
> > Don't people index, filter, search HTML, perhaps more than any other
> > format?
>
> Rest assured we are working on this... but it unfortunately won't happen
> overnight. First of all, the development of Lucene and Solr was merged such
> that there is now one team working on this stuff. This way, both Solr and
> Lucene developers can maintain this stuff.
>
> There is now the practical issue to combine all Lucene and Solr analyzers
> (not just the two components listed on that issue) into one package that can
> then be used by both Lucene and Solr users:
> https://issues.apache.org/jira/browse/LUCENE-2413
>
> --
> Robert Muir
> rcmuir@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
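
For illustration only: the trace quoted above suggests the underlying StringReader had already been closed by the time ts.reset() ran, so rewinding the Reader itself cannot work. The cache-and-replay idea behind the advice at the top of this message (store the tokens on the first pass, then let reset() rewind over the stored tokens) can be sketched with JDK classes alone. This is a stand-in under stated assumptions, not Lucene's CachingTokenFilter: the class name ReplayableTokens, its methods, and the whitespace tokenization are all hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/**
 * Hypothetical JDK-only illustration of cache-and-replay.
 * NOT part of Lucene; it only mimics the idea of caching tokens
 * so that a second pass never touches the (possibly closed) Reader.
 */
final class ReplayableTokens {
    private final Reader input;
    private List<String> cache;          // filled lazily on the first call to next()
    private Iterator<String> iterator;   // current position within the cache

    ReplayableTokens(Reader input) {
        this.input = input;
    }

    /** Returns the next token, or null when the tokens are exhausted. */
    String next() {
        if (cache == null) {
            cache = readAll();           // the only time the Reader is read
            iterator = cache.iterator();
        }
        return iterator.hasNext() ? iterator.next() : null;
    }

    /** Rewinds to the first cached token without touching the Reader. */
    void reset() {
        if (cache != null) {
            iterator = cache.iterator();
        }
    }

    /** Closes the underlying Reader; cached tokens remain available. */
    void close() {
        try {
            input.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Splits the whole input on whitespace (an assumption made for brevity). */
    private List<String> readAll() {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = input.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (current.length() > 0) {
                        tokens.add(current.toString());
                        current.setLength(0);
                    }
                } else {
                    current.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        ReplayableTokens ts = new ReplayableTokens(new StringReader("strip the html"));
        while (ts.next() != null) { }    // first pass, e.g. a consumer such as a highlighter
        ts.close();                      // the consumer closes the stream
        ts.reset();                      // still works: replays from the cache
        System.out.println(ts.next());   // prints "strip"
    }
}
```

With Lucene itself the corresponding step would presumably be to hand `new CachingTokenFilter(ts)` to the highlighter and call reset() on that wrapper afterwards, rather than on the original stream.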