lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Herbert Roitblat" <h...@orcatec.com>
Subject Re: HTMLStripReader, HTMLStripCharFilter
Date Tue, 27 Apr 2010 19:01:57 GMT
Great, I will look forward to it.
Thanks,
Herb
----- Original Message ----- 
From: "Justin" <crynax@yahoo.com>
To: <java-user@lucene.apache.org>
Sent: Tuesday, April 27, 2010 11:47 AM
Subject: Re: HTMLStripReader, HTMLStripCharFilter


> Thanks for the help.  No more exception.  Seems odd that I need to add a 
> filter to make reset apply to the stream's underlying reader.
>
>
>
>
> ----- Original Message ----
> From: Uwe Schindler <uwe@thetaphi.de>
> To: java-user@lucene.apache.org
> Sent: Tue, April 27, 2010 12:00:31 AM
> Subject: RE: HTMLStripReader, HTMLStripCharFilter
>
> To reset this token stream you have to wrap it with a CachingTokenFilter.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
>> -----Original Message-----
>> From: Justin [mailto:crynax@yahoo.com]
>> Sent: Tuesday, April 27, 2010 1:16 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: HTMLStripReader, HTMLStripCharFilter
>>
>> Thanks for the update!  I appreciate the hard work.
>>
>> Perhaps someone can help me with the use of HTMLStripCharFilter...
>>
>>
>> I get an exception (3.1-dev) similar to the one reported here (2.9):
>>
>> https://issues.apache.org/jira/browse/LUCENE-1695
>>
>>
>> With the following code:
>>
>>     Analyzer htmlStripAnalyzer = new ReusableAnalyzerBase() {
>>         @Override
>>         protected TokenStreamComponents createComponents(
>>                 final String fieldName, final Reader reader) {
>>             return new TokenStreamComponents(new
>> StandardTokenizer(Version.LUCENE_30,
>>                     new HTMLStripCharFilter(CharReader.get(reader))));
>>         }
>>     };
>>     String content = reader.document(id, fieldSelector).get(field);
>>     TokenStream ts = htmlStripAnalyzer.tokenStream(field, new
>> StringReader(content));
>>     String best = highlighter.getBestFragments(ts, content,
>>       DEFAULT_EXCERPT_FRAGS, DEFAULT_EXCERPT_SEPARATOR);
>>     OffsetAttribute off = ts.addAttribute(OffsetAttribute.class);
>>     ts.reset();
>>     ts.incrementToken();
>>
>>
>> java.io.IOException: Stream closed
>>         at java.io.StringReader.ensureOpen(StringReader.java:39)
>>         at java.io.StringReader.read(StringReader.java:73)
>>         at
>> org.apache.lucene.analysis.CharReader.read(CharReader.java:54)
>>         at java.io.Reader.read(Reader.java:104)
>>         at
>> org.apache.solr.analysis.HTMLStripCharFilter.next(HTMLStripCharFilter.j
>> ava:92)
>>         at
>> org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.j
>> ava:690)
>>         at
>> org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.j
>> ava:748)
>>         at
>> org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(Stan
>> dardTokenizerImpl.java:453)
>>         at
>> org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(
>> StandardTokenizerImpl.java:639)
>>         at
>> org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(St
>> andardTokenizer.java:167)
>>
>>
>> Looking at the source, I wonder if Tokenizer should override reset():
>>
>>   public void reset() throws IOException {
>>     if (input != null) input.reset(); // would reset CharReader,
>> StringReader
>>   }
>>
>>
>>
>>
>>
>> ----- Original Message ----
>> From: Robert Muir <rcmuir@gmail.com>
>> To: java-user@lucene.apache.org
>> Sent: Sat, April 24, 2010 9:03:02 AM
>> Subject: Re: HTMLStripReader, HTMLStripCharFilter
>>
>> On Fri, Apr 23, 2010 at 4:48 PM, Justin <crynax@yahoo.com> wrote:
>>
>> > Just out of curiousity, why does LUCENE-1377 have a minor priorty?
>> >
>> > https://issues.apache.org/jira/browse/LUCENE-1377
>> >
>> > Don't people index, filter, search HTML, perhaps more than any other
>> > format?
>> >
>> >
>> Rest assured we are working on this... but it unfortunately won't
>> happen
>> overnight. First of all, the development of Lucene and Solr was merged
>> such
>> that there is now one team working on this stuff. This way, both Solr
>> and
>> Lucene developers can maintain this stuff.
>>
>> There is now the practical issue to combine all Lucene and Solr
>> analyzers
>> (not just the two components listed on that issue) into one package
>> that can
>> then be used by both Lucene and Solr users:
>> https://issues.apache.org/jira/browse/LUCENE-2413
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message