lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Herbert Roitblat" <h...@orcatec.com>
Subject Re: HTMLStripReader, HTMLStripCharFilter
Date Tue, 27 Apr 2010 19:03:21 GMT
Oops. Sorry.  replied to wrong message.
----- Original Message ----- 
From: "Herbert Roitblat" <herb@orcatec.com>
To: <java-user@lucene.apache.org>
Sent: Tuesday, April 27, 2010 12:01 PM
Subject: Re: HTMLStripReader, HTMLStripCharFilter


> Great, I will look forward to it.
> Thanks,
> Herb
> ----- Original Message ----- 
> From: "Justin" <crynax@yahoo.com>
> To: <java-user@lucene.apache.org>
> Sent: Tuesday, April 27, 2010 11:47 AM
> Subject: Re: HTMLStripReader, HTMLStripCharFilter
>
>
>> Thanks for the help.  No more exception.  Seems odd that I need to add a 
>> filter to make reset apply to the stream's underlying reader.
>>
>>
>>
>>
>> ----- Original Message ----
>> From: Uwe Schindler <uwe@thetaphi.de>
>> To: java-user@lucene.apache.org
>> Sent: Tue, April 27, 2010 12:00:31 AM
>> Subject: RE: HTMLStripReader, HTMLStripCharFilter
>>
>> To reset this token stream you have to wrap it with a CachingTokenFilter.
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>
>>
>>> -----Original Message-----
>>> From: Justin [mailto:crynax@yahoo.com]
>>> Sent: Tuesday, April 27, 2010 1:16 AM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: HTMLStripReader, HTMLStripCharFilter
>>>
>>> Thanks for the update!  I appreciate the hard work.
>>>
>>> Perhaps someone can help me with the use of HTMLStripCharFilter...
>>>
>>>
>>> I get an exception (3.1-dev) similar to the one reported here (2.9):
>>>
>>> https://issues.apache.org/jira/browse/LUCENE-1695
>>>
>>>
>>> With the following code:
>>>
>>>     Analyzer htmlStripAnalyzer = new ReusableAnalyzerBase() {
>>>         @Override
>>>         protected TokenStreamComponents createComponents(
>>>                 final String fieldName, final Reader reader) {
>>>             return new TokenStreamComponents(new
>>> StandardTokenizer(Version.LUCENE_30,
>>>                     new HTMLStripCharFilter(CharReader.get(reader))));
>>>         }
>>>     };
>>>     String content = reader.document(id, fieldSelector).get(field);
>>>     TokenStream ts = htmlStripAnalyzer.tokenStream(field, new
>>> StringReader(content));
>>>     String best = highlighter.getBestFragments(ts, content,
>>>       DEFAULT_EXCERPT_FRAGS, DEFAULT_EXCERPT_SEPARATOR);
>>>     OffsetAttribute off = ts.addAttribute(OffsetAttribute.class);
>>>     ts.reset();
>>>     ts.incrementToken();
>>>
>>>
>>> java.io.IOException: Stream closed
>>>         at java.io.StringReader.ensureOpen(StringReader.java:39)
>>>         at java.io.StringReader.read(StringReader.java:73)
>>>         at
>>> org.apache.lucene.analysis.CharReader.read(CharReader.java:54)
>>>         at java.io.Reader.read(Reader.java:104)
>>>         at
>>> org.apache.solr.analysis.HTMLStripCharFilter.next(HTMLStripCharFilter.j
>>> ava:92)
>>>         at
>>> org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.j
>>> ava:690)
>>>         at
>>> org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.j
>>> ava:748)
>>>         at
>>> org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(Stan
>>> dardTokenizerImpl.java:453)
>>>         at
>>> org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(
>>> StandardTokenizerImpl.java:639)
>>>         at
>>> org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(St
>>> andardTokenizer.java:167)
>>>
>>>
>>> Looking at the source, I wonder if Tokenizer should override reset():
>>>
>>>   public void reset() throws IOException {
>>>     if (input != null) input.reset(); // would reset CharReader,
>>> StringReader
>>>   }
>>>
>>>
>>>
>>>
>>>
>>> ----- Original Message ----
>>> From: Robert Muir <rcmuir@gmail.com>
>>> To: java-user@lucene.apache.org
>>> Sent: Sat, April 24, 2010 9:03:02 AM
>>> Subject: Re: HTMLStripReader, HTMLStripCharFilter
>>>
>>> On Fri, Apr 23, 2010 at 4:48 PM, Justin <crynax@yahoo.com> wrote:
>>>
>>> > Just out of curiousity, why does LUCENE-1377 have a minor priorty?
>>> >
>>> > https://issues.apache.org/jira/browse/LUCENE-1377
>>> >
>>> > Don't people index, filter, search HTML, perhaps more than any other
>>> > format?
>>> >
>>> >
>>> Rest assured we are working on this... but it unfortunately won't
>>> happen
>>> overnight. First of all, the development of Lucene and Solr was merged
>>> such
>>> that there is now one team working on this stuff. This way, both Solr
>>> and
>>> Lucene developers can maintain this stuff.
>>>
>>> There is now the practical issue to combine all Lucene and Solr
>>> analyzers
>>> (not just the two components listed on that issue) into one package
>>> that can
>>> then be used by both Lucene and Solr users:
>>> https://issues.apache.org/jira/browse/LUCENE-2413
>>>
>>> --
>>> Robert Muir
>>> rcmuir@gmail.com
>>>
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message