lucene-java-user mailing list archives

From Justin <cry...@yahoo.com>
Subject Re: HTMLStripReader, HTMLStripCharFilter
Date Mon, 26 Apr 2010 23:15:56 GMT
Thanks for the update!  I appreciate the hard work.

Perhaps someone can help me with the use of HTMLStripCharFilter...


On 3.1-dev I get an exception similar to the one reported against 2.9 here:

https://issues.apache.org/jira/browse/LUCENE-1695


With the following code:

    Analyzer htmlStripAnalyzer = new ReusableAnalyzerBase() {
        @Override
        protected TokenStreamComponents createComponents(
                final String fieldName, final Reader reader) {
            return new TokenStreamComponents(new StandardTokenizer(Version.LUCENE_30,
                    new HTMLStripCharFilter(CharReader.get(reader))));
        }
    };
    String content = reader.document(id, fieldSelector).get(field);
    TokenStream ts = htmlStripAnalyzer.tokenStream(field, new StringReader(content));
    String best = highlighter.getBestFragments(ts, content,
      DEFAULT_EXCERPT_FRAGS, DEFAULT_EXCERPT_SEPARATOR);
    OffsetAttribute off = ts.addAttribute(OffsetAttribute.class);
    ts.reset();
    ts.incrementToken();


java.io.IOException: Stream closed
        at java.io.StringReader.ensureOpen(StringReader.java:39)
        at java.io.StringReader.read(StringReader.java:73)
        at org.apache.lucene.analysis.CharReader.read(CharReader.java:54)
        at java.io.Reader.read(Reader.java:104)
        at org.apache.solr.analysis.HTMLStripCharFilter.next(HTMLStripCharFilter.java:92)
        at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:690)
        at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:748)
        at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:453)
        at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:639)
        at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:167)
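
The trace bottoms out in StringReader.ensureOpen, which suggests the underlying reader was already closed by the time incrementToken() ran. Below is a minimal sketch of that failure mode using only java.io, not Lucene. BufferedReader stands in for the HTMLStripCharFilter/StandardTokenizer chain, and the assumption (not checked against the Highlighter source here) is that getBestFragments closes the token stream it consumes, which in turn closes the wrapped reader. The class and method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ClosedReaderDemo {

    /**
     * Wraps a StringReader, closes the wrapper, then reads from the inner
     * reader. Returns the IOException message, or null if no exception occurs.
     */
    public static String readAfterWrapperClose(String text) {
        Reader inner = new StringReader(text);
        // Stand-in for the char filter / tokenizer wrapped around the reader.
        Reader outer = new BufferedReader(inner);
        try {
            outer.read();   // consume something, as a highlighter would
            outer.close();  // closing the wrapper also closes the inner reader
            inner.read();   // a later read now fails
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(readAfterWrapperClose("<b>hello</b>"));
    }
}
```

If that assumption holds for the code above, the second pass over the text would need its own StringReader and token stream rather than the one already handed to the highlighter.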


Looking at the source, I wonder if Tokenizer should override reset():

  public void reset() throws IOException {
    if (input != null) input.reset(); // would reset CharReader, StringReader
  }
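
One caveat with this idea, sketched with plain java.io: StringReader.reset() checks that the reader is still open, so it cannot revive a reader that has already been closed. If the reader was closed before reset() is called, a fresh StringReader would be needed instead. The class and method names below are hypothetical and exist only for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ResetAfterCloseDemo {

    /** Returns true if reset() throws IOException on a closed StringReader. */
    public static boolean resetFailsAfterClose(String text) {
        StringReader reader = new StringReader(text);
        reader.close();
        try {
            reader.reset(); // reset() requires an open reader
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    /** Reads the first character from a newly created reader over the same text. */
    public static int firstCharFromFreshReader(String text) {
        try (StringReader fresh = new StringReader(text)) {
            return fresh.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resetFailsAfterClose("<p>text</p>"));
        System.out.println((char) firstCharFromFreshReader("<p>text</p>"));
    }
}
```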





----- Original Message ----
From: Robert Muir <rcmuir@gmail.com>
To: java-user@lucene.apache.org
Sent: Sat, April 24, 2010 9:03:02 AM
Subject: Re: HTMLStripReader, HTMLStripCharFilter

On Fri, Apr 23, 2010 at 4:48 PM, Justin <crynax@yahoo.com> wrote:

> Just out of curiosity, why does LUCENE-1377 have a minor priority?
>
> https://issues.apache.org/jira/browse/LUCENE-1377
>
> Don't people index, filter, search HTML, perhaps more than any other
> format?
>
>
Rest assured we are working on this... but it unfortunately won't happen
overnight. First of all, the development of Lucene and Solr was merged, so
there is now one team working on both. This way, both Solr and Lucene
developers can maintain these components.

There is now the practical issue of combining all Lucene and Solr analyzers
(not just the two components listed on that issue) into one package that
both Lucene and Solr users can use:
https://issues.apache.org/jira/browse/LUCENE-2413

-- 
Robert Muir
rcmuir@gmail.com



      
