lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Justin <cry...@yahoo.com>
Subject Re: Highlighter usage
Date Thu, 29 Apr 2010 22:09:41 GMT
Yeah, I understood the difference between StandardAnalyzer and StandardTokenizer.  I wasn't
sure that I want to use the StopFilter for highlighting, particularly in conjunction with
phrase queries (e.g. "Alexander The Great").  StandardAnalyzer and StandardTokenizer are final,
but I found my solution looking at the source for StandardAnalyzer: an alternate constructor
for TokenStreamComponents that accepts the filters I choose.


  private static Analyzer htmlStripAnalyzer = new ReusableAnalyzerBase() {
      @Override
      protected TokenStreamComponents createComponents(
              final String fieldName, final Reader reader) {
          final Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_30,
              new HTMLStripCharFilter(CharReader.get(reader)));
          TokenStream stream = new StandardFilter(tokenizer);
          stream = new LowerCaseFilter(Version.LUCENE_30, stream);
          stream = new SnowballFilter(stream, "English");
          return new TokenStreamComponents(tokenizer, stream);
      }
  };




----- Original Message ----
From: Erick Erickson <erickerickson@gmail.com>
To: java-user@lucene.apache.org
Sent: Thu, April 29, 2010 4:21:35 PM
Subject: Re: Highlighter usage

That's the *StandartTokenizer*, which is not at all identical to
StandardAnalyzer. From
the Javadoc for StandardAnalyzer:
Filters StandardTokenizer<http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/standard/StandardTokenizer.html>
with StandardFilter<http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/standard/StandardFilter.html>
, LowerCaseFilter<http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/LowerCaseFilter.html>
and StopFilter<http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/StopFilter.html>,
using a list of English stop words.

So your case issue is entirely explained. You have no reason at all to
expect
stemming to work unless you put a stemmer in the chain....

HTH
Erick

On Thu, Apr 29, 2010 at 5:10 PM, Justin <crynax@yahoo.com> wrote:

> I'm using my own analyzer so I can interject HTMLStripCharFilter as
> described in a previous thread.
>
>  private static Analyzer htmlStripAnalyzer = new ReusableAnalyzerBase() {
>      @Override
>      protected TokenStreamComponents createComponents(
>              final String fieldName, final Reader reader) {
>            return new TokenStreamComponents(new
> StandardTokenizer(Version.LUCENE_30,
>                    new HTMLStripCharFilter(CharReader.get(reader))));
>      }
>  };
>
> So should I replace StandardTokenizer with LowerCaseTokenizer above?  Then
> would I override nextToken() and stem or is there a better way to use
> something like SnowballFilter?
>
>
>
>
> ----- Original Message ----
> From: Erick Erickson <erickerickson@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Thu, April 29, 2010 3:30:09 PM
> Subject: Re: Highlighter usage
>
> What analyzer are you using at index time? My guess is something
> like WhitespaceAnalyzer that doesn't stem or change case..... Try
> a different analyzer, SimpleAnalyzer comes to mind....
>
> HTH
> Erick
>
> On Thu, Apr 29, 2010 at 4:21 PM, Justin <crynax@yahoo.com> wrote:
>
> > I'm trying to use Highlighter with QueryScorer after reading:
> >
> > https://issues.apache.org/jira/browse/LUCENE-1685
> >
> > The problem is: I'm not getting a result unless my the query term is an
> > exact match.  Am I missing filters?  Is there a more complete example of
> how
> > this should work?
> >
> >
> >    String content = "Global Climate Change affects us all";
> >
> >    String field = "content";
> >    BooleanQuery query = new BooleanQuery();
> >
> >    // Unstemmed, matched case works
> >    //query.add(new TermQuery(new Term(field, "Climate")),
> > BooleanClause.Occur.MUST);
> >
> >    // Stemmed, lowercase doesn't work
> >    query.add(new TermQuery(new Term(field, "climat")),
> > BooleanClause.Occur.MUST);
> >    query.add(new TermQuery(new Term(field, "affect")),
> > BooleanClause.Occur.MUST);
> >
> >    Highlighter highlighter = new Highlighter(new QueryScorer(query,
> > field));
> >    highlighter.setMaxDocCharsToAnalyze(500000);
> >
> >    TokenStream ts = htmlStripAnalyzer.tokenStream(field, new
> > StringReader(content));
> >    ts = new CachingTokenFilter(ts);
> >
> >    System.out.println(highlighter.getBestFragment(ts, content));
> >
> >
> > Thanks for any feedback,
> > Justin
> >
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>



      

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message