lucene-dev mailing list archives

From "Elmer Garduno (JIRA)" <j...@apache.org>
Subject [jira] Reopened: (LUCENE-2229) SimpleSpanFragmenter fails to start a new fragment
Date Wed, 11 Aug 2010 22:19:19 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elmer Garduno reopened LUCENE-2229:
-----------------------------------


> SimpleSpanFragmenter fails to start a new fragment
> --------------------------------------------------
>
>                 Key: LUCENE-2229
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2229
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/highlighter
>            Reporter: Elmer Garduno
>            Priority: Minor
>         Attachments: LUCENE-2229.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> SimpleSpanFragmenter fails to identify a new fragment when there is more than one stop
word after a span is detected. This problem can be observed when the Query contains a PhraseQuery.
> The problem is that the span extends to the end of the TokenGroup. When a span starts, the
fragmenter sets {{waitForPos = positionSpans.get(i).end + 1;}}, and every call then advances
{{position += posIncAtt.getPositionIncrement();}}. When stop words are skipped, the increment is
greater than 1, so {{position}} jumps past {{waitForPos}} and {{(waitForPos == position)}}
never matches.
> {code:title=SimpleSpanFragmenter.java}
>   public boolean isNewFragment() {
>     position += posIncAtt.getPositionIncrement();
>     if (waitForPos == position) {
>       waitForPos = -1;
>     } else if (waitForPos != -1) {
>       return false;
>     }
>     WeightedSpanTerm wSpanTerm = queryScorer.getWeightedSpanTerm(termAtt.term());
>     if (wSpanTerm != null) {
>       List<PositionSpan> positionSpans = wSpanTerm.getPositionSpans();
>       for (int i = 0; i < positionSpans.size(); i++) {
>         if (positionSpans.get(i).start == position) {
>           waitForPos = positionSpans.get(i).end + 1;
>           break;
>         }
>       }
>     }
>    ...
> {code}
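> The mismatch can be reproduced without Lucene. The following is a minimal sketch, not the fragmenter itself: the class {{WaitForPosDemo}}, the method {{releaseIndex}}, and the position increments are hypothetical and chosen only to mimic stop-word gaps after a two-token span. It compares the strict equality test with a {{<=}} comparison, which is one possible way to release the wait and is not necessarily what the attached patch does.

```java
// Hypothetical, self-contained simulation of the waitForPos bookkeeping.
// It does not use any Lucene classes.
class WaitForPosDemo {

    /**
     * Walks the position increments and returns the index of the token at
     * which the fragmenter would stop waiting for the span to end, or -1 if
     * it never stops waiting.
     *
     * @param strict true to use (waitForPos == position), as in the issue;
     *               false to use (waitForPos <= position), an assumed fix
     */
    static int releaseIndex(int[] posIncs, int spanStart, int spanEnd, boolean strict) {
        int position = -1;   // same starting value idea: first token lands on 0
        int waitForPos = -1; // -1 means "not waiting"
        for (int i = 0; i < posIncs.length; i++) {
            position += posIncs[i];
            if (waitForPos != -1) {
                boolean reached = strict ? (waitForPos == position)
                                         : (waitForPos <= position);
                if (reached) {
                    return i;
                }
                continue; // still inside the span, no new fragment allowed
            }
            if (position == spanStart) {
                waitForPos = spanEnd + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Illustrative increments: three tokens, a two-token span at
        // positions 5..6, then a token that follows two skipped stop words.
        int[] posIncs = {1, 1, 2, 2, 1, 3, 1};
        System.out.println("strict:  " + releaseIndex(posIncs, 5, 6, true));
        System.out.println("lenient: " + releaseIndex(posIncs, 5, 6, false));
    }
}
```

> With these increments the position goes from 6 straight to 9, so the strict test never sees the value 7 and the wait is never released, while the {{<=}} variant releases it at the first token after the gap.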
> The test case below reproduces the problem with the following Document and the query *"all
tokens"*, where the matched phrase is followed by the words _of a_.
> {panel:title=Document}
> "Attribute instances are reused for *all tokens* _of a_ document. Thus, a TokenStream/-Filter
needs to update the appropriate Attribute(s) in incrementToken(). The consumer, commonly the
Lucene indexer, consumes the data in the Attributes and then calls incrementToken() again
until it retuns false, which indicates that the end of the stream was reached. This means
that in each call of incrementToken() a TokenStream/-Filter can safely overwrite the data
in the Attribute instances."
> {panel}
> {code:title=HighlighterTest.java}
>   public void testSimpleSpanFragmenter() throws Exception {
>     ...
>     doSearching("\"all tokens\"");
>     maxNumFragmentsRequired = 2;
>     
>     scorer = new QueryScorer(query, FIELD_NAME);
>     highlighter = new Highlighter(this, scorer);
>     for (int i = 0; i < hits.totalHits; i++) {
>       String text = searcher.doc(hits.scoreDocs[i].doc).get(FIELD_NAME);
>       TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME, new StringReader(text));
>       highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer, 20));
>       String result = highlighter.getBestFragments(tokenStream, text,
>           maxNumFragmentsRequired, "...");
>       System.out.println("\t" + result);
>     }
>   }
> {code}
> {panel:title=Result}
> are reused for <B>all</B> <B>tokens</B> of a document. Thus,
a TokenStream/-Filter needs to update the appropriate Attribute(s) in incrementToken(). The
consumer, commonly the Lucene indexer, consumes the data in the Attributes and then calls
incrementToken() again until it retuns false, which indicates that the end of the stream was
reached. This means that in each call of incrementToken() a TokenStream/-Filter can safely
overwrite the data in the Attribute instances.
> {panel}
> {panel:title=Expected Result}
> for <B>all</B> <B>tokens</B> of a document
> {panel}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

