Message-ID: <3646177.290471281565158870.JavaMail.jira@thor>
Date: Wed, 11 Aug 2010 18:19:18 -0400 (EDT)
From: "Elmer Garduno (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Subject: [jira] Updated: (LUCENE-2229) SimpleSpanFragmenter fails to start a new fragment
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

     [ https://issues.apache.org/jira/browse/LUCENE-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elmer Garduno updated LUCENE-2229:
----------------------------------

     Original Estimate: 24h  (was: 72h)
    Remaining Estimate: 24h  (was: 72h)

> SimpleSpanFragmenter fails to
start a new fragment
> --------------------------------------------------
>
>                 Key: LUCENE-2229
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2229
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/highlighter
>            Reporter: Elmer Garduno
>            Priority: Minor
>         Attachments: LUCENE-2229.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> SimpleSpanFragmenter fails to identify a new fragment when there is more than one stop word after a span is detected. This problem can be observed when the Query contains a PhraseQuery.
> The problem is that the span extends toward the end of the TokenGroup. The fragmenter sets {{waitForPos = positionSpans.get(i).end + 1;}} and then advances with {{position += posIncAtt.getPositionIncrement();}}. When stop words follow the span, the increment can move {{position}} past {{waitForPos}}, so the check {{(waitForPos == position)}} never matches.
> {code:title=SimpleSpanFragmenter.java}
>   public boolean isNewFragment() {
>     position += posIncAtt.getPositionIncrement();
>     if (waitForPos == position) {
>       waitForPos = -1;
>     } else if (waitForPos != -1) {
>       return false;
>     }
>     WeightedSpanTerm wSpanTerm = queryScorer.getWeightedSpanTerm(termAtt.term());
>     if (wSpanTerm != null) {
>       List<PositionSpan> positionSpans = wSpanTerm.getPositionSpans();
>       for (int i = 0; i < positionSpans.size(); i++) {
>         if (positionSpans.get(i).start == position) {
>           waitForPos = positionSpans.get(i).end + 1;
>           break;
>         }
>       }
>     }
>    ...
> {code}
> An example is provided in the test case, using the following Document and the query *"all tokens"*, which in the Document is followed by the words _of a_.
> {panel:title=Document}
> "Attribute instances are reused for *all tokens* _of a_ document. Thus, a TokenStream/-Filter needs to update the appropriate Attribute(s) in incrementToken(). The consumer, commonly the Lucene indexer, consumes the data in the Attributes and then calls incrementToken() again until it retuns false, which indicates that the end of the stream was reached.
This means that in each call of incrementToken() a TokenStream/-Filter can safely overwrite the data in the Attribute instances."
> {panel}
> {code:title=HighlighterTest.java}
>   public void testSimpleSpanFragmenter() throws Exception {
>     ...
>     doSearching("\"all tokens\"");
>     maxNumFragmentsRequired = 2;
>
>     scorer = new QueryScorer(query, FIELD_NAME);
>     highlighter = new Highlighter(this, scorer);
>     for (int i = 0; i < hits.totalHits; i++) {
>       String text = searcher.doc(hits.scoreDocs[i].doc).get(FIELD_NAME);
>       TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME, new StringReader(text));
>       highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer, 20));
>       String result = highlighter.getBestFragments(tokenStream, text,
>           maxNumFragmentsRequired, "...");
>       System.out.println("\t" + result);
>     }
>   }
> {code}
> {panel:title=Result}
> are reused for all tokens of a document. Thus, a TokenStream/-Filter needs to update the appropriate Attribute(s) in incrementToken(). The consumer, commonly the Lucene indexer, consumes the data in the Attributes and then calls incrementToken() again until it retuns false, which indicates that the end of the stream was reached. This means that in each call of incrementToken() a TokenStream/-Filter can safely overwrite the data in the Attribute instances.
> {panel}
> {panel:title=Expected Result}
> for all tokens of a document
> {panel}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
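
The position bookkeeping described in the report can be reproduced outside Lucene with a small toy model. The sketch below is not Lucene code and is not the attached LUCENE-2229.patch: the class WaitForPosDemo, the method releasePosition, and the "relaxed" comparison (waitForPos <= position) are hypothetical names and choices made for illustration. It assumes that the span end is the inclusive position of the last span token and that two removed stop words show up as a position increment of 3 on the next token.

```java
/** Toy model of the waitForPos bookkeeping in isNewFragment(); not Lucene code. */
class WaitForPosDemo {

    /**
     * Replays a sequence of position increments and returns the position at
     * which the fragmenter stops waiting for the span to end, or -1 if it
     * never stops waiting.
     *
     * @param increments position increment of each token, in stream order
     * @param spanStart  position of the first token of the span
     * @param spanEnd    position of the last token of the span (assumed inclusive)
     * @param strict     true: compare with == as in the reported code;
     *                   false: compare with <= (one possible relaxation)
     */
    static int releasePosition(int[] increments, int spanStart, int spanEnd, boolean strict) {
        int position = -1;   // nothing consumed yet
        int waitForPos = -1; // -1 means "not waiting"
        for (int inc : increments) {
            position += inc;
            if (waitForPos != -1) {
                boolean reached = strict ? (waitForPos == position) : (waitForPos <= position);
                if (reached) {
                    return position; // waiting ends here; new fragments are allowed again
                }
                continue; // still inside the wait window, no new fragment
            }
            if (position == spanStart) {
                waitForPos = spanEnd + 1; // same assignment as in the reported snippet
            }
        }
        return -1; // the wait never ended
    }

    public static void main(String[] args) {
        // "all", "tokens", then "document" with "of" and "a" removed (increment 3).
        int[] increments = {1, 1, 3};
        System.out.println("strict:  " + releasePosition(increments, 0, 1, true));  // strict:  -1
        System.out.println("relaxed: " + releasePosition(increments, 0, 1, false)); // relaxed: 4
    }
}
```

With the strict comparison the position jumps from 1 to 4 and never equals the awaited value 2, which matches the behaviour in the Result panel where the fragment runs to the end of the text. The relaxed comparison ends the wait at the first token at or beyond the awaited position.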