lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (LUCENE-2014) position increment bug: smartcn
Date Thu, 29 Oct 2009 09:12:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771344#action_12771344 ]

Uwe Schindler edited comment on LUCENE-2014 at 10/29/09 9:11 AM:
-----------------------------------------------------------------

bq. I worry about this clearAttributes solution though; perhaps WordTokenFilter should use
the captureState/restoreState API, like the ThaiWordFilter does (very similar analyzer).
bq. If I use capture/restoreState this should not be a problem, right?

I think the filter is fine as it is at the moment. The only problem is the missing clearAttributes()
call when you produce more than one token out of one big one (the sentence). There is no need for
captureState, because the tokens are new ones. If somebody adds custom attributes, they would be
cleared, but wouldn't that be correct?
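
To illustrate why the missing reset matters, here is a minimal sketch in plain Java. It is a toy
model, not Lucene code: the class and method names are hypothetical, and it assumes that a downstream
stop filter adds the skipped positions onto whatever position increment is currently stored in the
shared attribute. It does not try to reproduce the exact value from the failing test.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model, not Lucene code: one shared position-increment value reused for every token. */
class StaleIncrementDemo {

    /**
     * Returns the position increment seen for each non-stopword.
     * clearFirst=true mimics calling clearAttributes() before filling in each new word token.
     */
    static List<Integer> analyze(String[] words, String stopWord, boolean clearFirst) {
        int posIncr = 1;   // shared attribute state, reused across tokens
        int skipped = 0;   // positions swallowed by the stop filter stage
        List<Integer> emitted = new ArrayList<>();
        for (String word : words) {
            // Producer stage: a new word token is written into the shared state.
            if (clearFirst) {
                posIncr = 1; // reset to the default, as clearAttributes() would
            }
            // Stop filter stage (assumed behaviour): skipped positions are added
            // to the value that is currently stored, whatever it is.
            if (word.equals(stopWord)) {
                skipped += posIncr;
                continue;
            }
            posIncr += skipped;
            skipped = 0;
            emitted.add(posIncr);
        }
        return emitted;
    }

    public static void main(String[] args) {
        String[] words = {"titl", ":", "san", ":", "foo"};
        System.out.println(analyze(words, ":", false)); // stale value carried forward
        System.out.println(analyze(words, ":", true));  // reset before each token
    }
}
```

In this model the run without the reset yields increments 1, 2, 4 and the run with the reset yields
1, 2, 2: once a stale value is left in the shared state, every later stopword makes it larger.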

bq. I guess the only advantage would be that it would preserve any custom attributes or payloads
that someone might add after the SentenceTokenizer but before the WordTokenFilter, propagating
them down to the individual words.

Does it make sense to insert a filter between the two? The transition from sentence tokens
to word tokens creates totally different tokens, so how should a payload or other custom attribute
work correctly here? Normally such payload filters should be inserted after the WordTokenFilter.
The problem with capture/restore state is the additional copy cost for nothing (the *long* sentence
token is copied again and again and always reset to the word text).
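
A rough way to see the copy cost, as a plain-Java sketch with hypothetical names: it only counts
characters written into the term buffer, and it assumes that restoring a captured state copies the
full sentence text back before the word text overwrites it. It is not a measurement of Lucene itself.

```java
/** Toy cost model, not Lucene code: counts characters written into the term buffer. */
class CopyCostDemo {

    /** Each word token is built from scratch: only the word text is written. */
    static long clearAndSet(String[] words) {
        long copied = 0;
        for (String word : words) {
            copied += word.length();
        }
        return copied;
    }

    /** The captured sentence state is restored before every word, then overwritten. */
    static long restoreThenSet(String sentence, String[] words) {
        long copied = 0;
        for (String word : words) {
            copied += sentence.length(); // assumed: restore copies the whole sentence text
            copied += word.length();     // then the term is reset to the word text
        }
        return copied;
    }

    public static void main(String[] args) {
        String sentence = "abcdefghij";
        String[] words = {"abc", "defg", "hij"};
        System.out.println(clearAndSet(words));
        System.out.println(restoreThenSet(sentence, words));
    }
}
```

Under these assumptions the restore variant costs roughly (number of words) x (sentence length) extra
characters per sentence, which is the "copy cost for nothing" mentioned above.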

      was (Author: thetaphi):
    bq. I worry about this clearAttributes solution though; perhaps WordTokenFilter should
use the captureState/restoreState API, like the ThaiWordFilter does (very similar analyzer).
bq. If I use capture/restoreState this should not be a problem, right?

I think the filter is fine as it is at the moment. The only problem is the missing clearAttributes()
call when you produce more than one token out of one big one (the sentence). There is no need for
captureState, because the tokens are new ones. If somebody adds custom attributes, they would be
cleared, but wouldn't that be correct?
  
> position increment bug: smartcn
> -------------------------------
>
>                 Key: LUCENE-2014
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2014
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 3.0
>
>         Attachments: LUCENE-2014.patch, LUCENE-2014.patch
>
>
> If I use LUCENE_VERSION >= 2.9 with the smart Chinese analyzer, it will crash IndexWriter with any reasonable amount of Chinese text.
> It's especially annoying because it happens in 2.9.1 RC as well.
> This is because the position increments for tokens after stopwords are bogus.
> Here's an example (from a test case), where the position increment should be 2, but is instead 91975314!
> {code}
>   public void testChineseStopWords2() throws Exception {
>     Analyzer ca = new SmartChineseAnalyzer(Version.LUCENE_CURRENT); /* will load stopwords */
>     String sentence = "Title:San"; // : is a stopword
>     String result[] = { "titl", "san"};
>     int startOffsets[] = { 0, 6 };
>     int endOffsets[] = { 5, 9 };
>     int posIncr[] = { 1, 2 };
>     assertAnalyzesTo(ca, sentence, result, startOffsets, endOffsets, posIncr);
>   }
> {code}
> junit.framework.AssertionFailedError: posIncrement 1 expected:<2> but was:<91975314>
> 	at junit.framework.Assert.fail(Assert.java:47)
> 	at junit.framework.Assert.failNotEquals(Assert.java:280)
> 	at junit.framework.Assert.assertEquals(Assert.java:64)
> 	at junit.framework.Assert.assertEquals(Assert.java:198)
> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:83)
> 	...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

