lucene-dev mailing list archives

From "Steven Rowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4170) TestRandomChains fail with Shingle+CommonGrams
Date Wed, 27 Jun 2012 15:51:43 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13402307#comment-13402307 ]

Steven Rowe commented on LUCENE-4170:
-------------------------------------

bq. I think shingles has a similar bug: it doesn't look at the existing posLength of the input
tokens at all, instead it just fills posLength with the builtGramSize.

I agree.

However, the problem isn't just position length: ShingleFilter has never handled input position
increments of zero, so real graph compatibility will mean fixing that too.
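
To make the two attributes concrete, here is a minimal sketch in plain Java. It is not Lucene code: Tok, startPositions, and spanLength are hypothetical names, and the example tokens are an assumed synonym-style graph in which "wifi" overlaps "wi fi". It only shows why a shingle's position length can differ from its gram size once the input has posincr = 0 or poslength > 1.

```java
import java.util.Arrays;
import java.util.List;

public class TokenGraphPositions {
    // Hypothetical stand-in for a token: term text plus the values that
    // PositionIncrementAttribute and PositionLengthAttribute would carry.
    static final class Tok {
        final String term;
        final int posInc;
        final int posLen;
        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    // Start position of each token: the running sum of position increments.
    // A token with posInc = 0 starts at the same position as the previous one.
    static int[] startPositions(List<Tok> toks) {
        int[] starts = new int[toks.size()];
        int pos = -1; // so that a first token with posInc = 1 lands on position 0
        for (int i = 0; i < toks.size(); i++) {
            pos += toks.get(i).posInc;
            starts[i] = pos;
        }
        return starts;
    }

    // Number of positions covered by a shingle running from token `first`
    // to token `last`: end position of the last token minus start of the first.
    static int spanLength(List<Tok> toks, int first, int last) {
        int[] starts = startPositions(toks);
        return starts[last] + toks.get(last).posLen - starts[first];
    }

    // Assumed example graph: "wifi" spans the same positions as "wi" + "fi".
    static List<Tok> example() {
        return Arrays.asList(
            new Tok("wifi", 1, 2),
            new Tok("wi", 0, 1),
            new Tok("fi", 1, 1),
            new Tok("network", 1, 1));
    }

    public static void main(String[] args) {
        List<Tok> toks = example();
        System.out.println(Arrays.toString(startPositions(toks)));
        // The 2-token shingle "wifi network" has gram size 2 but covers 3 positions.
        System.out.println(spanLength(toks, 0, 3));
    }
}
```

In this sketch, setting the shingle's poslength to the gram size (2) would understate the span of "wifi network" (3), which is the mismatch described in the quoted comment.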

I think Karl Wettin's ShingleMatrixFilter (deprecated in 3.6, dropped in 4.0) is an attempt
to permute all combinations of overlapping (poslength=1) terms to produce shingles.  ShingleMatrixFilter
wouldn't handle poslength > 1, though.

I'm not even sure what token n-gramming should mean over an input graph.  The trivial case
where input tokens' poslength is always one and position increment is always one is obviously
already handled.

I think both issues should be handled, since poslength > 1 will very likely be used with
posincr = 0, e.g. synonyms and kuromoji de-compounding.
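
One possible reading of "shingles over a graph", sketched below in plain Java, is to pair each token only with tokens that start exactly where it ends, and to give the shingle the summed poslength. This is an assumption for illustration, not what ShingleFilter does and not a proposed fix; Tok, bigrams, and the example tokens are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GraphBigrams {
    // Hypothetical token model: term, position increment, position length.
    static final class Tok {
        final String term;
        final int posInc;
        final int posLen;
        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    // Builds 2-token shingles along graph paths: token j may follow token i
    // only if j starts at the position where i ends.
    static List<String> bigrams(List<Tok> toks) {
        int n = toks.size();
        int[] start = new int[n];
        int pos = -1;
        for (int i = 0; i < n; i++) {
            pos += toks.get(i).posInc; // posInc = 0 keeps the previous position
            start[i] = pos;
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            Tok a = toks.get(i);
            int end = start[i] + a.posLen;
            for (int j = 0; j < n; j++) {
                if (start[j] == end) {
                    Tok b = toks.get(j);
                    // The shingle spans both tokens, so its length is the sum.
                    out.add(a.term + " " + b.term + " (posLen=" + (a.posLen + b.posLen) + ")");
                }
            }
        }
        return out;
    }

    // Assumed input: a synonym or de-compounding style graph where the
    // compound "wifi" (poslength 2) overlaps its parts "wi" and "fi".
    static List<Tok> example() {
        return Arrays.asList(
            new Tok("wifi", 1, 2),
            new Tok("wi", 0, 1),
            new Tok("fi", 1, 1),
            new Tok("network", 1, 1));
    }

    public static void main(String[] args) {
        for (String s : bigrams(example())) {
            System.out.println(s);
        }
    }
}
```

Under these assumptions the sketch yields "wifi network", "wi fi", and "fi network", and never pairs "wi" with "network", since those two tokens are not adjacent on any path.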

                
> TestRandomChains fail with Shingle+CommonGrams
> ----------------------------------------------
>
>                 Key: LUCENE-4170
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4170
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Robert Muir
>         Attachments: LUCENE-4170.patch
>
>
> ant test  -Dtestcase=TestRandomChains -Dtests.method=testRandomChains -Dtests.seed=12635ABB4F789F2A -Dtests.multiplier=3 -Dtests.locale=pt -Dtests.timezone=America/Argentina/Salta -Dargs="-Dfile.encoding=ISO8859-1"
> This test has two ShingleFilters, then a CommonGrams filter. I think the posLen impls in
> CommonGrams and/or Shingle have a bug if the input is already a graph.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        


