lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-2841) CommonGramsFilter improvements
Date Thu, 09 May 2013 23:05:59 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2841:
----------------------------------

    Fix Version/s: 4.4 (was: 4.3)

> CommonGramsFilter improvements
> ------------------------------
>
>                 Key: LUCENE-2841
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2841
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.1, 4.0-ALPHA
>            Reporter: Steve Rowe
>            Priority: Minor
>             Fix For: 4.4
>
>         Attachments: commit-6402a55.patch
>
>
> Currently CommonGramsFilter expects users to remove the common words around which output
token ngrams are formed, by appending a StopFilter to the analysis pipeline.  This is inefficient
in two ways: captureState() is called on (trailing) stopwords, and then the whole stream has
to be re-examined by the following StopFilter.
> The current ctor should be deprecated, and another ctor added with a boolean option controlling
whether the common words should be output as unigrams.
> If common words *are* configured to be output as unigrams, captureState() will still
need to be called, as it is now.
> If the common words are *not* configured to be output as unigrams, rather than calling
captureState() for the trailing token in each output token ngram, the term text, position
and offset can be maintained in the same way as they are now for the leading token: using
a System.arraycopy()'d term buffer and a few ints for positionIncrement and offsets.  The
user would then no longer need to append a StopFilter to the analysis chain.
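
The bookkeeping described above could look roughly like the following sketch. It is
plain Java and does not use any Lucene classes; the PendingToken class and its method
names are hypothetical and only illustrate copying the term buffer with
System.arraycopy() and keeping a few ints, instead of capturing the full attribute state.

```java
// Hypothetical holder for the minimal per-token state; not a Lucene class.
class PendingToken {
    private char[] term = new char[16];
    private int termLength;
    private int positionIncrement;
    private int startOffset;
    private int endOffset;

    // Copy the term characters and remember position and offsets.
    // The source buffer may be reused by the caller afterwards.
    void copyFrom(char[] source, int length, int posInc, int start, int end) {
        if (term.length < length) {
            // Grow the private buffer so it can hold the new term.
            term = new char[Math.max(length, term.length * 2)];
        }
        System.arraycopy(source, 0, term, 0, length);
        termLength = length;
        positionIncrement = posInc;
        startOffset = start;
        endOffset = end;
    }

    String termText() {
        return new String(term, 0, termLength);
    }

    int positionIncrement() {
        return positionIncrement;
    }

    int startOffset() {
        return startOffset;
    }

    int endOffset() {
        return endOffset;
    }
}
```

Because the characters are copied, later changes to the shared source buffer do not
affect the saved token, which is the property the filter relies on when it defers output.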
> An example illustrating both possibilities should also be added.
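
As a rough illustration of the two output modes, here is a minimal sketch in plain
Java. It is not the Lucene implementation and ignores positions and offsets; the
CommonGramsSketch class, the commonGrams method and the keepCommonUnigrams flag are
hypothetical names, and the '_' separator is assumed for readability.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the proposed behaviour; not a Lucene class.
class CommonGramsSketch {
    static final char SEPARATOR = '_';

    // Emits each token, plus a bigram for every adjacent pair in which at
    // least one token is a common word. When keepCommonUnigrams is false,
    // common words are emitted only as part of bigrams.
    static List<String> commonGrams(List<String> tokens,
                                    Set<String> commonWords,
                                    boolean keepCommonUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String current = tokens.get(i);
            boolean currentIsCommon = commonWords.contains(current);
            if (!currentIsCommon || keepCommonUnigrams) {
                out.add(current);
            }
            if (i + 1 < tokens.size()) {
                String next = tokens.get(i + 1);
                if (currentIsCommon || commonWords.contains(next)) {
                    out.add(current + SEPARATOR + next);
                }
            }
        }
        return out;
    }
}
```

For the tokens "the rain in spain" with common words {the, in}, calling commonGrams
with keepCommonUnigrams = true yields [the, the_rain, rain, rain_in, in, in_spain, spain],
and with keepCommonUnigrams = false yields [the_rain, rain, rain_in, in_spain, spain],
which is what a trailing StopFilter would otherwise have to produce.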

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


