lucene-dev mailing list archives

From "Clinton Gormley (JIRA)" <j...@apache.org>
Subject [jira] [Created] (LUCENE-4766) Pattern token filter which emits a token for every capturing group
Date Sun, 10 Feb 2013 12:33:12 GMT
Clinton Gormley created LUCENE-4766:
---------------------------------------

             Summary: Pattern token filter which emits a token for every capturing group
                 Key: LUCENE-4766
                 URL: https://issues.apache.org/jira/browse/LUCENE-4766
             Project: Lucene - Core
          Issue Type: New Feature
          Components: modules/analysis
    Affects Versions: 4.1
            Reporter: Clinton Gormley
            Priority: Minor
             Fix For: 4.2


The PatternTokenizer either splits on matches or lets you specify a
single capture group.  This is insufficient for my needs: quite often I want to capture multiple
overlapping tokens in the same position.

I've written a pattern token filter which accepts multiple patterns and emits tokens for every
capturing group that is matched in any pattern.
Patterns are not anchored to the beginning and end of the string, so each pattern can produce
multiple matches.

For instance, a pattern like "(([a-z]+)(\d*))", when matched against "abc123def456", would produce
the tokens:

    abc123, abc, 123, def456, def, 456
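
The group-by-group emission above can be sketched with plain java.util.regex. This is only an illustration of the matching logic, not the attached filter: it does not use the Lucene TokenStream API, and the class and method names (CaptureGroupSketch, captures) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: lists every capturing group of every match.
class CaptureGroupSketch {

    static List<String> captures(String regex, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        // find() is unanchored, so one pattern can match several times in the input.
        while (m.find()) {
            // Group 0 is the whole match; capturing groups start at 1.
            for (int g = 1; g <= m.groupCount(); g++) {
                String text = m.group(g);
                // Skip groups that did not participate or matched the empty string.
                if (text != null && !text.isEmpty()) {
                    tokens.add(text);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [abc123, abc, 123, def456, def, 456]
        System.out.println(captures("(([a-z]+)(\\d*))", "abc123def456"));
    }
}
```

Skipping empty groups is an assumption made here so that "(\d*)" matching nothing does not yield an empty token.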

Multiple patterns can be applied, e.g. these patterns could be used for camelCase analysis:

    "([A-Z]{2,})",
    "(?<![A-Z])([A-Z][a-z]+)",
    "(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)",
    "([0-9]+)"

When matched against the string "letsPartyLIKEits1999_dude", they would produce the tokens:

    lets, Party, LIKE, its, 1999, dude
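
A rough way to reproduce this result outside Lucene is to run each pattern over the string, collect every captured group with its start offset, and sort by offset. The sorting step is an assumption made here to get the order shown above; the names MultiPatternSketch and Capture are hypothetical, and this is a sketch rather than the filter itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: applies several patterns to one string.
class MultiPatternSketch {

    // One captured group and the offset at which it starts in the input.
    static final class Capture {
        final int start;
        final String text;
        Capture(int start, String text) { this.start = start; this.text = text; }
    }

    static List<String> captures(List<Pattern> patterns, String input) {
        List<Capture> found = new ArrayList<>();
        for (Pattern p : patterns) {
            Matcher m = p.matcher(input);
            while (m.find()) {
                for (int g = 1; g <= m.groupCount(); g++) {
                    // start(g) is -1 when the group did not participate in the match.
                    if (m.start(g) >= 0 && m.end(g) > m.start(g)) {
                        found.add(new Capture(m.start(g), m.group(g)));
                    }
                }
            }
        }
        // Assumption: order tokens by start offset; the sort is stable, so ties keep pattern order.
        found.sort(Comparator.comparingInt(c -> c.start));
        List<String> tokens = new ArrayList<>();
        for (Capture c : found) {
            tokens.add(c.text);
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<Pattern> camelCase = Arrays.asList(
            Pattern.compile("([A-Z]{2,})"),
            Pattern.compile("(?<![A-Z])([A-Z][a-z]+)"),
            Pattern.compile("(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)"),
            Pattern.compile("([0-9]+)"));
        // prints [lets, Party, LIKE, its, 1999, dude]
        System.out.println(captures(camelCase, "letsPartyLIKEits1999_dude"));
    }
}
```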

If no token is emitted, the original token is preserved.
If the preserveOriginal flag is true, the filter outputs the full original token (i.e. "letsPartyLIKEits1999_dude")
in addition to any matching tokens. In that case, if a matching token is identical to
the original, only one copy of the full token is emitted.
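
The preserveOriginal rules can be written down as a small decision function. This is a sketch under assumptions: it takes the already-captured tokens as a list, places the original token first (the description does not say where it goes), and uses the hypothetical names PreserveOriginalSketch and emit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Lucene: decides which tokens to output.
class PreserveOriginalSketch {

    static List<String> emit(String original, List<String> captured, boolean preserveOriginal) {
        List<String> out = new ArrayList<>();
        // The original is kept when requested, or as a fallback when nothing matched.
        if (preserveOriginal || captured.isEmpty()) {
            out.add(original);
        }
        for (String token : captured) {
            // With preserveOriginal set, a capture equal to the original is not emitted twice.
            if (preserveOriginal && token.equals(original)) {
                continue;
            }
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [letsParty, lets, Party]
        System.out.println(emit("letsParty", Arrays.asList("lets", "Party"), true));
        // prints [abc]
        System.out.println(emit("abc", Arrays.asList("abc"), true));
        // prints [___]
        System.out.println(emit("___", Collections.<String>emptyList(), false));
    }
}
```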

Multiple patterns are required to allow overlapping captures, but they also mean that each pattern
is less dense and easier to understand.

This is my first Java code, so apologies if I'm doing something stupid.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
