lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <>
Subject [jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group
Date Mon, 11 Feb 2013 11:29:12 GMT


Adrien Grand commented on LUCENE-4766:

bq. I just wonder if we really should restrict our TF to not fix offsets? Kind of an odd thing
though. What should a tokenfilter like this do instead?

I think that for some use cases it makes sense not to fix offsets. In the case of the URL
example ({{(https?://([a-zA-Z\-_0-9.]+))}}), I think it makes sense to highlight the whole
URL (including the leading http(s)://) even if the query term is just {{}}.
On the other hand, it could be weird if the goal was to split a long CamelCase token (letsPartyLIKEits1999_dude),
but maybe this should be done by a Tokenizer rather than a TokenFilter?

(No strong feeling here, I'd just like to see if we can find a way to commit this patch without
having to grow our TokenFilter exclusion list.)
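
To make the offset question concrete, here is a minimal sketch using plain java.util.regex rather than the Lucene TokenStream API. The class OffsetSketch, its Term holder, the fixOffsets flag and the sample URL are all hypothetical and are not taken from the patch. The sketch only contrasts the two behaviours discussed above: either every captured term keeps the offsets of the original token, or each term reports the offsets of its own capturing group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not the filter from the patch.
class OffsetSketch {

    /** One emitted term plus the character offsets a highlighter would use. */
    static final class Term {
        final String text;
        final int start;
        final int end;

        Term(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    /**
     * Emits one term per non-empty capturing group.
     * fixOffsets == false: every term keeps the offsets of the whole input token.
     * fixOffsets == true:  every term gets the offsets of its own group.
     */
    static List<Term> capture(String token, int tokenStart, Pattern pattern, boolean fixOffsets) {
        List<Term> out = new ArrayList<>();
        Matcher m = pattern.matcher(token);
        while (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                // start(g) is -1 when the group did not take part in the match
                if (m.start(g) < 0 || m.start(g) == m.end(g)) {
                    continue;
                }
                int start = fixOffsets ? tokenStart + m.start(g) : tokenStart;
                int end = fixOffsets ? tokenStart + m.end(g) : tokenStart + token.length();
                out.add(new Term(m.group(g), start, end));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Pattern url = Pattern.compile("(https?://([a-zA-Z\\-_0-9.]+))");
        String token = "http://www.example.com"; // sample input, assumed for illustration
        System.out.println("offsets kept:  " + capture(token, 0, url, false));
        System.out.println("offsets fixed: " + capture(token, 0, url, true));
    }
}
```

With offsets kept, a query matching only the host part would still highlight the whole URL. With offsets fixed, the highlight would cover just the host, which is the behaviour one would probably want for the CamelCase case instead.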
> Pattern token filter which emits a token for every capturing group
> ------------------------------------------------------------------
>                 Key: LUCENE-4766
>                 URL:
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.1
>            Reporter: Clinton Gormley
>            Assignee: Simon Willnauer
>            Priority: Minor
>              Labels: analysis, feature, lucene
>             Fix For: 4.2
>         Attachments: LUCENE-4766.patch, LUCENE-4766.patch
> The PatternTokenizer either functions by splitting on matches, or allows you to specify
> a single capture group.  This is insufficient for my needs. Quite often I want to capture
> multiple overlapping tokens in the same position.
> I've written a pattern token filter which accepts multiple patterns and emits tokens
> for every capturing group that is matched in any pattern.
> Patterns are not anchored to the beginning and end of the string, so each pattern can
> produce multiple matches.
> For instance, a pattern like:
> {code}
>     "(([a-z]+)(\d*))"
> {code}
> when matched against: 
> {code}
>     "abc123def456"
> {code}
> would produce the tokens:
> {code}
>     abc123, abc, 123, def456, def, 456
> {code}
> Multiple patterns can be applied, e.g. these patterns could be used for camelCase analysis:
> {code}
>     "([A-Z]{2,})",
>     "(?<![A-Z])([A-Z][a-z]+)",
>     "(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)",
>     "([0-9]+)"
> {code}
> When matched against the string "letsPartyLIKEits1999_dude", they would produce the tokens:
> {code}
>     lets, Party, LIKE, its, 1999, dude
> {code}
> If no token is emitted, the original token is preserved.
> If the preserveOriginal flag is true, it will output the full original token (i.e. "letsPartyLIKEits1999_dude")
> in addition to any matching tokens (but in this case, if a matching token is identical to
> the original, it will only emit one copy of the full token).
> Multiple patterns are required to allow overlapping captures, but this also means that patterns
> are less dense and easier to understand.
> This is my first Java code, so apologies if I'm doing something stupid.
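
The emission rules in the description above can be sketched with plain java.util.regex. This is not the attached patch and does not use the Lucene TokenFilter API. The class CaptureGroupSketch and its emit method are hypothetical names. Ordering the output by start offset is an assumption, made only so that the result matches the token lists given in the description.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the described behaviour; not the code from the patch.
class CaptureGroupSketch {

    /** A captured term together with the position where it starts in the input. */
    private static final class Hit {
        final int start;
        final String text;

        Hit(int start, String text) {
            this.start = start;
            this.text = text;
        }
    }

    static List<String> emit(String token, boolean preserveOriginal, String... regexes) {
        List<Hit> hits = new ArrayList<>();
        for (String regex : regexes) {
            // find() is unanchored, so one pattern can match several times
            Matcher m = Pattern.compile(regex).matcher(token);
            while (m.find()) {
                for (int g = 1; g <= m.groupCount(); g++) {
                    String text = m.group(g);
                    if (text == null || text.isEmpty()) {
                        continue; // group did not participate or captured nothing
                    }
                    if (preserveOriginal && text.equals(token)) {
                        continue; // the full token is emitted once, below
                    }
                    hits.add(new Hit(m.start(g), text));
                }
            }
        }
        // Assumption: order by start offset; the sort is stable, so ties keep capture order.
        hits.sort(Comparator.comparingInt((Hit h) -> h.start));

        List<String> out = new ArrayList<>();
        if (preserveOriginal || hits.isEmpty()) {
            out.add(token); // original kept on request, or when nothing was captured
        }
        for (Hit h : hits) {
            out.add(h.text);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(emit("abc123def456", false, "(([a-z]+)(\\d*))"));
        System.out.println(emit("letsPartyLIKEits1999_dude", false,
                "([A-Z]{2,})",
                "(?<![A-Z])([A-Z][a-z]+)",
                "(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)",
                "([0-9]+)"));
    }
}
```

Note that the patterns are written here as Java string literals, so backslashes are doubled. A real filter would also have to carry positions and offsets on each token, which this sketch leaves out.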
