X-Original-To: apmail-lucene-dev-archive@www.apache.org
Delivered-To: apmail-lucene-dev-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
	by minotaur.apache.org (Postfix) with SMTP id D3966D189
	for ; Sun, 10 Feb 2013 12:41:14 +0000 (UTC)
Received: (qmail 72666 invoked by uid 500); 10 Feb 2013 12:41:13 -0000
Delivered-To: apmail-lucene-dev-archive@lucene.apache.org
Received: (qmail 72602 invoked by uid 500); 10 Feb 2013 12:41:13 -0000
Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Delivered-To: mailing list dev@lucene.apache.org
Received: (qmail 72581 invoked by uid 99); 10 Feb 2013 12:41:13 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28)
	by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 10 Feb 2013 12:41:12 +0000
Date: Sun, 10 Feb 2013 12:41:12 +0000 (UTC)
From: "Clinton Gormley (JIRA)"
To: dev@lucene.apache.org
Subject: [jira] [Updated] (LUCENE-4766) Pattern token filter which emits a token for every capturing group
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

     [ https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Clinton Gormley updated LUCENE-4766:
------------------------------------
    Description:
The PatternTokenizer either functions by splitting on matches, or allows you to specify a single capture group. This is insufficient for my needs. Quite often I want to capture multiple overlapping tokens in the same position.
I've written a pattern token filter which accepts multiple patterns and emits tokens for every capturing group that is matched in any pattern.
Patterns are not anchored to the beginning and end of the string, so each pattern can produce multiple matches.
For instance a pattern like:
{code}
    "(([a-z]+)(\d*))"
{code}
when matched against:
{code}
    "abc123def456"
{code}
would produce the tokens:
{code}
    abc123, abc, 123, def456, def, 456
{code}
Multiple patterns can be applied, eg these patterns could be used for camelCase analysis:
{code}
    "([A-Z]{2,})",
    "(?

> Pattern token filter which emits a token for every capturing group
> ------------------------------------------------------------------
>
>                 Key: LUCENE-4766
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4766
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.1
>            Reporter: Clinton Gormley
>            Priority: Minor
>              Labels: analysis, feature, lucene
>             Fix For: 4.2
>
>         Attachments: LUCENE-4766.patch
>
>
> The PatternTokenizer either functions by splitting on matches, or allows you to specify a single capture group. This is insufficient for my needs. Quite often I want to capture multiple overlapping tokens in the same position.
> I've written a pattern token filter which accepts multiple patterns and emits tokens for every capturing group that is matched in any pattern.
> Patterns are not anchored to the beginning and end of the string, so each pattern can produce multiple matches.
> For instance a pattern like:
> {code}
>     "(([a-z]+)(\d*))"
> {code}
> when matched against:
> {code}
>     "abc123def456"
> {code}
> would produce the tokens:
> {code}
>     abc123, abc, 123, def456, def, 456
> {code}
> Multiple patterns can be applied, eg these patterns could be used for camelCase analysis:
> {code}
>     "([A-Z]{2,})",
>     "(?
>     "(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)",
>     "([0-9]+)"
> {code}
> When matched against the string "letsPartyLIKEits1999_dude", they would produce the tokens:
> {code}
>     lets, Party, LIKE, its, 1999, dude
> {code}
> If no token is emitted, the original token is preserved.
> If the preserveOriginal flag is true, it will output the full original token (ie "letsPartyLIKEits1999_dude") in addition to any matching tokens (but in this case, if a matching token is identical to the original, it will only emit one copy of the full token).
> Multiple patterns are required to allow overlapping captures, but also means that patterns are less dense and easier to understand.
> This is my first Java code, so apologies if I'm doing something stupid.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
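
The capture-group behaviour described in the issue can be illustrated with plain java.util.regex. What follows is only a rough sketch and not the code in LUCENE-4766.patch: the class name CaptureGroupSketch and the method tokens are hypothetical, no Lucene classes are used, and two details are assumptions made for illustration (captures are ordered by start offset to mirror the ordering in the examples, and the original token is placed first when preserveOriginal is set).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the filter attached to the issue.
final class CaptureGroupSketch {

    /** One captured group together with the offset where it starts. */
    private static final class Capture {
        final int start;
        final String text;

        Capture(int start, String text) {
            this.start = start;
            this.text = text;
        }
    }

    /**
     * Returns one token per non-empty capturing group, over every match of
     * every pattern. Patterns are unanchored, so find() is called repeatedly.
     */
    static List<String> tokens(String input, boolean preserveOriginal, String... regexes) {
        List<Capture> captures = new ArrayList<>();
        for (String regex : regexes) {
            Matcher m = Pattern.compile(regex).matcher(input);
            while (m.find()) {
                for (int g = 1; g <= m.groupCount(); g++) {
                    String text = m.group(g);
                    // Skip groups that did not participate or matched nothing.
                    if (text != null && !text.isEmpty()) {
                        captures.add(new Capture(m.start(g), text));
                    }
                }
            }
        }
        // Assumption: order by start offset. List.sort is stable, so groups
        // that start at the same offset keep pattern order, then group order.
        captures.sort(Comparator.comparingInt((Capture c) -> c.start));

        List<String> out = new ArrayList<>();
        // The original token is kept when asked for, or when nothing matched.
        if (preserveOriginal || captures.isEmpty()) {
            out.add(input);
        }
        for (Capture c : captures) {
            // With preserveOriginal, emit only one copy of the full token.
            if (preserveOriginal && c.text.equals(input)) {
                continue;
            }
            out.add(c.text);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [abc123, abc, 123, def456, def, 456]
        System.out.println(tokens("abc123def456", false, "(([a-z]+)(\\d*))"));
    }
}
```

Because the sketch returns plain strings, it says nothing about token positions or offsets, which a real token filter would also have to set.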