Date: Sun, 10 Feb 2013 12:33:12 +0000 (UTC)
From: "Clinton Gormley (JIRA)"
To: dev@lucene.apache.org
Subject: [jira] [Created] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

Clinton Gormley created LUCENE-4766:
---------------------------------------

             Summary: Pattern token filter which emits a token for every capturing group
                 Key: LUCENE-4766
                 URL: https://issues.apache.org/jira/browse/LUCENE-4766
             Project: Lucene - Core
          Issue Type: New Feature
          Components: modules/analysis
    Affects Versions: 4.1
            Reporter: Clinton Gormley
            Priority: Minor
             Fix For: 4.2

The PatternTokenizer either functions by splitting on matches, or allows you to specify a single capture group. This is insufficient for my needs. Quite often I want to capture multiple overlapping tokens in the same position.
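
To make the limitation concrete, here is a minimal sketch of the two existing modes using plain java.util.regex rather than Lucene itself. The class and method names (CaptureModesSketch, splitOnMatches, singleGroup, everyGroup) are hypothetical and exist only for illustration. The third helper emits every capturing group of every match, which is the behavior the description below proposes. Skipping empty groups is an assumption of this sketch, not something the issue states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain java.util.regex sketch; none of these helpers are Lucene APIs.
final class CaptureModesSketch {

    // Mode 1: treat every match as a delimiter and keep the text in between.
    static List<String> splitOnMatches(Pattern delimiter, String input) {
        List<String> tokens = new ArrayList<>();
        for (String piece : delimiter.split(input)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Mode 2: for every match, keep only the one chosen capture group.
    static List<String> singleGroup(Pattern pattern, int group, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            String text = m.group(group);
            if (text != null && !text.isEmpty()) {
                tokens.add(text);
            }
        }
        return tokens;
    }

    // Proposed behavior: for every match, emit every capturing group.
    // Assumption: groups that did not participate or matched "" are skipped.
    static List<String> everyGroup(Pattern pattern, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                String text = m.group(g);
                if (text != null && !text.isEmpty()) {
                    tokens.add(text);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "abc123def456";
        Pattern groups = Pattern.compile("(([a-z]+)(\\d*))");

        // Splitting on digit runs loses the digits entirely.
        System.out.println(splitOnMatches(Pattern.compile("\\d+"), input)); // [abc, def]

        // A single group yields one token per match, never the overlapping ones.
        System.out.println(singleGroup(groups, 1, input)); // [abc123, def456]

        // Every group of every match, in match order.
        System.out.println(everyGroup(groups, input)); // [abc123, abc, 123, def456, def, 456]
    }
}
```

Neither of the first two modes can produce abc123, abc and 123 from the same stretch of input, which is the gap the new filter is meant to fill. The sketch only collects strings; it says nothing about token positions or offsets, which a real token filter would also have to set.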

I've written a pattern token filter which accepts multiple patterns and emits tokens for every capturing group that is matched in any pattern. Patterns are not anchored to the beginning and end of the string, so each pattern can produce multiple matches.

For instance, a pattern like:

    "(([a-z]+)(\d*))"

when matched against "abc123def456" would produce the tokens:

    abc123, abc, 123, def456, def, 456

Multiple patterns can be applied, e.g. these patterns could be used for camelCase analysis: "([A-Z]{2,})", "(?