Date: Tue, 17 Feb 2015 17:12:06 +0530
Subject: Re: URL/Email tokenizer
From: Ravikumar Govindarajan
To: "java-user@lucene.apache.org"

Thanks Ian

What I am currently doing is duplicating the data into 2 different fields
and having my own PerFieldAnalyzerWrapper just like you pointed out

Is there a good way to do this in a single-pass? Like how Bi-Grams or
Common-Grams do…

--
Ravi

On Tue, Feb 17, 2015 at 3:08 PM, Ian Lea wrote:

> Sounds like a job for
> org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper.
>
> --
> Ian.
>
> On Tue, Feb 17, 2015 at 8:51 AM, Ravikumar Govindarajan wrote:
> > We have a requirement in that E-mail addresses need to be added in a
> > tokenized form to one field while untokenized form is added to another
> > field
> >
> > Ex:
> >
> > "I have mailed abc@xyz.com" . It should tokenize as below
> >
> > body = {"I", "have", "mailed", "abc", "xyz", "com"};
> >
> > I also have a body-addr field. Tokenizer needs to extract e-mail
> > addresses from body field and add them as below
> >
> > body-addr = {"abc@xyz.com"}
> >
> > How to achieve this via tokenizer chain?
> >
> > --
> > Ravi
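
For illustration only, the output asked for in the quoted example (a tokenized "body" plus an untokenized "body-addr") can be sketched in plain Java. This does not use any Lucene API and is not a Lucene tokenizer chain. The class name, the method name and both regular expressions are assumptions made for this sketch, and the e-mail pattern is deliberately simplistic. It only shows both token lists being filled in one left-to-right walk over the text instead of analyzing the text twice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
class SinglePassEmailSplit {
    // Simplified e-mail pattern (an assumption; not RFC 5322 compliant).
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+");
    // A "word" here is any run of ASCII letters or digits (also an assumption).
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    // Returns the tokens for both fields, keyed by field name.
    static Map<String, List<String>> tokenize(String text) {
        List<String> body = new ArrayList<>();
        List<String> addr = new ArrayList<>();
        Matcher email = EMAIL.matcher(text);
        int pos = 0;
        while (email.find()) {
            // Text before the address contributes plain word tokens.
            addWords(text.substring(pos, email.start()), body);
            // The whole address goes to body-addr untokenized...
            addr.add(email.group());
            // ...and its parts go to body as ordinary word tokens.
            addWords(email.group(), body);
            pos = email.end();
        }
        // Remaining text after the last address.
        addWords(text.substring(pos), body);

        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("body", body);
        fields.put("body-addr", addr);
        return fields;
    }

    private static void addWords(String segment, List<String> out) {
        Matcher word = WORD.matcher(segment);
        while (word.find()) {
            out.add(word.group());
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I have mailed abc@xyz.com"));
    }
}
```

Inside Lucene, the two lists would still have to be routed to their respective fields, which this sketch does not attempt.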