From: "Igal @ getRailo.org"
Date: Thu, 01 Nov 2012 14:19:10 -0700
To: java-user@lucene.apache.org
Subject: Re: Removing Empty Shingles in Lucene 4
Message-ID: <5092E74E.2060702@getrailo.org>
In-Reply-To: <1EB0C709-7EB8-47B3-8C4E-03281A78F449@gmail.com>

hi Steve,

you are correct. I am using StandardTokenizer. I will look into the WhitespaceTokenizer and hopefully figure it out.

thank you,

Igal

On 11/1/2012 1:24 PM, Steve Rowe wrote:
> Hi Igal,
>
> You didn't say you were using StandardTokenizer, but assuming you are, right now StandardTokenizer throws away punctuation, so no following filters will see them.
>
> If StandardTokenizer were modified to also output currently non-tokenized punctuation as tokens, then you could use a FilteringTokenFilter that removes any shingle containing commas. See [1] and [2] for previous discussions on this topic.
>
> For right now, if you use something like WhitespaceTokenizer, you could have a FilteringTokenFilter to remove shingles with non-final-token commas, and then another filter that strips commas everywhere.
>
> Steve
>
> [1] Mike McCandless's post on LUCENE-3940
>
> [2] dev@l.a.o thread "Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true"
>
> On Nov 1, 2012, at 3:44 PM, Igal @ getRailo.org wrote:
>
>> hi,
>>
>> I'm trying to migrate to Lucene 4.
>>
>> in Lucene 3.5 I extended org.apache.lucene.analysis.FilteringTokenFilter and overrode accept() to remove undesired shingles. in Lucene 4 org.apache.lucene.analysis.FilteringTokenFilter does not exist?
>>
>> I'm trying to achieve two things:
>>
>> 1) remove shingles that have an empty item.
>>
>> 2) remove shingles when the phrase contains a comma, for example:
>>
>> for the phrase: "delicious red apples, green pears, and oranges"
>>
>> I want the following shingles (with a shingle size of 2):
>>
>> "delicious red", "red apples", "green pears", "and oranges"
>> (no "apples green" because there's a comma)
>> (no "pears and" because there's a comma)
>>
>> any ideas?
>>
>> TIA
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
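
For reference, the rule Steve describes for whitespace-split tokens (drop any shingle in which a token other than the last carries a comma, then strip commas from what remains) can be sketched in plain Java. This is only an illustration of the logic under stated assumptions: it does not use any Lucene classes, and the names ShingleSketch and shingles are made up for this example. It also drops shingles that would contain an empty item after commas are stripped.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; no Lucene classes are used here.
// ShingleSketch and shingles(...) are hypothetical names for this sketch.
class ShingleSketch {

    // Builds word shingles of the given size from whitespace-split tokens,
    // skipping shingles that span a comma or contain an empty item.
    static List<String> shingles(String text, int size) {
        List<String> result = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty() || size < 1) {
            return result;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int start = 0; start + size <= tokens.length; start++) {
            boolean keep = true;
            StringBuilder shingle = new StringBuilder();
            for (int i = start; i < start + size; i++) {
                String token = tokens[i];
                boolean isLast = (i == start + size - 1);
                // A comma on any token but the last means the shingle spans a comma.
                if (!isLast && token.contains(",")) {
                    keep = false;
                    break;
                }
                // Strip commas; a token that becomes empty makes an "empty item".
                String stripped = token.replace(",", "");
                if (stripped.isEmpty()) {
                    keep = false;
                    break;
                }
                if (shingle.length() > 0) {
                    shingle.append(' ');
                }
                shingle.append(stripped);
            }
            if (keep) {
                result.add(shingle.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [delicious red, red apples, green pears, and oranges]
        System.out.println(shingles("delicious red apples, green pears, and oranges", 2));
    }
}
```

In an actual Lucene 4 analysis chain the same two steps would presumably live in token filters placed after the shingle filter, as suggested in the reply above; the sketch only shows which shingles survive.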