Date: Wed, 11 May 2011 12:10:19 -0400
Subject: Re: Can I omit ShingleFilter's filler tokens
From: William Koscho
To: java-user@lucene.apache.org

I meant I'm trying for #2, so this should work (I got my numbers mixed up).

Thanks again,
Bill

On 5/11/11, William Koscho wrote:
> #1 is what I'm trying for, so I'll give setEnablePositionIncrements(false)
> a try. Thanks for everyone's help.
>
> Bill
>
> On 5/11/11, Steven A Rowe wrote:
>> Yes, StopFilter.setEnablePositionIncrements(false) will almost certainly
>> get higher throughput than inserting PositionFilter. Like PositionFilter,
>> this will buy you #2 (create shingles as if stopwords were never there),
>> but not #1 (don't create shingles across stopwords).
>>
>>> -----Original Message-----
>>> From: Robert Muir [mailto:rcmuir@gmail.com]
>>> Sent: Wednesday, May 11, 2011 9:02 AM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: Can I omit ShingleFilter's filler tokens
>>>
>>> another idea is to .setEnablePositionIncrements(false) on your
>>> stopfilter.
>>>
>>> On Wed, May 11, 2011 at 8:27 AM, Steven A Rowe wrote:
>>> > Hi Bill,
>>> >
>>> > I can think of two possible interpretations of "removing filler
>>> > tokens":
>>> >
>>> > 1. Don't create shingles across stopwords, e.g. for text "one two
>>> > three four five" and stopword "three", bigrams only, you'd get
>>> > ("one two", "four five"), instead of the current ("one two",
>>> > "two _", "_ four", "four five").
>>> >
>>> > 2. Create shingles as if the stopwords were never there, e.g. for the
>>> > same text and stopword, bigrams only, you'd get ("one two",
>>> > "two four", "four five").
>>> >
>>> > Which one did you have in mind? #2 can be achieved by adding
>>> > PositionFilter after StopFilter and before ShingleFilter. I think #1
>>> > requires ShingleFilter modifications.
>>> >
>>> > Steve
>>> >
>>> >> -----Original Message-----
>>> >> From: William Koscho [mailto:wkoscho@gmail.com]
>>> >> Sent: Wednesday, May 11, 2011 12:05 AM
>>> >> To: java-user@lucene.apache.org
>>> >> Subject: Can I omit ShingleFilter's filler tokens
>>> >>
>>> >> Hi,
>>> >>
>>> >> Can I remove the filler token _ from the n-gram-tokens that are
>>> >> generated by a ShingleFilter?
>>> >>
>>> >> I'm using a chain of filters: ClassicFilter, StopFilter,
>>> >> LowerCaseFilter, and ShingleFilter to create phrase n-grams. The
>>> >> ShingleFilter inserts FILLER_TOKENs in place of the stopwords, but I
>>> >> don't want them.
>>> >>
>>> >> How can I omit the filler tokens?
>>> >>
>>> >> thanks
>>> >> Bill
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
> --
> Sent from my mobile device
>

-- 
Sent from my mobile device
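
To make the two interpretations discussed in this thread concrete, here is a minimal plain-Java sketch that reproduces the three bigram outputs quoted above for the text "one two three four five" with stopword "three". It is only a simulation and does not use Lucene: the class name ShingleSketch, the Mode enum, and the bigrams method are hypothetical names made up for illustration, and the handling of edge cases (leading, trailing, or consecutive stopwords) is an assumption of the sketch, not a description of what ShingleFilter does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Plain-Java simulation of bigram shingling around stopwords. Not Lucene code. */
class ShingleSketch {

    /** How the gap left by a removed stopword is treated when pairing tokens. */
    enum Mode {
        FILLER,      // keep a "_" placeholder in the gap (the output Bill wants to avoid)
        NO_CROSSING, // interpretation #1: never build a bigram across a gap
        CLOSE_GAPS   // interpretation #2: pair tokens as if the stopword were never there
    }

    static List<String> bigrams(List<String> tokens, Set<String> stopwords, Mode mode) {
        // Step 1: mark each stopword position as a gap (null), or drop the
        // position entirely when gaps are to be closed.
        List<String> slots = new ArrayList<>();
        for (String token : tokens) {
            boolean isStop = stopwords.contains(token);
            if (isStop && mode == Mode.CLOSE_GAPS) {
                continue;
            }
            slots.add(isStop ? null : token);
        }

        // Step 2: pair each slot with its right-hand neighbour.
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < slots.size(); i++) {
            String left = slots.get(i);
            String right = slots.get(i + 1);
            if (left == null && right == null) {
                continue; // two adjacent gaps: skipped here (an assumption of this sketch)
            }
            if ((left == null || right == null) && mode == Mode.NO_CROSSING) {
                continue; // interpretation #1: no bigram may touch a gap
            }
            out.add((left == null ? "_" : left) + " " + (right == null ? "_" : right));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> text = List.of("one", "two", "three", "four", "five");
        Set<String> stop = Set.of("three");
        System.out.println(bigrams(text, stop, Mode.FILLER));      // [one two, two _, _ four, four five]
        System.out.println(bigrams(text, stop, Mode.NO_CROSSING)); // [one two, four five]
        System.out.println(bigrams(text, stop, Mode.CLOSE_GAPS));  // [one two, two four, four five]
    }
}
```

CLOSE_GAPS corresponds to what the thread says setEnablePositionIncrements(false) on the StopFilter (or a PositionFilter placed before ShingleFilter) achieves; NO_CROSSING is the case Steve expects to require ShingleFilter modifications.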