From: Mathieu Lecarme
To: java-dev@lucene.apache.org
Subject: Re: shingles and punctuations
Date: Sun, 6 Apr 2008 22:43:25 +0200
Message-Id: <75E15E40-3094-4388-908F-28E8A8B8DD54@garambrogne.net>
In-Reply-To: <4C602DB6-5F9E-43EF-81BC-047FA9F38DA3@apache.org>

I'll use Token flags to mark the first token in a sentence, but how does it work? How are flag collisions avoided? To keep it simple, I'll take 1 as the flag, but what happens if another filter uses the same flag?

M.

On 6 Apr 2008, at 20:13, Grant Ingersoll wrote:

> I think you need sentence detection to take place further upstream.
> Then you could use the Token type or Token flags to indicate
> punctuation, sentences, whatever and we could patch the shingle
> filter to ignore these things, or break and move onto the next one.
>
> -Grant
>
> On Apr 6, 2008, at 7:23 PM, Mathieu Lecarme wrote:
>
>> The new ShingleFilter is very helpful for fetching groups of words,
>> but it doesn't handle punctuation or any other separation.
>> If you feed it multiple sentences, you will get shingles that
>> start in one sentence and end in the next.
>> To avoid that, you can look at token positions: if there is more
>> than one character between a token and the previous one, it should
>> be punctuation (or a typo).
>> Any suggestions for producing only shingles that stay within the
>> same sentence?
>>
>> M.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
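
A minimal sketch of the flag idea discussed in this thread, under stated assumptions. As far as I know, Token flags in Lucene are a plain int bitset (Token.getFlags() / Token.setFlags(int)), and nothing in the library arbitrates which filter owns which bit, so collisions can only be avoided by convention: each cooperating filter claims its own power of two and sets it with a bitwise OR instead of overwriting the whole value. The code below is NOT Lucene API. The Tok class, the FLAG_* constants, and the bigrams() and sample() helpers are hypothetical stand-ins used only to illustrate the bit handling and the "break at sentence start" behaviour a patched shingle filter might have.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceShingles {
    // Hypothetical flag bits: each filter claims a distinct power of two,
    // so several filters can mark the same token without clobbering each other.
    static final int FLAG_SENTENCE_START = 1;      // bit 0, the "1" from the question
    static final int FLAG_OTHER_FILTER = 1 << 1;   // bit 1, owned by some other filter

    // Stand-in for a token that carries an int bitset of flags (assumption, not Lucene's Token).
    static final class Tok {
        final String term;
        int flags;
        Tok(String term) { this.term = term; }
    }

    // Test only our own bit; other bits may be set by other filters.
    static boolean startsSentence(Tok t) {
        return (t.flags & FLAG_SENTENCE_START) != 0;
    }

    // Builds two-word shingles, never pairing a token with the one before it
    // when the token is flagged as the first of a new sentence.
    static List<String> bigrams(List<Tok> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 1; i < tokens.size(); i++) {
            Tok prev = tokens.get(i - 1);
            Tok cur = tokens.get(i);
            if (startsSentence(cur)) {
                continue; // sentence boundary: do not emit a shingle across it
            }
            out.add(prev.term + " " + cur.term);
        }
        return out;
    }

    // Tokens for "hello world. good morning", flagged as an upstream
    // sentence detector might flag them.
    static List<Tok> sample() {
        List<Tok> tokens = new ArrayList<>();
        for (String s : new String[] {"hello", "world", "good", "morning"}) {
            tokens.add(new Tok(s));
        }
        tokens.get(0).flags |= FLAG_SENTENCE_START;
        tokens.get(2).flags |= FLAG_OTHER_FILTER;    // another filter marks the token first
        tokens.get(2).flags |= FLAG_SENTENCE_START;  // OR keeps the other filter's bit
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(bigrams(sample())); // [hello world, good morning]
    }
}
```

Setting the flag with `flags = 1` instead of `flags |= 1` would erase whatever another filter had stored, which is the collision the question is about. If two filters pick the same bit, nothing in this sketch can tell them apart, so the bit assignments have to be agreed on across the whole analysis chain.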