From: "Martin O'Shea"
Subject: RE: Using a Lucene ShingleFilter to extract frequencies of bigrams in Lucene
Date: Fri, 7 Sep 2012 00:46:10 +0100

Thanks for that piece of advice.

I ended up passing my Snowball and Standard analyzers as parameters to ShingleAnalyzerWrappers and processing the outputs via a TermVectorMapper.

It seems to work quite well.

-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com]
Sent: 05 Sep 2012 01:53
To: java-user@lucene.apache.org
Subject: Re: Using a Lucene ShingleFilter to extract frequencies of bigrams in Lucene

On Tue, Sep 4, 2012 at 12:37 PM, Martin O'Shea wrote:
> Does anyone know if this can be used in conjunction with other
> analyzers to return the frequencies of the bigrams or trigrams found, e.g.:
>
> "please divide this please divide sentence into shingles"
>
> Would return 2 for "please divide"?
>
> I'm currently using Lucene 3.0.2 to extract frequencies of unigrams
> from a string using a combination of a TermVectorMapper and
> Standard/Snowball analyzers.
>
> I should add that my strings are built up from a database and then
> indexed by Lucene in memory and are not persisted beyond this. Use of
> other products like Solr is not intended.

The bigrams etc. generated by shingles are terms just like the unigrams. So you can wrap any other analyzer with a ShingleAnalyzerWrapper if you want the shingles.

If you just want to use Lucene's analyzers to tokenize the text and compute within-document frequencies for a one-off purpose, I think indexing and creating term vectors could be overkill: you could just consume the tokens from the Analyzer and make a hashmap or whatever you need...

There are examples in the org.apache.lucene.analysis package javadocs.

--
lucidworks.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
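
The "consume the tokens and make a hashmap" idea from the reply can be sketched without Lucene on the classpath. What follows is a minimal, hypothetical plain-Java illustration, not code from this thread and not Lucene's API: the names ShingleCounts, tokenize and count are made up, and the lowercase-and-split tokenizer is only a crude stand-in for a real Analyzer (no stemming, no stop words). In an actual Lucene setup the tokens would come from the analyzer's token stream instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, NOT part of Lucene: tallies word n-grams ("shingles") in a string.
final class ShingleCounts {

    // Assumption: a crude stand-in for an Analyzer. Lowercases the text and
    // splits on runs of characters that are neither letters nor digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Joins every run of `size` consecutive tokens with a single space and
    // counts how often each joined shingle occurs (insertion order is kept).
    static Map<String, Integer> count(String text, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be >= 1");
        }
        List<String> tokens = tokenize(text);
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (int i = 0; i + size <= tokens.size(); i++) {
            String shingle = String.join(" ", tokens.subList(i, i + size));
            freq.merge(shingle, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        String text = "please divide this please divide sentence into shingles";
        System.out.println(count(text, 2)); // bigram frequencies
        System.out.println(count(text, 3)); // trigram frequencies
    }
}
```

For the example sentence from the original question, count(text, 2) maps "please divide" to 2 and every other bigram to 1, which is the answer the question asked for.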