From: "Martin O'Shea"
Subject: Using a Lucene ShingleFilter to extract frequencies of bigrams in Lucene
Date: Tue, 4 Sep 2012 17:37:16 +0100

A Lucene ShingleFilter can be used to tokenize a string into shingles, or n-grams, of different sizes. For example:

"please divide this sentence into shingles"

becomes the shingles:

"please divide", "divide this", "this sentence", "sentence into", and "into shingles"

Does anyone know if this can be used in conjunction with other analyzers to return the frequencies of the bigrams or trigrams found? For example:

"please divide this please divide sentence into shingles"

would return 2 for "please divide".

I'm currently using Lucene 3.0.2 to extract frequencies of unigrams from a string using a combination of a TermVectorMapper and Standard/Snowball analyzers.

I should add that my strings are built up from a database and then indexed by Lucene in memory; they are not persisted beyond this. Use of other products like Solr is not intended.

Thanks

Mr Morgan.
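
To make the desired output concrete, the counting could be sketched in plain Java as follows. This only illustrates the expected result and is not a Lucene-based solution: `ShingleCounter` and `countShingles` are hypothetical names, and the sketch simply lowercases and splits on whitespace instead of applying the Standard/Snowball analyzers or a ShingleFilter.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: counts word n-grams ("shingles") in a string.
final class ShingleCounter {

    // Maps each shingle of `size` consecutive words to its number of occurrences,
    // keeping the shingles in order of first appearance.
    static Map<String, Integer> countShingles(String text, int size) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        String trimmed = text.trim();
        if (trimmed.isEmpty() || size < 1) {
            return counts;
        }
        // Assumption: lowercase + whitespace split stands in for a real analyzer.
        String[] words = trimmed.toLowerCase().split("\\s+");
        for (int i = 0; i + size <= words.length; i++) {
            // Join `size` consecutive words with single spaces to form one shingle.
            StringBuilder shingle = new StringBuilder(words[i]);
            for (int j = 1; j < size; j++) {
                shingle.append(' ').append(words[i + j]);
            }
            String key = shingle.toString();
            Integer seen = counts.get(key);
            counts.put(key, seen == null ? 1 : seen + 1);
        }
        return counts;
    }
}
```

Calling `ShingleCounter.countShingles("please divide this please divide sentence into shingles", 2)` yields a count of 2 for "please divide" and 1 for every other bigram; passing 3 as the size counts trigrams in the same way.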