From: "Martin O'Shea" <appy74@dsl.pipex.com>
To: <java-user@lucene.apache.org>
References: <006c01cd8abb$8b69b310$a23d1930$@dsl.pipex.com>
 <CAOdYfZXKr7qKou4PsxZt3E=h7wuwWUe3hhDFFCqoV6uQ3Gzb2Q@mail.gmail.com>
In-Reply-To: 
 <CAOdYfZXKr7qKou4PsxZt3E=h7wuwWUe3hhDFFCqoV6uQ3Gzb2Q@mail.gmail.com>
Subject: RE: Using a Lucene ShingleFilter to extract frequencies of bigrams in
 Lucene
Date: Fri, 7 Sep 2012 00:46:10 +0100
Message-ID: <006101cd8c89$cb4705d0$61d51170$@dsl.pipex.com>

Thanks for that piece of advice.

I ended up passing my SnowballAnalyzer and StandardAnalyzer instances as
parameters to ShingleAnalyzerWrappers and processing the output via a
TermVectorMapper.

It seems to work quite well.

-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com]
Sent: 05 Sep 2012 01:53
To: java-user@lucene.apache.org
Subject: Re: Using a Lucene ShingleFilter to extract frequencies of
bigrams in Lucene

On Tue, Sep 4, 2012 at 12:37 PM, Martin O'Shea <appy74@dsl.pipex.com> wrote:
>
> Does anyone know if this can be used in conjunction with other
> analyzers to return the frequencies of the bigrams or trigrams found,
> e.g.:
>
>
>
>     "please divide this please divide sentence into shingles"
>
>
>
> Would return 2 for "please divide"?
>
>
>
> I'm currently using Lucene 3.0.2 to extract frequencies of unigrams
> from a string using a combination of a TermVectorMapper and
> Standard/Snowball analyzers.
>
>
>
> I should add that my strings are built up from a database and then
> indexed by Lucene in memory and are not persisted beyond this. Use of
> other products like Solr is not intended.
>

The bigrams etc generated by shingles are terms just like the unigrams.
So you can wrap any other analyzer with a ShingleAnalyzerWrapper if you
want the shingles.

If you just want to use Lucene's analyzers to tokenize the text and
compute within-document frequencies for a one-off purpose, I think
indexing and creating term vectors could be overkill: you could just
consume the tokens from the Analyzer and make a hashmap or whatever you
need...

There are examples in the org.apache.lucene.analysis package javadocs.
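
For illustration only (this is not code from the thread), the "consume the
tokens and make a hashmap" idea might look roughly like the sketch below. It
assumes a plain whitespace tokenizer as a stand-in for a Lucene Analyzer, so it
needs no Lucene classes; ShingleCounter, tokenize and countShingles are
hypothetical names, and a real analyzer would also handle punctuation, stop
words and stemming.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: counts word n-grams ("shingles").
final class ShingleCounter {

    // Assumption: lower-casing and splitting on whitespace stands in for
    // whatever token stream a real Analyzer would produce.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Slide a window of `size` tokens over the text and count each shingle.
    static Map<String, Integer> countShingles(String text, int size) {
        List<String> tokens = tokenize(text);
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + size <= tokens.size(); i++) {
            String shingle = String.join(" ", tokens.subList(i, i + size));
            counts.merge(shingle, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countShingles(
                "please divide this please divide sentence into shingles", 2);
        System.out.println(counts.get("please divide")); // prints 2
    }
}
```

With the sentence from the original question, "please divide" is counted twice
and every other bigram once. Passing 3 instead of 2 would count trigrams the
same way.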

--
lucidworks.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

