lucene-java-user mailing list archives

From "Martin O'Shea" <app...@dsl.pipex.com>
Subject Using a Lucene ShingleFilter to extract frequencies of bigrams in Lucene
Date Tue, 04 Sep 2012 16:37:16 GMT
A Lucene ShingleFilter can be used to tokenize a string into shingles (word
n-grams) of different sizes, e.g.:

    "please divide this sentence into shingles"

Becomes:

    "please divide", "divide this", "this sentence", "sentence into", and
    "into shingles"

Does anyone know if this can be used in conjunction with other analyzers to
return the frequencies of the bigrams or trigrams found, e.g.:

    "please divide this please divide sentence into shingles"

Would return 2 for "please divide"?
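
To make the expected counts concrete, here is a minimal plain-Java sketch of
the result being asked about. It does not use Lucene at all: ShingleCounter,
shingles and frequencies are hypothetical names, and the text is simply split
on whitespace with no lowercasing, stemming or stop-word removal, so the
output of a real analyzer chain may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: it mimics word-level shingling
// on whitespace-separated tokens and tallies how often each shingle occurs.
final class ShingleCounter {

    // Returns every run of `size` consecutive words, joined by single spaces.
    static List<String> shingles(String text, int size) {
        List<String> result = new ArrayList<String>();
        String trimmed = text.trim();
        if (trimmed.isEmpty() || size < 1) {
            return result;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i + size <= words.length; i++) {
            StringBuilder sb = new StringBuilder(words[i]);
            for (int j = 1; j < size; j++) {
                sb.append(' ').append(words[i + j]);
            }
            result.add(sb.toString());
        }
        return result;
    }

    // Counts occurrences of each shingle, keeping first-seen order.
    static Map<String, Integer> frequencies(String text, int size) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String shingle : shingles(text, size)) {
            Integer old = counts.get(shingle);
            counts.put(shingle, old == null ? 1 : old + 1);
        }
        return counts;
    }
}
```

Under these assumptions, calling
ShingleCounter.frequencies("please divide this please divide sentence into shingles", 2)
maps "please divide" to 2 and every other bigram to 1.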

I'm currently using Lucene 3.0.2 to extract frequencies of unigrams from a
string using a combination of a TermVectorMapper and Standard/Snowball
analyzers.

I should add that my strings are built up from a database and then indexed
by Lucene in memory and are not persisted beyond this. Use of other products
like Solr is not intended.

Thanks

Mr Morgan.