lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Strange behavior of ShingleFilter in Lucene 4.6
Date Wed, 02 Apr 2014 18:40:59 GMT
Did you really mean to shingle twice? ShingleAnalyzerWrapper already
wraps the analyzer with a ShingleFilter, and then your code wraps that
stream in another ShingleFilter.
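
To illustrate how shingling twice produces the repeated words in the output quoted below, here is a minimal sketch in plain Java. It does not use Lucene at all: the `shingle` helper and the `DoubleShingleSketch` class are hypothetical names, and the model is deliberately simplified (fixed shingle size of 2, no position increments, and none of the `_` filler tokens that appear where stop words were removed). It only assumes that the first pass emits unigrams alongside bigrams while the second pass does not, which is consistent with the code in the original message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DoubleShingleSketch {

    // Simplified model of one shingle pass with a fixed shingle size of 2.
    // This is NOT Lucene's implementation; it ignores position increments
    // and filler tokens. When outputUnigrams is true, each input token is
    // emitted as well, immediately before the bigram that starts at it.
    static List<String> shingle(List<String> tokens, boolean outputUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) {
                out.add(tokens.get(i));
            }
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("resting", "comfortably");

        // First pass (modelling the wrapper): unigrams plus bigrams.
        List<String> once = shingle(tokens, true);

        // Second pass (modelling the extra filter): bigrams of the first
        // pass's output, so words from overlapping shingles are repeated.
        List<String> twice = shingle(once, false);
        for (String s : twice) {
            System.out.println(s);
        }
        // prints:
        // resting resting comfortably
        // resting comfortably comfortably

        // A single pass without unigrams yields plain bigrams.
        System.out.println(shingle(tokens, false)); // prints [resting comfortably]
    }
}
```

The two lines printed by the double pass match the first two lines of the reported output, which suggests the fix is to apply only one shingle stage and to disable unigram output on that stage.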

On Wed, Apr 2, 2014 at 1:42 PM, Natalia Connolly
<natalia.v.connolly@gmail.com> wrote:
> Hello,
>
>    I am very confused about what ShingleFilter seems to be doing in Lucene
> 4.6.  What I would like to do is extract all possible bigrams from a
> sentence.  So if the sentence is "This is a dog", I want "This is",
> "is a", "a dog".
>
>     Here is my code:
>
>    StringTokenizer itr = new StringTokenizer(theText, "\n");
>    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
>    ShingleAnalyzerWrapper shingleAnalyzer = new ShingleAnalyzerWrapper(analyzer, 2, 2);
>
>    while (itr.hasMoreTokens()) {
>
>        String theSentence = itr.nextToken();
>        StringReader reader = new StringReader(theSentence);
>        TokenStream tokenStream = shingleAnalyzer.tokenStream("content", reader);
>        ShingleFilter theFilter = new ShingleFilter(tokenStream);
>        theFilter.setOutputUnigrams(false);
>
>        CharTermAttribute charTermAttribute = theFilter.addAttribute(CharTermAttribute.class);
>
>        theFilter.reset();
>
>        while (theFilter.incrementToken()) {
>            System.out.println(charTermAttribute.toString());
>        }
>
>        theFilter.end();
>        theFilter.close();
>    }
>
>
>    What I see in the output is this: suppose the sentence is "resting
> comfortably and in no distress".  I get the following output:
>
> resting resting comfortably
> resting comfortably comfortably
> comfortably comfortably _
> comfortably _ _ distress
> _ distress distress
>
>    So it looks like not only do I not get bigrams, I get spurious 3-grams
> by repeating words.  Could someone please help?
>
>     Thanks much,
>
>     Natalia Connolly
