lucene-dev mailing list archives

From "Steven A Rowe" <sar...@syr.edu>
Subject RE: shingles and punctuations
Date Mon, 07 Apr 2008 12:32:37 GMT
Hi Mathieu,

From the class comment for ShingleFilter:

  This filter handles position increments > 1 by inserting
  filler tokens (tokens with termtext "_"). It does not
  handle a position increment of 0.

You could use this feature by setting (in an upstream filter) the positionIncrement of each
sentence-starting word to be at least as large as the maximum shingle size.  Instead of shingles
that span two sentences, this would result in sentence-ending shingles like ". _" and
sentence-beginning shingles like "_ Word".
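
To make the effect concrete, here is a rough sketch in plain Java. It does not use the Lucene API: the Tok class, markSentenceStarts, expand and shingles are hypothetical stand-ins that only model the behaviour described in the class comment (a position increment > 1 becomes "_" filler tokens). It also assumes sentence-ending punctuation arrives as its own token and emits only fixed-size shingles, so treat it as an illustration rather than what ShingleFilter does internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SentenceShingleSketch {
    static final String FILLER = "_";

    /** Hypothetical stand-in for a token: term text plus position increment. */
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    /** "Upstream filter": give each sentence-starting word an increment of maxShingleSize. */
    static List<Tok> markSentenceStarts(List<String> terms, int maxShingleSize) {
        List<Tok> out = new ArrayList<>();
        boolean sentenceStart = false; // the very first token keeps increment 1
        for (String t : terms) {
            out.add(new Tok(t, sentenceStart ? maxShingleSize : 1));
            // Assumption: ".", "!" and "?" are emitted as separate tokens.
            sentenceStart = t.equals(".") || t.equals("!") || t.equals("?");
        }
        return out;
    }

    /** Turn each position gap into filler tokens, as the class comment describes. */
    static List<String> expand(List<Tok> toks) {
        List<String> out = new ArrayList<>();
        for (Tok t : toks) {
            for (int i = 1; i < t.posInc; i++) {
                out.add(FILLER);
            }
            out.add(t.term);
        }
        return out;
    }

    /** Build all shingles of exactly the given size over the expanded stream. */
    static List<String> shingles(List<Tok> toks, int size) {
        List<String> seq = expand(toks);
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= seq.size(); i++) {
            out.add(String.join(" ", seq.subList(i, i + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("Hello", "world", ".", "Next", "one");
        // prints [Hello world, world ., . _, _ Next, Next one]
        System.out.println(shingles(markSentenceStarts(terms, 2), 2));
    }
}
```

With an increment of N for a maximum shingle size of N, the gap expands to N-1 fillers, so no window of N tokens can contain real words from both sentences.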

Steve

On 04/06/2008 at 1:23 PM, Mathieu Lecarme wrote:
> The new ShingleFilter is very helpful for fetching groups of words, but
> it doesn't handle punctuation or any other separation.
> If you feed it multiple sentences, you will get shingles that
> start in one sentence and end in the next.
> To avoid that, you can use token positions: if there is
> more than one char between a token and the previous one, it should be
> punctuation (or a typo).
> Any suggestions for producing only shingles within the same sentence?
> 
> M.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
> 
>

 

