lucene-dev mailing list archives

From Mathieu Lecarme <math...@garambrogne.net>
Subject Re: shingles and punctuations
Date Tue, 08 Apr 2008 20:06:54 GMT

Setting a flag in a filter is easy:

8<-------------------

package org.apache.lucene.analysis.shingle;

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/**
  * @author Mathieu Lecarme
  *
  */
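public class SentenceCutterFilter extends TokenFilter {
   /** Flag value set on the first token of each segment. */
   public static final int FLAG = 42;
   // Last token seen, used to measure the offset gap to the next one.
   private Token previous = null;

   // Public so the filter can be chained from outside this package.
   public SentenceCutterFilter(TokenStream input) {
     super(input);
   }

   public Token next() throws IOException {
     Token current = input.next();
     if (current == null)
       return null;
     // More than one character between two tokens is taken as punctuation,
     // so the current token starts a new segment. The first token does too.
     if (previous == null
         || (current.startOffset() - previous.endOffset()) > 1)
       current.setFlags(FLAG);
     previous = current;
     return current;
   }
}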

8<-------------------
Using it at the right place is trickier:
8<-------------------

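     String test = "This is a test, a big test";
     // Read from the innermost constructor outwards: tokenize, strip accents,
     // lowercase, flag segment starts, build shingles of up to 3 words,
     // then drop the stop words "is" and "a".
     TokenStream stream =
       new StopFilter(
         new ShingleFilter(
           new SentenceCutterFilter(
             new LowerCaseFilter(
               new ISOLatin1AccentFilter(
                 new StandardTokenizer(new StringReader(test))))), 3),
       new String[]{"is", "a"});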

8<-------------------

I must be too tired, though, because I can't manage to patch the
ShingleFilter to handle the flag.
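
The behaviour I am after can at least be sketched outside Lucene, on plain objects. This is not a patch to ShingleFilter, only a rough illustration: `Tok`, `SENTENCE_START` and `shingles` are hypothetical names, and unlike the real filter it emits only multi-word shingles, no unigrams. A shingle stops growing as soon as the next token carries the flag.

```java
import java.util.ArrayList;
import java.util.List;

final class SentenceShingles {
    // Hypothetical single-bit flag marking the first token of a segment.
    static final int SENTENCE_START = 1;

    /** Minimal stand-in for a Lucene Token: a term plus an int of flags. */
    static final class Tok {
        final String term;
        final int flags;
        Tok(String term, int flags) { this.term = term; this.flags = flags; }
    }

    /** Builds shingles of 2..maxSize words that never run into a flagged token. */
    static List<String> shingles(List<Tok> tokens, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder sb = new StringBuilder(tokens.get(start).term);
            for (int size = 2; size <= maxSize; size++) {
                int next = start + size - 1;
                if (next >= tokens.size()) break;
                // A flagged token opens a new segment, so stop extending here.
                if ((tokens.get(next).flags & SENTENCE_START) != 0) break;
                sb.append(' ').append(tokens.get(next).term);
                out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "this is a test, a big test": the comma flags the second "a".
        List<Tok> tokens = List.of(
            new Tok("this", SENTENCE_START), new Tok("is", 0),
            new Tok("a", 0), new Tok("test", 0),
            new Tok("a", SENTENCE_START), new Tok("big", 0),
            new Tok("test", 0));
        for (String s : shingles(tokens, 3)) {
            System.out.println(s);
        }
    }
}
```

With the test sentence above, no shingle such as "test a" or "a test a" is produced, which is the whole point.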
I guess the flag should be a single bit, tested with a mask.
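
To make the bit-and-mask idea concrete, here is a small sketch in plain Java with no Lucene dependency. The bit assignments and the names `SENTENCE_START`, `OTHER_FILTER`, `mark` and `isMarked` are assumptions for illustration; nothing in Lucene reserves these bits. Setting with OR and testing with AND lets two filters share the same int without overwriting each other, as long as each one owns a distinct bit.

```java
final class FlagMaskDemo {
    // Hypothetical bit assignments: each filter owns exactly one bit.
    static final int SENTENCE_START = 1 << 0;
    static final int OTHER_FILTER = 1 << 1;

    /** Sets one bit and leaves every other bit untouched. */
    static int mark(int flags, int bit) {
        return flags | bit;
    }

    /** Tests one bit through a mask, ignoring every other bit. */
    static boolean isMarked(int flags, int bit) {
        return (flags & bit) != 0;
    }

    public static void main(String[] args) {
        int flags = 0;
        flags = mark(flags, OTHER_FILTER);    // set by some other filter
        flags = mark(flags, SENTENCE_START);  // set by the sentence cutter
        System.out.println(isMarked(flags, SENTENCE_START)); // true
        System.out.println(isMarked(flags, OTHER_FILTER));   // true
        // 42 is 0b101010, so it occupies three bits rather than one.
        System.out.println(Integer.bitCount(42));            // 3
    }
}
```

On a Token this would presumably mean setFlags(getFlags() | bit) rather than a bare setFlags(bit), which replaces whatever an earlier filter stored.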

M.



On 6 Apr 2008, at 22:53, Grant Ingersoll wrote:
> For now, it's up to your app to know, unfortunately :-(  I think the
> WikipediaTokenizer is currently the only one using flags in Lucene.
>
>
> On Apr 6, 2008, at 10:43 PM, Mathieu Lecarme wrote:
>
>> I'll use Token flags to mark the first token in a sentence, but how
>> does it work? How are flag collisions avoided? To keep it simple, I'll
>> take 1 as the flag, but what happens if another filter uses the same
>> flag?
>>
>> M.
>>
>> On 6 Apr 2008, at 20:13, Grant Ingersoll wrote:
>>> I think you need sentence detection to take place further  
>>> upstream.  Then you could use the Token type or Token flags to  
>>> indicate punctuation, sentences, whatever and we could patch the  
>>> shingle filter to ignore these things, or break and move onto the  
>>> next one.
>>>
>>> -Grant
>>>
>>> On Apr 6, 2008, at 7:23 PM, Mathieu Lecarme wrote:
>>>
>>>> The new ShingleFilter is very helpful for fetching groups of words,
>>>> but it doesn't handle punctuation or any other separator.
>>>> If you feed it multiple sentences, you get shingles that
>>>> start in one sentence and end in the next.
>>>> To avoid that, you can look at token offsets: if there is more
>>>> than one character between a token and the previous one, the gap
>>>> should be punctuation (or a typo).
>>>> Any suggestions for keeping shingles within a single sentence?
>>>>
>>>> M.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>

