Setting a flag in a filter is easy:
8<-------------------
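package org.apache.lucene.analysis.shingle;

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/**
 * Flags the first token of each sentence. A token counts as a sentence
 * start when it is the first token of the stream, or when more than one
 * character separates it from the previous token.
 *
 * @author Mathieu Lecarme
 */
public class SentenceCutterFilter extends TokenFilter {

    public static final int FLAG = 42;

    // Last token returned, kept to measure the offset gap.
    private Token previous = null;

    // Public so that the filter can be chained from other packages.
    public SentenceCutterFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        Token current = input.next();
        if (current == null)
            return null;
        // A gap wider than one character is treated as punctuation.
        if (previous == null
                || (current.startOffset() - previous.endOffset()) > 1)
            current.setFlags(FLAG);
        previous = current;
        return current;
    }
}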
8<-------------------
and using it at the right place is tricky:
8<-------------------
String test = "This is a test, a big test";
TokenStream stream =
    new StopFilter(
        new ShingleFilter(
            new SentenceCutterFilter(
                new LowerCaseFilter(
                    new ISOLatin1AccentFilter(
                        new StandardTokenizer(new StringReader(test))))),
            3),
        new String[]{"is", "a"});
8<-------------------
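To illustrate why the placement matters, here is a small stand-alone sketch (plain Java, no Lucene; `Tok` and `FilterOrderDemo` are made-up names, and the offsets are the ones I would expect for the lower-cased test string once the tokenizer has dropped the comma). It applies the same offset-gap test once on the full token list and once after the stop words are removed, assuming a stop filter leaves the offsets of the surviving tokens unchanged:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Stand-alone illustration; Tok and FilterOrderDemo are hypothetical names,
// not Lucene classes.
class FilterOrderDemo {

    /** Minimal stand-in for a token: text plus character offsets. */
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Same heuristic as SentenceCutterFilter: first token, or offset gap > 1. */
    static List<String> sentenceStarts(List<Tok> tokens) {
        List<String> starts = new ArrayList<String>();
        Tok previous = null;
        for (Tok current : tokens) {
            if (previous == null || current.start - previous.end > 1) {
                starts.add(current.text + "@" + current.start);
            }
            previous = current;
        }
        return starts;
    }

    /** Drops stop words but keeps the offsets of the remaining tokens (assumption). */
    static List<Tok> removeStopWords(List<Tok> tokens, Set<String> stopWords) {
        List<Tok> kept = new ArrayList<Tok>();
        for (Tok t : tokens) {
            if (!stopWords.contains(t.text)) {
                kept.add(t);
            }
        }
        return kept;
    }

    /** Tokens of "this is a test, a big test" with assumed offsets. */
    static List<Tok> sample() {
        return Arrays.asList(
            new Tok("this", 0, 4), new Tok("is", 5, 7), new Tok("a", 8, 9),
            new Tok("test", 10, 14), new Tok("a", 16, 17),
            new Tok("big", 18, 21), new Tok("test", 22, 26));
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<String>(Arrays.asList("is", "a"));
        // Gap detection on the full list.
        System.out.println(sentenceStarts(sample()));
        // Gap detection after stop-word removal: removed words leave gaps.
        System.out.println(sentenceStarts(removeStopWords(sample(), stop)));
    }
}
```

With the full list, only "this" and the "a" after the comma are reported. Once "is" and "a" are gone, "test" and "big" look like sentence starts too, which is one reason the cutter has to sit before the StopFilter.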
I must be too tired, though: I can't manage to patch the ShingleFilter
to handle the flag.
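What I am after can be sketched outside Lucene. The following is only an illustration of the intended behaviour, not a patch to the real ShingleFilter: `SentenceShingles` and its `shingles` method are hypothetical, the sentence-start marks are passed as a plain boolean array instead of Token flags, and only multi-word shingles are produced (single words are left out).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; it does not use or patch the real ShingleFilter.
class SentenceShingles {

    /**
     * Builds word n-grams of 2..maxSize words that never cross a sentence start.
     * sentenceStart[i] is true when words[i] opens a new sentence.
     */
    static List<String> shingles(String[] words, boolean[] sentenceStart, int maxSize) {
        List<String> result = new ArrayList<String>();
        for (int i = 0; i < words.length; i++) {
            StringBuilder shingle = new StringBuilder(words[i]);
            for (int j = i + 1; j < words.length && j - i < maxSize; j++) {
                if (sentenceStart[j]) {
                    break; // the next word belongs to another sentence
                }
                shingle.append(' ').append(words[j]);
                result.add(shingle.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] words = {"this", "is", "a", "test", "a", "big", "test"};
        boolean[] starts = {true, false, false, false, true, false, false};
        System.out.println(shingles(words, starts, 3));
    }
}
```

For the test string this yields no shingle such as "test a" or "a test a", because the inner loop stops as soon as it meets a token marked as a sentence start.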
I guess the flag should be a single bit, tested with a mask.
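A minimal sketch of that idea, in plain Java. The class `TokenFlags` and both constants are hypothetical and the bit values are arbitrary; nothing here is an agreed Lucene convention, so collisions between filters would still have to be avoided by hand.

```java
// Hypothetical flag constants; the bit values are arbitrary choices for this sketch.
class TokenFlags {
    static final int SENTENCE_START = 1 << 0;
    static final int OTHER_FILTER_FLAG = 1 << 1; // stands in for another filter's bit

    /** Sets one bit and leaves the bits owned by other filters untouched. */
    static int set(int flags, int mask) {
        return flags | mask;
    }

    /** Tests one bit regardless of the others. */
    static boolean isSet(int flags, int mask) {
        return (flags & mask) != 0;
    }

    public static void main(String[] args) {
        int flags = OTHER_FILTER_FLAG;       // set upstream by someone else
        flags = set(flags, SENTENCE_START);  // our filter adds its own bit
        System.out.println(isSet(flags, SENTENCE_START));
        System.out.println(isSet(flags, OTHER_FILTER_FLAG));
    }
}
```

In the filter above, that would mean OR-ing the bit into the existing flags rather than overwriting them with setFlags(FLAG), and having the consumer test the bit with a mask instead of comparing the whole value.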
M.
On 6 Apr 08, at 22:53, Grant Ingersoll wrote:
> For now, it's up to your app to know, unfortunately :-( I think the
> WikipediaTokenizer is the only one using flags currently in
> Lucene.
>
>
> On Apr 6, 2008, at 10:43 PM, Mathieu Lecarme wrote:
>
>> I'll use Token flags to specify the first token in a sentence, but how
>> does it work? How is flag collision avoided? To keep it simple, I'll
>> take 1 as the flag, but what happens if another filter uses the same
>> flag?
>>
>> M.
>>
>> On 6 Apr 08, at 20:13, Grant Ingersoll wrote:
>>> I think you need sentence detection to take place further
>>> upstream. Then you could use the Token type or Token flags to
>>> indicate punctuation, sentences, whatever and we could patch the
>>> shingle filter to ignore these things, or break and move onto the
>>> next one.
>>>
>>> -Grant
>>>
>>> On Apr 6, 2008, at 7:23 PM, Mathieu Lecarme wrote:
>>>
>>>> The new ShingleFilter is very helpful for fetching groups of words,
>>>> but it doesn't handle punctuation or any other separator.
>>>> If you feed it multiple sentences, you will get shingles that
>>>> start in one sentence and end in the next.
>>>> To avoid that, you can look at token positions: if there
>>>> is more than one character between a token and the previous one,
>>>> it should be punctuation (or a typo).
>>>> Any suggestions for keeping shingles within a single sentence?
>>>>
>>>> M.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>