lucene-java-user mailing list archives

From "Igal @" <>
Subject Re: using CharFilter to inject a space
Date Sun, 04 Nov 2012 03:35:47 GMT
well, my main goal is to use a ShingleFilter that only emits shingles 
whose words are not separated by commas or similar punctuation.

for example, the phrase:

     "red apples, green tomatoes, and brown potatoes"

should yield the shingles "red apples", "green tomatoes", "and brown", 
"brown potatoes"; but not "apples green" and not "tomatoes and" as those 
are separated by commas.

the problem with the common tokenizers is that they get rid of the 
commas so if I use a ShingleFilter after them there's no way to tell if 
there was a comma there or not.

(another option I'm considering is to add an Attribute that records 
whether there was a comma before or after a token)
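
To make the per-token flag idea concrete, here is a minimal plain-Java sketch. It deliberately does not use Lucene: the `Token` class, its `afterComma` field, and the `tokenize` and `shingles` methods are all hypothetical names invented for illustration, standing in for a tokenizer that sets a custom Attribute and a filter that reads it. It assumes tokens are runs of letters or digits and that only commas break a shingle.

```java
import java.util.ArrayList;
import java.util.List;

public class CommaAwareShingles {

    // Hypothetical stand-in for a token carrying a custom attribute.
    public static final class Token {
        public final String text;
        public final boolean afterComma; // true if a comma sits between the previous token and this one

        public Token(String text, boolean afterComma) {
            this.text = text;
            this.afterComma = afterComma;
        }
    }

    // Splits on anything that is not a letter or digit, but remembers commas.
    public static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        boolean commaPending = false;      // a comma was seen since the last token ended
        boolean currentAfterComma = false; // flag for the word being built
        for (int i = 0; i <= input.length(); i++) {
            // A trailing space acts as a sentinel that flushes the last word.
            char c = i < input.length() ? input.charAt(i) : ' ';
            if (Character.isLetterOrDigit(c)) {
                if (word.length() == 0) {
                    currentAfterComma = commaPending;
                    commaPending = false;
                }
                word.append(c);
            } else {
                if (word.length() > 0) {
                    tokens.add(new Token(word.toString(), currentAfterComma));
                    word.setLength(0);
                }
                if (c == ',') {
                    commaPending = true;
                }
            }
        }
        return tokens;
    }

    // Builds two-word shingles, skipping any pair that straddles a comma.
    public static List<String> shingles(String input) {
        List<String> out = new ArrayList<>();
        Token prev = null;
        for (Token t : tokenize(input)) {
            if (prev != null && !t.afterComma) {
                out.add(prev.text + " " + t.text);
            }
            prev = t;
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [red apples, green tomatoes, and brown, brown potatoes]
        System.out.println(shingles("red apples, green tomatoes, and brown potatoes"));
    }
}
```

In a real analysis chain the flag would have to be set by the tokenizer (or by a CharFilter that keeps the comma visible to it), since a downstream filter never sees characters the tokenizer has already dropped.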

if there's a better way -- I'm open to suggestions,


On 11/3/2012 8:10 PM, Erick Erickson wrote:
> So I've gotta ask... _why_ do you want to inject the spaces?
> If it's just to break this up into tokens,  wouldn't something like
> LetterTokenizer do? Assuming you aren't interested in
> leaving in numbers.... Or even StandardTokenizer unless you have
> e-mail & etc.
> Or what about PatternReplaceCharFilter?
> Erick
> On Sat, Nov 3, 2012 at 9:22 PM, Igal Sapir <> wrote:
>> You're right.  I'm not sure what I was thinking.
>> Thanks for all your help,
>> Igal
>>   On Nov 3, 2012 5:44 PM, "Robert Muir" <> wrote:
>>> On Sat, Nov 3, 2012 at 8:32 PM, Igal @ <> wrote:
>>>> hi Robert,
>>>> thank you for your replies.
>>>> I couldn't find much documentation/examples of this, but this is what I
>>>> came up with (below).  is that the way I'm supposed to use the
>>>> MappingCharFilter?
>>> You don't need to extend anything.
>>> You also don't want to create a NormalizeCharMap for each reader
>>> (that's way too heavy)
>>> Just build the NormalizeCharMap once, and pass it to
>>> MappingCharFilter's constructor.
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> For additional commands, e-mail:
