lucene-java-user mailing list archives

From "Igal @ getRailo.org" <i...@getrailo.org>
Subject Re: using CharFilter to inject a space
Date Sun, 04 Nov 2012 03:35:47 GMT
well, my main goal is to use a ShingleFilter that only produces
shingles whose words are not separated by commas or similar punctuation.

for example, the phrase:

     "red apples, green tomatoes, and brown potatoes"

should yield the shingles "red apples", "green tomatoes", "and brown", 
"brown potatoes"; but not "apples green" and not "tomatoes and" as those 
are separated by commas.
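
To make the target behavior concrete, here is a minimal plain-Java sketch of the expected output. It does not use any Lucene classes; the class name `CommaAwareShingles` and the method `shingles` are made up for illustration, and it assumes that commas are the only separators that matter and that words are whitespace-delimited.

```java
import java.util.ArrayList;
import java.util.List;

public class CommaAwareShingles {

    // Build two-word shingles, never pairing words across a comma.
    static List<String> shingles(String text) {
        List<String> out = new ArrayList<>();
        // Split on commas first, so each segment is shingled on its own.
        for (String segment : text.split(",")) {
            String trimmed = segment.trim();
            if (trimmed.isEmpty()) {
                continue; // nothing between two commas
            }
            String[] words = trimmed.split("\\s+");
            // Pair each word with its right-hand neighbor inside the segment.
            for (int i = 0; i + 1 < words.length; i++) {
                out.add(words[i] + " " + words[i + 1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String phrase = "red apples, green tomatoes, and brown potatoes";
        // Print one shingle per line to avoid confusion with the commas in the input.
        for (String shingle : shingles(phrase)) {
            System.out.println(shingle);
        }
    }
}
```

For the phrase above this yields "red apples", "green tomatoes", "and brown" and "brown potatoes", and nothing that spans a comma.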

the problem with the common tokenizers is that they get rid of the
commas, so a ShingleFilter placed after them has no way to tell whether
a comma separated two tokens or not.

(another option I'm considering is to add an Attribute that records
whether there was a comma before or after a token)
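
The idea of flagging tokens can be modeled in plain Java as a rough sketch. This is not a Lucene Attribute implementation: the `Token` class, its `afterComma` flag, and the `tokenize` and `shingles` methods are all hypothetical names, and the tokenizer here simply treats any run of letters or digits as a word.

```java
import java.util.ArrayList;
import java.util.List;

public class CommaFlagShingles {

    // Hypothetical token type: the word plus a flag for a comma dropped just before it.
    static final class Token {
        final String text;
        final boolean afterComma;

        Token(String text, boolean afterComma) {
            this.text = text;
            this.afterComma = afterComma;
        }
    }

    // Tokenize on anything that is not a letter or digit, remembering dropped commas.
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        boolean sawComma = false; // a comma was seen since the last emitted token
        for (int i = 0; i <= input.length(); i++) {
            // A trailing space acts as a sentinel that flushes the last word.
            char c = i < input.length() ? input.charAt(i) : ' ';
            if (Character.isLetterOrDigit(c)) {
                word.append(c);
                continue;
            }
            if (word.length() > 0) {
                tokens.add(new Token(word.toString(), sawComma));
                word.setLength(0);
                sawComma = false;
            }
            if (c == ',') {
                sawComma = true;
            }
        }
        return tokens;
    }

    // Emit two-word shingles, skipping any pair whose second token follows a comma.
    static List<String> shingles(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            Token next = tokens.get(i + 1);
            if (next.afterComma) {
                continue; // a comma sat between the two words
            }
            out.add(tokens.get(i).text + " " + next.text);
        }
        return out;
    }

    public static void main(String[] args) {
        String phrase = "red apples, green tomatoes, and brown potatoes";
        for (String shingle : shingles(tokenize(phrase))) {
            System.out.println(shingle);
        }
    }
}
```

In a real analysis chain the flag would presumably be set by the tokenizer (or a filter right after it) and read by a shingle-producing filter; the sketch only shows that one "comma before" bit per token is enough to reject the unwanted pairs.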

if there's a better way, I'm open to suggestions.


Igal


On 11/3/2012 8:10 PM, Erick Erickson wrote:
> So I've gotta ask... _why_ do you want to inject the spaces?
> If it's just to break this up into tokens,  wouldn't something like
> LetterTokenizer do? Assuming you aren't interested in
> leaving in numbers.... Or even StandardTokenizer unless you have
> e-mail & etc.
>
> Or what about PatternReplaceCharFilter?
>
> FWIW,
> Erick
>
>
>
> On Sat, Nov 3, 2012 at 9:22 PM, Igal Sapir <igal@getrailo.org> wrote:
>
>> You're right.  I'm not sure what I was thinking.
>>
>> Thanks for all your help,
>>
>> Igal
>>   On Nov 3, 2012 5:44 PM, "Robert Muir" <rcmuir@gmail.com> wrote:
>>
>>> On Sat, Nov 3, 2012 at 8:32 PM, Igal @ getRailo.org <igal@getrailo.org>
>>> wrote:
>>>> hi Robert,
>>>>
>>>> thank you for your replies.
>>>>
>>>> I couldn't find much documentation/examples of this, but this is what I
>>>> came up with (below).  is that the way I'm supposed to use the
>>>> MappingCharFilter?
>>> You don't need to extend anything.
>>> You also don't want to create a NormalizeCharMap for each reader
>>> (that's way too heavy).
>>>
>>> Just build the NormalizeCharMap once, and pass it to
>>> MappingCharFilter's constructor.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
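
Robert's suggestion quoted above might look roughly like the following. This is a sketch under assumptions: it targets the Lucene 4.x API (`NormalizeCharMap.Builder` and the `MappingCharFilter(NormalizeCharMap, Reader)` constructor) and needs the analyzers-common jar on the classpath. The class name `CommaSpacing`, the helper methods, and the `","` to `" , "` mapping are illustrative choices, not something taken from the thread.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

public class CommaSpacing {

    // Built a single time and shared by every reader that gets wrapped.
    private static final NormalizeCharMap COMMA_MAP;

    static {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        // Assumed mapping: surround each comma with spaces so it can survive as its own token.
        builder.add(",", " , ");
        COMMA_MAP = builder.build();
    }

    // Cheap per-reader step: only the filter is created, the map is reused.
    static Reader wrap(Reader in) {
        return new MappingCharFilter(COMMA_MAP, in);
    }

    // Helper for the demo: drain a reader into a String.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        int n;
        while ((n = reader.read(buf, 0, buf.length)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Reader filtered = wrap(new StringReader("red apples, green tomatoes"));
        System.out.println(readAll(filtered));
    }
}
```

Inside an Analyzer, the `wrap` call would typically go in an `initReader` override. Whether the comma then survives as a token still depends on the tokenizer: a whitespace-based one keeps it, while one that discards punctuation drops it again.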

