lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: using CharFilter to inject a space
Date Sun, 04 Nov 2012 13:06:09 GMT
Ahh, I don't know of a better way. I can imagine complex solutions
involving something akin to WordDelimiterFilter... and I can imagine that
that would be ridiculously expensive to maintain when there are really
simple solutions like you're looking at.

Mostly I was curious about your use-case....

Erick


On Sat, Nov 3, 2012 at 11:35 PM, Igal @ getRailo.org <igal@getrailo.org> wrote:

> well, my main goal is to use a ShingleFilter that will only take shingles
> that are not separated by commas etc.
>
> for example, the phrase:
>
>     "red apples, green tomatoes, and brown potatoes"
>
> should yield the shingles "red apples", "green tomatoes", "and brown",
> "brown potatoes"; but not "apples green" and not "tomatoes and" as those
> are separated by commas.
>
> the problem with the common tokenizers is that they get rid of the commas
> so if I use a ShingleFilter after them there's no way to tell if there was
> a comma there or not.
>
> (another option I consider is to add an Attribute to specify if there was
> a comma before or after a token)
>
> if there's a better way -- I'm open to suggestions,
>
>
> Igal
>
>
>
> On 11/3/2012 8:10 PM, Erick Erickson wrote:
>
>> So I've gotta ask... _why_ do you want to inject the spaces?
>> If it's just to break this up into tokens,  wouldn't something like
>> LetterTokenizer do? Assuming you aren't interested in
>> leaving in numbers.... Or even StandardTokenizer unless you have
>> e-mail & etc.
>>
>> Or what about PatternReplaceCharFilter?
>>
>> FWIW,
>> Erick
>>
>>
>>
>> On Sat, Nov 3, 2012 at 9:22 PM, Igal Sapir <igal@getrailo.org> wrote:
>>
>>> You're right.  I'm not sure what I was thinking.
>>>
>>> Thanks for all your help,
>>>
>>> Igal
>>>   On Nov 3, 2012 5:44 PM, "Robert Muir" <rcmuir@gmail.com> wrote:
>>>
>>>> On Sat, Nov 3, 2012 at 8:32 PM, Igal @ getRailo.org <igal@getrailo.org> wrote:
>>>>
>>>>> hi Robert,
>>>>>
>>>>> thank you for your replies.
>>>>>
>>>>> I couldn't find much documentation/examples of this, but this is what I
>>>>> came up with (below).  is that the way I'm supposed to use the
>>>>> MappingCharFilter?
>>>>
>>>> You don't need to extend anything.
>>>> You also don't want to create a NormalizeCharMap for each reader
>>>> (that's way too heavy).
>>>>
>>>> Just build the NormalizeCharMap once, and pass it to
>>>> MappingCharFilter's constructor.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>
>
>
>
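
For anyone reading this thread later, here is a minimal sketch of the
"attribute" idea Igal mentions: record on each token whether a comma
preceded it, then build two-word shingles only where no comma intervenes.
It is plain Java, not Lucene code. The class CommaAwareShingles, the Token
type and its commaBefore flag are hypothetical names invented for this
illustration; a real Lucene version would presumably carry the flag in a
custom Attribute on the TokenStream, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

class CommaAwareShingles {
    // Hypothetical token type: the term plus a flag recording whether a
    // comma appeared between the previous token and this one.
    static final class Token {
        final String term;
        final boolean commaBefore;

        Token(String term, boolean commaBefore) {
            this.term = term;
            this.commaBefore = commaBefore;
        }
    }

    // Splits on anything that is not a letter or digit, remembering commas.
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean pendingComma = false;     // comma seen since the last token ended
        boolean commaForCurrent = false;  // flag for the token being built
        for (int i = 0; i <= text.length(); i++) {
            // A trailing space is simulated so the last token gets flushed.
            char c = i < text.length() ? text.charAt(i) : ' ';
            if (Character.isLetterOrDigit(c)) {
                if (current.length() == 0) {
                    commaForCurrent = pendingComma;
                    pendingComma = false;
                }
                current.append(c);
            } else {
                if (current.length() > 0) {
                    tokens.add(new Token(current.toString(), commaForCurrent));
                    current.setLength(0);
                }
                if (c == ',') {
                    pendingComma = true;
                }
            }
        }
        return tokens;
    }

    // Emits two-word shingles, skipping pairs that straddle a comma.
    static List<String> bigrams(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 1; i < tokens.size(); i++) {
            if (!tokens.get(i).commaBefore) {
                out.add(tokens.get(i - 1).term + " " + tokens.get(i).term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String phrase = "red apples, green tomatoes, and brown potatoes";
        // Prints: [red apples, green tomatoes, and brown, brown potatoes]
        System.out.println(bigrams(tokenize(phrase)));
    }
}
```

On the example phrase from the thread this yields "red apples", "green
tomatoes", "and brown" and "brown potatoes", and leaves out "apples green"
and "tomatoes and".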

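
The space-injection idea from the subject line, together with Robert's
advice to build the mapping once and reuse it, can be sketched the same
way. This is a plain-Java stand-in and does not call Lucene: the shared
MAPPING constant plays the role that a NormalizeCharMap would play, and
applyMapping stands in for what a MappingCharFilter would do to the
character stream. CommaSpacing, MAPPING, applyMapping and whitespaceTokens
are all hypothetical names for this illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class CommaSpacing {
    // Built once and shared by every call; nothing is rebuilt per input.
    private static final Map<Character, String> MAPPING =
            Collections.singletonMap(',', " , ");

    // Rewrites the text character by character using the shared mapping.
    static String applyMapping(String text) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String replacement = MAPPING.get(c);
            out.append(replacement != null ? replacement : String.valueOf(c));
        }
        return out.toString();
    }

    // After the mapping, a whitespace split keeps the comma as its own token.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(applyMapping(text).trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(whitespaceTokens("red apples, green tomatoes"));
    }
}
```

Because the comma survives as a separate token, a later stage can see
where the boundaries were instead of having them dropped by the tokenizer.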