lucene-solr-user mailing list archives

From Emir Arnautović <emir.arnauto...@sematext.com>
Subject Re: Strip out punctuation at the end of token
Date Mon, 27 Nov 2017 09:45:23 GMT
Hi Sergio,
Is this the only case that needs “special” handling? If you are only after matching phone
numbers then you need to think about both false negatives and false positives. E.g. if you
go with only WDFF you will end up with an ‘008’ token. That means that you will also return
this doc for any query like XXXXX-008, which is not expected behaviour. I guess you will
need to do a bit of regex to clean up the number and, as Erick explained, you need to focus
on the tokens that will end up in the index and make sure the right tokens are produced
for different queries.
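
For the regex part, one way is a PatternReplaceFilterFactory ahead of WDFF. A
minimal sketch, assuming a whitespace tokenizer (the fieldType name and the rest
of the chain are illustrative, not your actual schema):

```xml
<fieldType name="text_phone" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- strip punctuation from the end of each token: "61149-008." -> "61149-008" -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="[.,;:!?]+$" replacement=""/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateNumberParts="1" catenateNumbers="1" preserveOriginal="1"/>
  </analyzer>
</fieldType>
```

With the dot stripped before WDFF, preserveOriginal="1" should emit "61149-008"
alongside "61149", "008" and "61149008".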

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 24 Nov 2017, at 19:35, Erick Erickson <erickerickson@gmail.com> wrote:
> 
> You need to play with the (many) parameters for WordDelimiterFilterFactory.
> 
> For instance, you have preserveOriginal set to 1. That's what's
> generating the token with the dot.
> 
> You have catenateAll and catenateNumbers set to zero. That means that
> someone searching for 61149008 won't get a hit.
> 
> The fact that the dot is in the tokens generated doesn't really matter
> as long as the query tokens produced will match.
> 
> I think you're getting a bit off track by focusing on the hyphen and
> dot, you're only seeing them in the index at all since you have
> preserveOriginal set to 1. Let's say that you set preserveOriginal to
> 0 and catenateNumbers to 1. Then you'd get:
> 61149
> 008
> 61149008
> 
> in your index. No dots, no hyphens.
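> 
> In schema terms, that combination is something like (a sketch; the other
> WDFF parameters are left at their defaults):
> 
> ```xml
> <filter class="solr.WordDelimiterFilterFactory"
>         generateNumberParts="1"
>         catenateNumbers="1"
>         preserveOriginal="0"/>
> ```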
> 
> Note your _query_ analysis also has catenateNumbers as 1 and
> preserveOriginal as 0. The user searches for
> 61149-008
> 
> and the emitted tokens are in the index and you're OK. The user
> searches for 61149008 and gets a hit there too. The dot is irrelevant.
> 
> Now, all that said, if that isn't comfortable you could certainly add
> PatternReplaceFilterFactory, but really WDFF is designed for this kind
> of thing. I think you'll be just fine if you play with the options
> enough to understand the nuances, which can be tricky, I'll admit.
> 
> 
> Best,
> Erick
> 
> On Fri, Nov 24, 2017 at 7:13 AM, Sergio García Maroto
> <marotosg@gmail.com> wrote:
>> Yes. You are right. I understand now.
>> Let me explain my issue a bit better with the exact problem i have.
>> 
>> I have this text "Information number  61149-008."
>> Using the tokenizers and filters described previously i get this list of
>> tokens.
>> information
>> number
>> 61149-008.
>> 61149
>> 008
>> 
>> Basically the last token "61149-008." gets tokenized as
>> 61149-008.
>> 61149
>> 008
>> User is searching for "61149-008" without dot, so this is not a match.
>> I don't want to change the tokenization on the query to avoid altering the
>> matches for other cases.
>> 
>> I would like to delete the dot at the end. Basically generate this extra
>> token
>> information
>> number
>> 61149-008.
>> 61149
>> 008
>> 61149-008
>> 
>> Not sure if what I am saying makes sense or if there is another way to do
>> this right.
>> 
>> Thanks a lot
>> Sergio
>> 
>> 
>> On 24 November 2017 at 15:31, Shawn Heisey <apache@elyograg.org> wrote:
>> 
>>> On 11/24/2017 2:32 AM, marotosg wrote:
>>> 
>>>> Hi Shawn.
>>>> Thanks for your reply. Actually my issue is with the last token of a
>>>> string: it keeps the dot.
>>>> 
>>>> In your case "Testing. This is a test. Test."
>>>> 
>>>> keeps the "Test."
>>>> 
>>>> Is there any reason I can't see for that behaviour?
>>>> 
>>> 
>>> I am really not sure what you're saying here.
>>> 
>>> Every token is duplicated, one has the dot and one doesn't.  This is what
>>> you wanted based on what I read in your initial email.
>>> 
>>> Making a guess as to what you're asking about this time: If you're
>>> noticing that there isn't a "Test" as the last token on the line for WDF,
>>> then I have to tell you that it actually is there, the display was simply
>>> too wide for the browser window. Scrolling horizontally would be required
>>> to see the whole thing.
>>> 
>>> Thanks,
>>> Shawn
>>> 
>>> 

