lucene-solr-user mailing list archives

From Alexandre Rafalovitch <arafa...@gmail.com>
Subject Re: Possible to define a field so that substring-search is always used?
Date Wed, 25 Jul 2018 15:01:12 GMT
> I think n-grams sounds like the only way to get this done.

You don't have to settle for "the only way". You can totally have the
same field(s) copyFielded into multiple locations and then have each
target field use a different indexing pipeline, including ngrams,
phonetic processing, full match with/without "@domain" part, etc.
Then, with eDismax multi-field searches and/or boost queries, you can
give a higher boost to the copies with the least processing and a
lower boost to the less-precise, more-inclusive matches.
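
As a rough sketch (field and type names here are illustrative, not
from the original message), the schema side of that could look like:

```xml
<!-- Hypothetical schema sketch: copyField one source field into
     several target fields, each with its own analysis chain. -->
<field name="email"          type="string"        indexed="true" stored="true"/>
<field name="email_ngram"    type="text_ngram"    indexed="true" stored="false"/>
<field name="email_phonetic" type="text_phonetic" indexed="true" stored="false"/>

<copyField source="email" dest="email_ngram"/>
<copyField source="email" dest="email_phonetic"/>
```

and an eDismax request could then weight the least-processed copy
highest, e.g.:

```
q=chr&defType=edismax&qf=email^10 email_phonetic^2 email_ngram^1
```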

Regards,
   Alex.

On 25 July 2018 at 09:23, Christopher Schultz
<chris@christopherschultz.net> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> Chris,
>
> On 7/24/18 4:46 PM, Chris Hostetter wrote:
>>
>> : We are using Solr as a user index, and users have email addresses.
>> :
>> : Our old search behavior used a SQL substring match for any search
>> : terms entered, and so users are used to being able to search for
>> : e.g. "chr" and finding my email address
>> : ("chris@christopherschultz.net").
>> :
>> : By default, Solr doesn't perform substring matches, and it might be
>> : difficult to re-train users to use *chr* to find email addresses by
>> : substring.
>>
>> In the past, were you really doing arbitrary substring matching, or
>> just prefix matching? I.e., would a search for "sto" match
>> "chris@christopherschultz.net"?
>
> Yes. Searching for "sto" would result in a SQL query with a " WHERE
> ... LIKE '%sto%'" clause. So it was slow as hell, of course.
>
>> Personally, if you know you have an email field, I would suggest
>> using a custom tokenizer that splits on "@" and "." (and maybe
>> other punctuation characters like "-") and then take your raw user
>> input and feed it to the prefix parser (instead of requiring your
>> users to add the "*")...
>>
>> q={!prefix f=email v=$user_input}&user_input=chr
>>
>> ...which would match chris@gmail.com, foo@chris.com, foo@bar.chr
>> etc.
>>
>> (this wouldn't help you though if you *really* want arbitrary
>> substring matching -- as erick suggested ngrams is pretty much your
>> best bet for something like that)
>>
>> Bear in mind, you can combine that "forced prefix" query against
>> the (tokenized) email field with other queries that could parse
>> your input in other ways...
>>
>> user_input=... q=({!prefix f=email v=$user_input} OR {!dismax
>> qf="first_name last_name" ..etc.. v=$user_input})
>>
>> so if your user input is "chris" you'll get term matches on the
>> first_name field, or the last_name field as well as prefix matches
>> on the email field.
>
> The problem is that our users (admins) sometimes need to locate users
> by their email address, and people often forget the exact spelling. So
> they'll call and say "I can't get in" and we have to search for "chris
> schultz" and then "chris" and then it turns out that their email
> address was actually sexylover42@yahoo.com, so they often have to try
> a bunch of searches before finding the right user record. Having to
> search for "sexylover42", a complete-match word, isn't going to work
> for their use-case. They need to be able to search for "lover" and
> have it work. I think n-grams sounds like the only way to get this
> done. I'll have to play around with it a little bit to see how it
> behaves.
>
> Thanks,
> - -chris
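
For reference, a minimal n-gram field type along the lines discussed
might look like the following (the type name and gram sizes are
assumptions, not taken from the thread; a larger maxGramSize grows the
index considerably):

```xml
<!-- Hypothetical n-gram field type: an index-time NGramFilter lets a
     query term like "lover" match inside "sexylover42@yahoo.com".
     Note the filter is applied only at index time; queries are left
     whole so they match against the stored grams. -->
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```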
