lucene-java-user mailing list archives

From "Becker, Thomas" <Thomas.Bec...@netapp.com>
Subject RE: Partial word match using n-grams
Date Tue, 30 Jul 2013 14:01:27 GMT
Just to close the loop on this, I upgraded to 4.4 and the improvements to the NGramTokenizer
were just what I needed.  I switched to using 1-2 grams (the default), and now that the tokenizer
emits the tokens in an order that makes sense I'm in business.  At search time I split on
whitespace, ngram the results and AND them together.   So matching quota_tommy with quo tom
works as expected.  The ngram improvements are much appreciated!
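
A minimal sketch of the search-time scheme described above, in plain Python rather than the Lucene API. It assumes the tokenizer emits grams ordered by start offset and then by length, and that each query word has to match as a contiguous run of grams (in the spirit of the PhraseQuery approach mentioned elsewhere in this thread). The helper names ngrams, contains_run and matches are hypothetical.

```python
def ngrams(text, min_n=1, max_n=2):
    """Return n-grams ordered by start offset, then by length (assumed order)."""
    grams = []
    for start in range(len(text)):
        for n in range(min_n, max_n + 1):
            if start + n <= len(text):
                grams.append(text[start:start + n])
    return grams


def contains_run(haystack, needle):
    """True if the list `needle` occurs as a contiguous run inside `haystack`."""
    if not needle:
        return False
    last_start = len(haystack) - len(needle)
    return any(haystack[i:i + len(needle)] == needle
               for i in range(last_start + 1))


def matches(name, query):
    """Split the query on whitespace, n-gram each word, and AND the results."""
    indexed = ngrams(name)
    words = query.split()
    return bool(words) and all(contains_run(indexed, ngrams(w)) for w in words)


print(matches("quota_tommy", "quo tom"))   # True
print(matches("quota_tommy", "quota to"))  # True: "to" now yields grams
print(matches("quota_tommy", "quo xyz"))   # False
```

With 1-2 grams, a two-letter query word such as "to" produces the grams t, to, o, so it no longer falls through the way it did with trigrams only.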


-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Friday, July 19, 2013 2:42 PM
To: java-user
Subject: Re: Partial word match using n-grams

Well, it depends on what you put between your tokenizer and ngram filter. Putting WordDelimiterFilterFactory
there would break up on the underscore (and lots of other things besides) and submit the separate
tokens, which would then be n-grammed separately. That has other implications, of course, but
you get the idea....
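
A rough sketch of that "split first, then n-gram each part" idea in plain Python. The regular expression below is only a stand-in that splits on non-alphanumeric characters; the real WordDelimiterFilterFactory does considerably more, so treat this as an illustration under that assumption. The function names are hypothetical.

```python
import re


def trigrams(token):
    """All 3-character substrings of a single token, left to right."""
    return [token[i:i + 3] for i in range(len(token) - 2)]


def split_then_ngram(text):
    """Split on runs of non-alphanumerics, then trigram each part separately."""
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", text) if p]
    return [(part, trigrams(part)) for part in parts]


for part, grams in split_then_ngram("quota_tommy"):
    print(part, grams)
```

One implication visible in the sketch: grams that straddle the delimiter (ta_, a_t, _to) are no longer produced, so a query containing the underscore would need the same treatment at search time.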

There are a zillion possibilities here in terms of combining various filterFactories....

Best
Erick

On Fri, Jul 19, 2013 at 9:06 AM, Becker, Thomas <Thomas.Becker@netapp.com> wrote:
> Sorry, at indexing time it's not broken on anything.  In other words quota_tommy yields
> these tokens: "quo uot ota ta_ a_t _to tom omm mmy".  I've thought about trying to determine
> boundaries and breaking on them at indexing time, but that will require some more thought.
> It doesn't have to be an underscore, that's only one possible convention.
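
The token list above can be reproduced with a few lines of plain Python (a sketch, not the Lucene tokenizer; the function name is hypothetical):

```python
def trigrams(text):
    """All 3-character substrings of `text`, in left-to-right order."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


print(" ".join(trigrams("quota_tommy")))
# prints: quo uot ota ta_ a_t _to tom omm mmy

print(trigrams("to"))
# prints: []
```

The second call shows why a two-letter query word cannot match a trigram-only index: it yields no grams at all.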
>
> -----Original Message-----
> From: Shai Erera [mailto:serera@gmail.com]
> Sent: Friday, July 19, 2013 8:53 AM
> To: java-user@lucene.apache.org
> Subject: Re: Partial word match using n-grams
>
> Wait, I didn't mean to pad the entire string. If the string is broken on _ already, then
> NGramFilter already receives the individual terms and you can put a Filter in front that will
> pass through a padded token?
>
> Shai
>
>
> On Fri, Jul 19, 2013 at 3:45 PM, Becker, Thomas <Thomas.Becker@netapp.com> wrote:
>
>> In general the data for this field is that simple, but additional 
>> characters are allowed beyond [a-z_].  Do I need to tokenize on whitespace?
>>  I really don't know.  Essentially, the question is whether we expect 
>> "quota tom" to match quota_tom or not.  I spoke to some colleagues 
>> and they thought it should since both "quota" and "tom" are partial 
>> matches that would AND together.  Tokenizing the entire input, 
>> whitespace and all, precludes this match.  I'd appreciate some input 
>> from anyone on what the best user experience would be here; I'm 
>> trying to operate on the principle of least surprise ;)
>>
>> With regard to the padding suggestion, I'm still not sure this will work,
>> because again at indexing time there is typically no whitespace.  So 
>> padding "quota_tommy_1234" to "##quota_tommy_1234##" before 
>> trigramming is not going to produce the "to#" token that I would need
>> in order for "quota to" to match.
>>
>> -----Original Message-----
>> From: Allison, Timothy B. [mailto:tallison@mitre.org]
>> Sent: Friday, July 19, 2013 7:58 AM
>> To: java-user@lucene.apache.org
>> Subject: RE: Partial word match using n-grams
>>
>> Got it...almost.
>>
>> Y. You're right. FuzzyQuery is not at all what you want.
>>
>> Don't know if your data is actually as simple as this example.  Do you
>> need to tokenize on whitespace?   Would it make sense to replace spaces in
>> the query with underscores and then trigramify the whole query as if 
>> it were a single term?
>>
>> ________________________________________
>> From: Becker, Thomas [Thomas.Becker@netapp.com]
>> Sent: Thursday, July 18, 2013 8:59 PM
>> To: java-user@lucene.apache.org
>> Subject: RE: Partial word match using n-grams
>>
>> Thanks for the reply Tim.  I really should have been clearer.  Let's 
>> say I have an object named "quota_tommy_1234".  I'd like to match 
>> that object with any 3 character (or more) substring of that name.  So for example:
>>
>> quo
>> tom
>> 234
>> quota
>> etc.
>>
>> Further, at search time I'm splitting input on whitespace before 
>> tokenizing into PhraseQueries and then ANDing them together.  So 
>> using the example above I also want the following queries to match:
>>
>> quo tom
>> quo 234
>> quota to <- this is the problem because there are no trigrams of "to"
>>
>> That said, in response to your points:
>>
>> 1)  Not sure FuzzyQuery is what I need; I'm not trying to match via 
>> misspellings, which is the main function of FuzzyQuery, is it not?
>>
>> 2) The original names are all going to be > 3 characters, so there 
>> are no 1 or 2 letter terms at indexing time.  So generating the bigram "to"
>> at search time will never match anything, unless I switch to bigrams 
>> at indexing time also, which is what I'm asking about.
>>
>> 3)  Again the names are all > 3 characters so I don't need to pad at 
>> indexing time.
>>
>> 4) Hopefully my explanation above clarifies.
>>
>> I should point out that I'm a Lucene novice and am not at all sure 
>> that what I'm doing is optimal.  But I have been impressed with how 
>> easy it is to get something working very quickly!
>>
>> ________________________________________
>> From: Allison, Timothy B. [tallison@mitre.org]
>> Sent: Thursday, July 18, 2013 7:49 PM
>> To: java-user@lucene.apache.org
>> Subject: RE: Partial word match using n-grams
>>
>> Tommy,
>>   I'm sure that I don't fully understand your use case and your data.
>>  Some thoughts:
>>
>> 1) I assume that fuzzy term search (edit distance <= 2) isn't meeting 
>> your needs or else you wouldn't have gone the ngram route.  If fuzzy 
>> term search + phrase/proximity search would meet your needs, see if
>> ComplexPhraseQueryParser would work (although it looks like you're 
>> already building your own queries).
>>
>> 2) Would it make sense to modify NGramFilter so that it outputs a 
>> bigram for a two letter term and a unigram for a one letter term?
>> Might be messy...and "ab" in this scenario would never match "abc"
>>
>> 3) Would it make sense to pad your terms behind the scenes with 
>> "##"...this would add bloat, but not nearly as much as variable gram 
>> sizes with 1 <= n <= 3
>>
>> ab -> ##ab## yields trigrams ##a, #ab, ab#, b##
>>
>> 4) How partial and what types of partial do you need?  This is 
>> related to 1).  If minimum edit distance is sufficient, use it, 
>> especially with the blazing fast automaton (thank you, Robert Muir). 
>> If you have a smallish dataset you might consider allowing leading 
>> wildcards so that you could easily find all words, for example, 
>> containing abc with *abc*.  If your dataset is larger, you might 
>> consider something like ReversedWildcardFilterFactory (Solr) to speed
>> up this type of matching.
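
The padding idea in point 3 can be sketched in plain Python as follows. This is an illustration only, assuming a fixed "##" pad on both sides; it is not an existing Lucene filter, and the function name is hypothetical.

```python
def padded_trigrams(term, pad="##"):
    """Pad `term` on both sides, then return all 3-character substrings."""
    padded = pad + term + pad
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


print(padded_trigrams("ab"))
# prints: ['##a', '#ab', 'ab#', 'b##']

# Padding a query word "to" produces "to#", but padding a whole indexed
# name only pads its two ends, so no "to#" gram exists on the index side.
print("to#" in padded_trigrams("to"))                # True
print("to#" in padded_trigrams("quota_tommy_1234"))  # False
```

The last two lines illustrate the limitation raised elsewhere in this thread: padding helps short terms that are indexed on their own, not short fragments in the middle of a longer unbroken name.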
>>
>> I look forward to other opinions from the list.
>>
>> -----Original Message-----
>> From: Becker, Thomas [mailto:Thomas.Becker@netapp.com]
>> Sent: Thursday, July 18, 2013 3:55 PM
>> To: java-user@lucene.apache.org
>> Subject: Partial word match using n-grams
>>
>> One of our main use-cases for search is to find objects based on 
>> partial name matches.  I've implemented this using n-grams and it 
>> works pretty well.  However we're currently using trigrams and that 
>> causes an interesting problem when searching for things like "abc ab"
>> since we first split on whitespace and then construct PhraseQuerys 
>> containing each trigram yielded by the "word".  Obviously we cannot 
>> get a trigram out of "ab".  So our choices would seem to be either 
>> discard this part of the search term which seems unwise, or to reduce 
>> the minimum n-gram size.  But I'm slightly concerned about the 
>> resulting bloat in the number of Terms, both stored in the index and 
>> contained in queries.  Is this something I should be concerned about?
>> It just "feels" like a query for the word "abcdef"
>> shouldn't require a PhraseQuery of 15 terms (assuming n-gram sizes 1 to 3).
>> Is this the best way to do partial word matches?  Thanks in advance.
>>
>> -Tommy
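
The figure of 15 terms follows from simple counting: a word of length L has L - n + 1 grams of size n (or none if n > L). A small Python sketch of that arithmetic, with a hypothetical helper name:

```python
def ngram_count(length, min_n, max_n):
    """Number of n-gram tokens a word of `length` characters produces."""
    return sum(max(0, length - n + 1) for n in range(min_n, max_n + 1))


print(ngram_count(6, 1, 3))  # "abcdef" with sizes 1..3: 6 + 5 + 4 = 15
print(ngram_count(6, 3, 3))  # "abcdef" with trigrams only: 4
print(ngram_count(2, 3, 3))  # "ab" with trigrams only: 0
```

These are token counts per word; the number of distinct terms stored in the index depends on how often grams repeat across the data, which this sketch does not model.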
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org