lucene-java-user mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: Partial word match using n-grams
Date Fri, 19 Jul 2013 12:52:41 GMT
Wait, I didn't mean padding the entire string. If the string is already
split on _, then NGramFilter receives the individual terms, so could you
put a Filter in front of it that passes through a padded version of each token?
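
A minimal sketch of that idea in plain Java, with no Lucene dependency: split the name on `_`, pad each term with `##`, then take trigrams of each padded term. The class and method names are hypothetical, and padding the query term only at the front is an assumption of this sketch, not something settled in the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not a Lucene TokenFilter.
final class PaddedTrigrams {
    static final String PAD = "##";

    // Split a name on '_' (the delimiter used in this thread's examples).
    static List<String> splitTerms(String name) {
        List<String> terms = new ArrayList<>();
        for (String t : name.split("_")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // All contiguous 3-character substrings of s, in order.
    static List<String> trigrams(String s) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    // Index side: pad every term individually, then trigram it.
    static List<String> indexGrams(String name) {
        List<String> out = new ArrayList<>();
        for (String term : splitTerms(name)) {
            out.addAll(trigrams(PAD + term + PAD));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> indexed = indexGrams("quota_tommy_1234");
        System.out.println(indexed);
        // Query side (assumption): pad the short word at the front only.
        System.out.println(indexed.containsAll(trigrams(PAD + "to")));
    }
}
```

With this split, the front-padded query term `##to` yields `##t` and `#to`, both of which are among the grams of `tommy`. A fully padded `##to##` would also produce `to#`, which the index does not contain.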

Shai


On Fri, Jul 19, 2013 at 3:45 PM, Becker, Thomas <Thomas.Becker@netapp.com> wrote:

> In general the data for this field is that simple, but additional
> characters are allowed beyond [a-z_].  Do I need to tokenize on whitespace?
>  I really don't know.  Essentially, the question is whether we expect
> "quota tom" to match quota_tom or not.  I spoke to some colleagues and they
> thought it should since both "quota" and "tom" are partial matches that
> would AND together.  Tokenizing the entire input, whitespace and all,
> precludes this match.  I'd appreciate some input from anyone on what the
> best user experience would be here; I'm trying to operate on the principle
> of least surprise ;)
>
> With regard to the padding suggestion, I'm still not sure this will work,
> because at indexing time there is typically no whitespace.  Padding
> "quota_tommy_1234" to "##quota_tommy_1234##" before trigramming is not
> going to produce the to# token that I would need in order for "quota to"
> to match.
>
> -----Original Message-----
> From: Allison, Timothy B. [mailto:tallison@mitre.org]
> Sent: Friday, July 19, 2013 7:58 AM
> To: java-user@lucene.apache.org
> Subject: RE: Partial word match using n-grams
>
> Got it...almost.
>
> Yes. You're right. FuzzyQuery is not at all what you want.
>
> Don't know if your data is actually as simple as this example.  Do you
> need to tokenize on whitespace?   Would it make sense to replace spaces in
> the query with underscores and then trigramify the whole query as if it
> were a single term?
>
> ________________________________________
> From: Becker, Thomas [Thomas.Becker@netapp.com]
> Sent: Thursday, July 18, 2013 8:59 PM
> To: java-user@lucene.apache.org
> Subject: RE: Partial word match using n-grams
>
> Thanks for the reply Tim.  I really should have been clearer.  Let's say I
> have an object named "quota_tommy_1234".  I'd like to match that object
> with any 3 character (or more) substring of that name.  So for example:
>
> quo
> tom
> 234
> quota
> etc.
>
> Further, at search time I'm splitting input on whitespace before
> tokenizing into PhraseQueries and then ANDing them together.  So using the
> example above I also want the following queries to match:
>
> quo tom
> quo 234
> quota to <- this is the problem because there are no trigrams of "to"
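
The search behaviour described above can be simulated without Lucene. In this sketch a list of trigrams stands in for the indexed field, and a contiguous sub-list lookup stands in for a PhraseQuery. The class and method names are hypothetical, and treating a word that yields no trigrams as a failed clause is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical simulation of "split on whitespace, phrase-match each word, AND".
final class TrigramSearchSketch {
    // All contiguous 3-character substrings of s, in order.
    static List<String> trigrams(String s) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    // Stand-in for a PhraseQuery: the word's trigrams must occur at
    // consecutive positions among the indexed trigrams.
    static boolean phraseMatches(List<String> indexed, List<String> phrase) {
        if (phrase.isEmpty()) {
            return false; // assumption: a word shorter than 3 characters cannot match
        }
        return Collections.indexOfSubList(indexed, phrase) >= 0;
    }

    // Every whitespace-separated word of the query must match (AND).
    static boolean matches(String name, String query) {
        List<String> indexed = trigrams(name);
        for (String word : query.trim().split("\\s+")) {
            if (!phraseMatches(indexed, trigrams(word))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String name = "quota_tommy_1234";
        System.out.println(matches(name, "quo tom"));  // true
        System.out.println(matches(name, "quo 234"));  // true
        System.out.println(matches(name, "quota to")); // false: "to" has no trigrams
    }
}
```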
>
> That said, in response to your points:
>
> 1)  Not sure FuzzyQuery is what I need; I'm not trying to match via
> misspellings, which is the main function of FuzzyQuery, is it not?
>
> 2) The original names are all going to be > 3 characters, so there are no
> 1 or 2 letter terms at indexing time.  So generating the bigram "to" at
> search time will never match anything, unless I switch to bigrams at
> indexing time also, which is what I'm asking about.
>
> 3)  Again the names are all > 3 characters so I don't need to pad at
> indexing time.
>
> 4) Hopefully my explanation above clarifies.
>
> I should point out that I'm a Lucene novice and am not at all sure that
> what I'm doing is optimal.  But I have been impressed with how easy it is
> to get something working very quickly!
>
> ________________________________________
> From: Allison, Timothy B. [tallison@mitre.org]
> Sent: Thursday, July 18, 2013 7:49 PM
> To: java-user@lucene.apache.org
> Subject: RE: Partial word match using n-grams
>
> Tommy,
>   I'm sure that I don't fully understand your use case and your data.
>  Some thoughts:
>
> 1) I assume that fuzzy term search (edit distance <= 2) isn't meeting your
> needs or else you wouldn't have gone the ngram route.  If fuzzy term search
> + phrase/proximity search would meet your needs, see if
> ComplexPhraseQueryParser would work (although it looks like you're already
> building your own queries).
>
> 2) Would it make sense to modify NGramFilter so that it outputs a bigram
> for a two-letter term and a unigram for a one-letter term?  Might be
> messy...and "ab" in this scenario would never match "abc".
>
> 3) Would it make sense to pad your terms behind the scenes with
> "##"...this would add bloat, but not nearly as much as variable gram sizes
> with 1 <= n <= 3.
>
> ab -> ##ab## yields trigrams ##a, #ab, ab#, b##
>
> 4) How partial, and what types of partial match, do you need?  This is
> related to 1).  If minimum edit distance is sufficient, use it, especially
> with the blazing fast automaton (thank you, Robert Muir). If you have a smallish
> dataset you might consider allowing leading wildcards so that you could
> easily find all words, for example, containing abc with *abc*.  If your
> dataset is larger, you might consider something like
> ReversedWildcardFilterFactory (Solr) to speed this type of matching.
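
To put rough numbers on the bloat comparison in point 3, the sketch below reproduces the `##ab##` example and counts grams for a longer name under both schemes. It is plain Java with hypothetical names, and it only counts grams per term; it says nothing about actual index size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for comparing gram counts; not a Lucene API.
final class GramCountSketch {
    // Trigrams of the term after padding it with "##" on both sides.
    static List<String> paddedTrigrams(String term) {
        String s = "##" + term + "##";
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    // Number of n-grams of every size from minN to maxN for a term
    // of the given length.
    static int variableGramCount(int length, int minN, int maxN) {
        int count = 0;
        for (int n = minN; n <= maxN; n++) {
            count += Math.max(0, length - n + 1);
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(paddedTrigrams("ab")); // [##a, #ab, ab#, b##]

        String name = "quota_tommy_1234"; // 16 characters
        System.out.println(paddedTrigrams(name).size());              // 18
        System.out.println(variableGramCount(name.length(), 1, 3));   // 45
    }
}
```

For a term of length L (with L >= 3), padded trigrams produce L + 2 grams, while gram sizes 1 through 3 produce 3L - 3.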
>
> I look forward to other opinions from the list.
>
> -----Original Message-----
> From: Becker, Thomas [mailto:Thomas.Becker@netapp.com]
> Sent: Thursday, July 18, 2013 3:55 PM
> To: java-user@lucene.apache.org
> Subject: Partial word match using n-grams
>
> One of our main use-cases for search is to find objects based on partial
> name matches.  I've implemented this using n-grams and it works pretty
> well.  However we're currently using trigrams and that causes an
> interesting problem when searching for things like "abc ab" since we first
> split on whitespace and then construct PhraseQuerys containing each trigram
> yielded by the "word".  Obviously we cannot get a trigram out of "ab".  So
> our choices would seem to be either discard this part of the search term
> which seems unwise, or to reduce the minimum n-gram size.  But I'm slightly
> concerned about the resulting bloat in both the number of Terms stored
> in the index and the number contained in queries.  Is this something I should
> be concerned about?  It just "feels" like a query for the word "abcdef"
> shouldn't require a PhraseQuery of 15 terms (assuming n-grams 1,3).  Is
> this the best way to do partial word matches?  Thanks in advance.
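
The figure of 15 terms can be checked by enumerating the grams directly. This is a plain-Java sketch with hypothetical names; the order in which a real NGramFilter emits its tokens may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical enumeration of n-grams; not the Lucene implementation.
final class NGramEnumerationSketch {
    // All contiguous substrings of s with length between minN and maxN,
    // grouped by size (shortest first).
    static List<String> ngrams(String s, int minN, int maxN) {
        List<String> grams = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                grams.add(s.substring(i, i + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("abcdef", 1, 3).size()); // 15 = 6 + 5 + 4
        System.out.println(ngrams("abcdef", 3, 3));        // [abc, bcd, cde, def]
        System.out.println(ngrams("ab", 3, 3));            // [] : no trigram from "ab"
    }
}
```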
>
> -Tommy
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
