lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shai Erera <ser...@gmail.com>
Subject Re: Partial word match using n-grams
Date Fri, 19 Jul 2013 04:59:11 GMT
There are several options:

As Allison suggested, pad your words with ##, so that "quota tom" becomes
"##quota## ##tom##" at indexing time, and the query "quota to" becomes
either "##quota ##to", or if you want to optimize, only pad query terms < 3
characters, so it becomes "quota ##to". That should guarantee you will find
matches even if the user enters one character. Note that it will add more
terms to the index, but I suspect not much. E.g. for English, assuming all
words begin w/ letters you will add 26 ##[a-z] and 676 #[a-z][a-z] terms,
which isn't much. Overall, even for numbers and other languages, I don't
think it will bloat your index and this technique should have good
performance.

You can optimize that further depending whether you need to math "ta" with
"quota". If not, you don't need to pad with ## in the end of words, only
the beginning.

If you're worried about index bloat, you can convert queries like "quota
to" to the query "quota to*", i.e. MultiPhraseQuery. You do that only when
words are less than 3 characters. But I think the padding is the better
solution both from coding and performance perspectives.

Shai

On Fri, Jul 19, 2013 at 3:59 AM, Becker, Thomas <Thomas.Becker@netapp.com>wrote:

> Thanks for the reply Tim.  I really should have been clearer.  Let's say I
> have an object named "quota_tommy_1234".  I'd like to match that object
> with any 3 character (or more) substring of that name.  So for example:
>
> quo
> tom
> 234
> quota
> etc.
>
> Further, at search time I'm splitting input on whitespace before
> tokenizing into PhraseQueries and then ANDing them together.  So using the
> example above I also want the following queries to match:
>
> quo tom
> quo 234
> quota to <- this is the problem because there are no trigrams of "to"
>
> That said, in response to your points:
>
> 1)  Not sure FuzzyQuery is what I need; I'm not trying to match via
> misspellings, which is the main function of FuzzyQuery is it not?
>
> 2) The original names are all going to be > 3 characters, so there are no
> 1 or 2 letter terms at indexing time.  So generating the bigram "to" at
> search time will never match anything, unless I switch to bigrams at
> indexing time also, which is what I'm asking about.
>
> 3)  Again the names are all > 3 characters so I don't need to pad at
> indexing time.
>
> 4) Hopefully my explanation above clarifies.
>
> I should point out that I'm a Lucene novice and am not at all sure that
> what I'm doing is optimal.  But I have been impressed with how easy it is
> to get something working very quickly!
>
> ________________________________________
> From: Allison, Timothy B. [tallison@mitre.org]
> Sent: Thursday, July 18, 2013 7:49 PM
> To: java-user@lucene.apache.org
> Subject: RE: Partial word match using n-grams
>
> Tommy,
>   I'm sure that I don't fully understand your use case and your data.
>  Some thoughts:
>
> 1) I assume that fuzzy term search (edit distance <= 2) isn't meeting your
> needs or else you wouldn't have gone the ngram route.  If fuzzy term search
> + phrase/proximity search would meet your needs, see if
> ComplexPhraseQueryParser would work (although it looks like you're already
> building your own queries).
>
> 2) Would it make sense to modify NGramFilter so that it outputs a bigram
> for a two letter term and a unigram for a one letter term?  Might be
> messy...and "ab" in this scenario would never match "abc"
>
> 3) Would it make sense to pad your terms behind the scenes with
> "##"...this would add bloat, but not nearly as much as variable gram sizes
> with 1<= n <=3
>
> ab -> ##ab## yields trigrams ##a, #ab, ab#, b##
>
> 4) How partial and what types of partial do you need?  This is related to
> 1).  If minimum edit distance is sufficient; use it, especially with the
> blazing fast automaton (thank you, Robert Muir). If you have a smallish
> dataset you might consider allowing leading wildcards so that you could
> easily find all words, for example, containing abc with *abc*.  If your
> dataset is larger, you might consider something like
> ReversedWildcardFilterFactory (Solr) to speed this type of matching.
>
> I look forward to other opinions from the list.
>
> -----Original Message-----
> From: Becker, Thomas [mailto:Thomas.Becker@netapp.com]
> Sent: Thursday, July 18, 2013 3:55 PM
> To: java-user@lucene.apache.org
> Subject: Partial word match using n-grams
>
> One of our main use-cases for search is to find objects based on partial
> name matches.  I've implemented this using n-grams and it works pretty
> well.  However we're currently using trigrams and that causes an
> interesting problem when searching for things like "abc ab" since we first
> split on whitespace and then construct PhraseQuerys containing each trigram
> yielded by the "word".  Obviously we cannot get a trigram out of "ab".  So
> our choices would seem to be either discard this part of the search term
> which seems unwise, or to reduce the minimum n-gram size.  But I'm slightly
> concerned about the resulting bloat in both the of number of Terms stored
> in the index as well as contained in queries.  Is this something I should
> be concerned about?  It just "feels" like a query for the word "abcdef"
> shouldn't require a PhraseQuery of 15 terms (assuming n-grams 1,3).  Is
> this the best way to do partial word matches?  Thanks in advance.
>
> -Tommy
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message