lucene-dev mailing list archives

From Tavi Nathanson <tavi.nathan...@gmail.com>
Subject Tokenization and Fuzziness: How to Allow Multiple Strategies?
Date Mon, 07 Feb 2011 17:12:11 GMT

Hey everyone,

Tokenization seems inherently fuzzy and imprecise, yet Lucene does not
appear to provide an easy mechanism to account for this fuzziness.

Let's take an example, where the document I'm indexing is "v1.1.0 mr. jones
david@gmail.com"

I may want to tokenize this as follows: ["v1.1.0", "mr", "jones",
"david@gmail.com"]
...or I may want to tokenize this as follows: ["v1", "1.0", "mr", "jones",
"david", "gmail.com"]
...or I may want to tokenize it another way.
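
To make the two strategies above concrete, here is a minimal Python sketch
that reproduces both token lists for the example document. The function
names and the regular expression are illustrative assumptions, not Lucene
analyzers; real analyzers would behave differently on other inputs.

```python
import re

DOC = "v1.1.0 mr. jones david@gmail.com"

def tokenize_coarse(text):
    """Hypothetical strategy 1: split on whitespace, drop trailing punctuation."""
    stripped = (tok.rstrip(".,;:") for tok in text.split())
    return [tok for tok in stripped if tok]

# Hypothetical strategy 2: prefer "digits.digits", then "word.word",
# then any plain alphanumeric run. Everything else acts as a separator.
FINE_PATTERN = re.compile(r"\d+\.\d+|[a-z]+\.[a-z]+|[a-z0-9]+")

def tokenize_fine(text):
    """Hypothetical strategy 2: break versions and e-mail addresses apart."""
    return FINE_PATTERN.findall(text.lower())

coarse_tokens = tokenize_coarse(DOC)
fine_tokens = tokenize_fine(DOC)
print(coarse_tokens)
print(fine_tokens)
```

Neither output is "wrong"; each one simply favors a different kind of
query.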

I would think that the best approach would be indexing using multiple
strategies, such as:

["v1.1.0", "v1", "1.0", "mr", "jones", "david@gmail.com", "david",
"gmail.com"]

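However, this would break phrase queries: the extra tokens shift the
positions of everything after them, so terms that were adjacent in the
original text (such as "v1.1.0 mr") are no longer adjacent in the index.
And while Lucene lets you index multiple tokens at the same position, I
haven't found a way to index a sequence of tokens as an alternative to a
single token at one position, nor is it clear that would even make sense.
For instance, I can't index ["david", "gmail.com"] in the same position as
"david@gmail.com".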
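
The position problem can be illustrated with a toy positional index. This
is a simplified Python model, loosely inspired by Lucene's position
increments (an increment of 0 stacks a token on the previous position); it
is an assumption-laden sketch and not Lucene's actual index or query code.

```python
from collections import defaultdict

def build_index(tokens_with_increments):
    """Map each term to the positions it occupies.

    Each item is (term, position_increment). An increment of 0 places the
    term at the same position as the previous token.
    """
    index = defaultdict(list)
    position = -1
    for term, increment in tokens_with_increments:
        position += increment
        index[term].append(position)
    return index

def phrase_matches(index, phrase):
    """Return True if the phrase terms occur at consecutive positions."""
    first, *rest = phrase
    return any(
        all(start + offset in index.get(term, [])
            for offset, term in enumerate(rest, 1))
        for start in index.get(first, [])
    )

# Flat union of both strategies: every token gets its own position.
flat = build_index([(t, 1) for t in [
    "v1.1.0", "v1", "1.0", "mr", "jones",
    "david@gmail.com", "david", "gmail.com"]])

# Stacked variant: the first token of each alternative shares a position.
stacked = build_index([
    ("v1.1.0", 1), ("v1", 0), ("1.0", 1), ("mr", 1), ("jones", 1),
    ("david@gmail.com", 1), ("david", 0), ("gmail.com", 1)])

print(phrase_matches(flat, ["v1.1.0", "mr"]))     # False: "mr" is 3 positions away
print(phrase_matches(flat, ["jones", "david"]))   # False: "david@gmail.com" sits between
print(phrase_matches(stacked, ["1.0", "mr"]))     # True
print(phrase_matches(stacked, ["v1.1.0", "mr"]))  # False: "1.0" pushed "mr" one slot later
```

In the stacked variant the two-token alternative ["v1", "1.0"] needs two
positions while "v1.1.0" needs one, so "mr" can only be adjacent to one of
them. The e-mail case only appears to work here because no token follows
it in the example document.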

So:

- Any thoughts, in general, about how you all approach this fuzziness? Do
you just choose one tokenization strategy and hope for the best?
- Is there a way I'm overlooking to use multiple strategies *without*
breaking phrase queries?

Thanks!
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Tokenization-and-Fuzziness-How-to-Allow-Multiple-Strategies-tp2444956p2444956.html
Sent from the Solr - Dev mailing list archive at Nabble.com.