lucene-dev mailing list archives

From Steven A Rowe <sar...@syr.edu>
Subject RE: Tokenization and Fuzziness: How to Allow Multiple Strategies?
Date Mon, 07 Feb 2011 18:54:47 GMT
Hi Tavi,

solr-dev@lucene.apache.org has been deprecated since the Lucene and Solr source trees merged
last year.  Please use dev@lucene.apache.org instead.

However, your question is about *usage* of Lucene/Solr, rather than *development*, so you
should be using solr-user@lucene.apache.org or lucene-user@lucene.apache.org.  Please repost
your question to one of these lists.

Steve

> -----Original Message-----
> From: Tavi Nathanson [mailto:tavi.nathanson@gmail.com]
> Sent: Monday, February 07, 2011 12:12 PM
> To: solr-dev@lucene.apache.org
> Subject: Tokenization and Fuzziness: How to Allow Multiple Strategies?
> 
> 
> Hey everyone,
> 
> Tokenization seems inherently fuzzy and imprecise, yet Lucene does not
> appear to provide an easy mechanism to account for this fuzziness.
> 
> Let's take an example, where the document I'm indexing is
> "v1.1.0 mr. jones david@gmail.com"
> 
> I may want to tokenize this as follows: ["v1.1.0", "mr", "jones",
> "david@gmail.com"]
> ...or I may want to tokenize this as follows: ["v1", "1.0", "mr", "jones",
> "david", "gmail.com"]
> ...or I may want to tokenize it another way.
> 
> I would think that the best approach would be indexing using multiple
> strategies, such as:
> 
> ["v1.1.0", "v1", "1.0", "mr", "jones", "david@gmail.com", "david",
> "gmail.com"]
> 
> However, this would destroy phrase queries. And while Lucene lets you index
> multiple tokens at the same position, I haven't found a way to deal with
> cases where you want to index a set of tokens at one position: nor does that
> even make sense. For instance, I can't index ["david", "gmail.com"] in the
> same position as "david@gmail.com".
> 
> So:
> 
> - Any thoughts, in general, about how you all approach this fuzziness? Do
> you just choose one tokenization strategy and hope for the best?
> - Might there be a way to use multiple strategies and *not* break phrase
> queries that I'm overlooking?
> 
> Thanks!
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Tokenization-and-Fuzziness-How-to-Allow-Multiple-Strategies-tp2444956p2444956.html
> Sent from the Solr - Dev mailing list archive at Nabble.com.

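The position problem raised in the quoted message can be illustrated with a toy positional index. This is only a sketch in plain Python, not Lucene's API: build_index, phrase_match, and the trailing token "wrote" are hypothetical names and data added for the example, and the model assumes a phrase matches when its terms sit at consecutive positions.

```python
from collections import defaultdict


def build_index(tokens):
    """Map each term to the set of positions where it occurs.

    `tokens` is a list of (term, position_increment) pairs. An increment
    of 0 stacks the term on the previous token's position.
    """
    index = defaultdict(set)
    position = -1
    for term, increment in tokens:
        position += increment
        index[term].add(position)
    return index


def phrase_match(index, phrase):
    """Return True if the phrase terms occur at consecutive positions."""
    first, *rest = phrase
    return any(
        all(start + offset in index.get(term, ()) for offset, term in enumerate(rest, 1))
        for start in index.get(first, ())
    )


# "wrote" is a hypothetical extra word, appended so something follows the address.
tokens = [
    ("mr", 1), ("jones", 1),
    ("david@gmail.com", 1),  # position 2
    ("david", 0),            # stacked on position 2
    ("gmail.com", 1),        # position 3: the split form needs a second slot
    ("wrote", 1),            # position 4
]
index = build_index(tokens)

print(phrase_match(index, ["jones", "david"]))            # True
print(phrase_match(index, ["david", "gmail.com"]))        # True
print(phrase_match(index, ["david@gmail.com", "wrote"]))  # False
```

The last result shows the breakage in this toy model: the whole address occupies one position while its split form occupies two, so a phrase that continues past the unsplit token no longer lines up with the word that follows it.
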