lucene-dev mailing list archives

From Doug Cutting <DCutt...@grandcentral.com>
Subject RE: Re : How does Lucene handle phrases containing words that are not indexed?
Date Thu, 14 Feb 2002 17:42:43 GMT
> From: Julien Nioche [mailto:julien.nioche@lingway.com]
> 
> By the way, I was wondering if there is any Analyzer that uses the
> following constructor
>   public Token(String text, int start, int end, String typ) ?

StandardTokenizer uses Token's type field to communicate with
StandardFilter, which post-processes tokens according to their type
(for example, stripping the possessive 's and the dots in acronyms).
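
As a rough illustration of that hand-off, here is a minimal sketch in plain
Java. It does not use Lucene's classes: Tok, tokenize, classify, filter and
terms are hypothetical stand-ins, and the two regular expressions are
simplified assumptions rather than the real grammar. The point is only that
the tokenizer records what kind of token it saw, and the filter acts on that
label without re-parsing the text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not Lucene code: a type tag passed from tokenizer to filter. */
final class TypedTokenSketch {

    /** Minimal token: text, character offsets, and a type label set by the tokenizer. */
    static final class Tok {
        final String text;
        final int start;
        final int end;
        final String type;

        Tok(String text, int start, int end, String type) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Assumed, simplified rules for labelling a token by its surface form. */
    static String classify(String text) {
        if (text.matches("(?:[A-Za-z]\\.){2,}")) return "ACRONYM";
        if (text.matches("[A-Za-z]+'[sS]")) return "APOSTROPHE";
        return "WORD";
    }

    /** Splits on whitespace and tags every token with a type. */
    static List<Tok> tokenize(String input) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        int n = input.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(input.charAt(i))) i++;
            int start = i;
            while (i < n && !Character.isWhitespace(input.charAt(i))) i++;
            if (i > start) {
                String text = input.substring(start, i);
                out.add(new Tok(text, start, i, classify(text)));
            }
        }
        return out;
    }

    /** Post-processing stage: decides what to do from the type tag alone. */
    static List<Tok> filter(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            String text = t.text;
            if (t.type.equals("ACRONYM")) {
                text = text.replace(".", "");                 // U.S.A. becomes USA
            } else if (t.type.equals("APOSTROPHE")) {
                text = text.substring(0, text.length() - 2);  // drop the trailing 's
            }
            out.add(new Tok(text, t.start, t.end, t.type));
        }
        return out;
    }

    /** Convenience: the term texts that would reach the index. */
    static List<String> terms(String input) {
        List<String> out = new ArrayList<>();
        for (Tok t : filter(tokenize(input))) out.add(t.text);
        return out;
    }
}
```

Note that only the term texts come out of terms(); the type label is used
during analysis and then dropped.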

> Maybe it would be interesting to build an analyzer that recognizes
> punctuation marks and keeps them in the index as Tokens with a given
> Type (say, for example, "punctuation")?

Unfortunately, the token type is not stored in the index.  Adding it could
noticeably increase index size and slow down searching.

> The advantage is that this information could be used by a
> SloppyPhraseScorer.phraseFreq() method to avoid PhraseQuery matches
> that span a punctuation mark. Since PhraseQueries are used for
> compound words (e.g. "personal computer") with a given slop value
> (say 3), it would be great not to match things such as "It is not
> personal. My computer hates me...".

On the other hand, you'd miss things like, "He needs a new computer.
Personal computing has advanced since 1970."

Still, constraining matches to be within a sentence can be useful, but
Lucene does not currently support it, and I don't see an easy way to add it.
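
To make the trade-off concrete, here is a minimal sketch in plain Java of
what a sentence-constrained match would have to do. None of this is Lucene
code: SentencePhraseSketch, analyze and matches are hypothetical names, the
matcher only handles two terms in order with a maximum gap (a
simplification, not what SloppyPhraseScorer computes), and sentence
boundaries are guessed naively from '.', '!' and '?'. It assumes each
position can be tagged with a sentence number, which is exactly the
information the index does not hold today.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not Lucene code: a two-term sloppy match with a sentence check. */
final class SentencePhraseSketch {

    /** A lowercased term with its word position and the number of its sentence. */
    static final class Tok {
        final String term;
        final int position;
        final int sentence;

        Tok(String term, int position, int sentence) {
            this.term = term;
            this.position = position;
            this.sentence = sentence;
        }
    }

    /** Splits into words; '.', '!' and '?' end a sentence (naive: abbreviations will fool it). */
    static List<Tok> analyze(String text) {
        List<Tok> out = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        int position = 0;
        int sentence = 0;
        boolean sentenceHasTokens = false;
        for (int i = 0; i <= text.length(); i++) {
            // A virtual trailing space flushes the last word.
            char c = i < text.length() ? text.charAt(i) : ' ';
            if (Character.isLetterOrDigit(c)) {
                word.append(Character.toLowerCase(c));
                continue;
            }
            if (word.length() > 0) {
                out.add(new Tok(word.toString(), position++, sentence));
                word.setLength(0);
                sentenceHasTokens = true;
            }
            // Runs such as "..." advance the sentence counter only once.
            if ((c == '.' || c == '!' || c == '?') && sentenceHasTokens) {
                sentence++;
                sentenceHasTokens = false;
            }
        }
        return out;
    }

    /** True if 'second' follows 'first' with at most 'slop' words in between. */
    static boolean matches(String text, String first, String second,
                           int slop, boolean sameSentence) {
        List<Tok> toks = analyze(text);
        for (Tok a : toks) {
            if (!a.term.equals(first)) continue;
            for (Tok b : toks) {
                if (!b.term.equals(second)) continue;
                int gap = b.position - a.position - 1;
                if (gap < 0 || gap > slop) continue;
                if (sameSentence && a.sentence != b.sentence) continue;
                return true;
            }
        }
        return false;
    }
}
```

With slop 3, matches("It is not personal. My computer hates me...",
"personal", "computer", 3, false) succeeds, while passing true for the last
argument rejects it because the two terms fall in different sentences.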

Doug
