lucene-dev mailing list archives

From "Julien Nioche" <>
Subject Re : How does Lucene handle phrases containing words that are not indexed?
Date Wed, 13 Feb 2002 17:08:49 GMT
By the way, I was wondering whether any Analyzer uses the following
constructor:
  public Token(String text, int start, int end, String typ)

Maybe it would be interesting to build an analyzer that recognizes
punctuation marks and
keeps them in the index as Tokens with a given type (say, for example,
"punctuation")?
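
To make the idea concrete, here is a minimal sketch of such a tokenizer in plain Java. It does not use the Lucene classes: TypedToken is a hypothetical stand-in that only mirrors the four constructor arguments quoted above, and the type labels "word" and "punctuation" as well as the set of recognized marks are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Lucene's Token, reduced to the four
// constructor arguments quoted above (text, start, end, type).
final class TypedToken {
    final String text;
    final int start;   // offset of the first character
    final int end;     // offset one past the last character
    final String type; // "word" or "punctuation" (labels are assumptions)

    TypedToken(String text, int start, int end, String type) {
        this.text = text;
        this.start = start;
        this.end = end;
        this.type = type;
    }
}

final class PunctuationAwareTokenizer {
    static final String WORD = "word";
    static final String PUNCTUATION = "punctuation";

    // Splits the input into lowercased word tokens and single-character
    // punctuation tokens; whitespace and other characters are skipped.
    static List<TypedToken> tokenize(String input) {
        List<TypedToken> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                int start = i;
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) {
                    i++;
                }
                tokens.add(new TypedToken(input.substring(start, i).toLowerCase(), start, i, WORD));
            } else if (isPunctuation(c)) {
                tokens.add(new TypedToken(String.valueOf(c), i, i + 1, PUNCTUATION));
                i++;
            } else {
                i++;
            }
        }
        return tokens;
    }

    // The set of marks treated as punctuation is an assumption.
    static boolean isPunctuation(char c) {
        return ".,;:!?".indexOf(c) >= 0;
    }
}
```

With this sketch, tokenizing "It is not personal. My computer" yields the full stop as its own token of type "punctuation" between "personal" and "my".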

The advantage is that this information could be used by the
SloppyPhraseScorer.phraseFreq() method
to keep a PhraseQuery from matching across a punctuation mark. Since
PhraseQueries are used for compound words
(e.g. "personal computer") with a given slop value (say 3), it would be
great not to match things such as "It is not personal. My computer hates
me...".

A solution could be to set a slop value of zero, but that is not possible in
my case (I use a module that generates compound terms with slop values, in
order to handle morphological variations, e.g. in French "gestion de la casse"
and "gestion des casses", which are represented by "gestion casse"^3 and
"gestion casses"^3).

This involves creating a subclass of PhraseQuery, or modifying it by adding a
boolean to it, and modifying the phraseFreq() method so that it checks that
there is no Token with a punctuation type within the scope of the slop.
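
As a rough illustration of that check, here is a simplified sketch in plain Java. It is not the SloppyPhraseScorer code: it only handles two-term, in-order phrases over an in-memory token list, it assumes punctuation tokens occupy their own positions, and the class name, method signature and the stopAtPunctuation flag are all hypothetical.

```java
final class PunctuationAwarePhraseMatcher {

    // Counts occurrences of `first` followed by `second` with at most
    // `slop` tokens in between. When stopAtPunctuation is true, a token
    // of type "punctuation" inside that window cancels the match.
    static int phraseFreq(String[] texts, String[] types,
                          String first, String second,
                          int slop, boolean stopAtPunctuation) {
        int freq = 0;
        for (int i = 0; i < texts.length; i++) {
            if (!first.equals(texts[i])) {
                continue;
            }
            // Last position at which `second` may still appear.
            int last = Math.min(texts.length - 1, i + 1 + slop);
            for (int j = i + 1; j <= last; j++) {
                if (stopAtPunctuation && "punctuation".equals(types[j])) {
                    break; // a punctuation mark lies within the slop window
                }
                if (second.equals(texts[j])) {
                    freq++;
                    break;
                }
            }
        }
        return freq;
    }
}
```

On the token sequence it / is / not / personal / . / my / computer with a slop of 3, this sketch counts one match for "personal computer" when the flag is off and none when it is on, while "my personal home computer" still matches in both cases.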

What do you think about it? Has anyone already tried something in that
direction? Would it imply heavy changes?

Hugo: maybe you could store your stopwords as tokens with a different type?

----- Original Message -----
From: "hugo burm" <>
To: <>
Sent: Wednesday, February 13, 2002 5:32 PM
Subject: How does Lucene handle phrases containing words that are not
indexed?

> How does Lucene handle phrases (literals) containing words that are not
> indexed (e.g. stopwords, one-letter words, numbers)? I did some tests
> (lucene demo, my own 120000 xml documents, Cocoon search) and in all cases
> it looks like a search for the phrase "a specification" also finds
> documents which contain "the specification" (or: "D. Washington"
> instead of "G. Washington").
> Of course you can change the index behaviour and make sure there are no
> stopwords, and all one-letter words and numbers are indexed. But that is
> a bad approach. A better approach: 1) find all indexed words in the phrase
> and from these words find all documents containing these words. 2) check
> the occurrence of the phrase by opening the original document. I am
> wondering: does Lucene perform step 2)? Of course this step burns some
> cpu cycles.
> Hugo
> --
> To unsubscribe, e-mail:
> For additional commands, e-mail:
