From: "Julien Nioche"
To: "Lucene Developers List"
Subject: Re : How does Lucene handle phrases containing words that are not indexed?
Date: Wed, 13 Feb 2002 18:08:49 +0100
Message-ID: <01f401c1b4b1$1c3d8750$520010ac@muscade>

By the way, I was wondering whether any Analyzer uses the following constructor:

public Token(String text, int start, int end, String typ)

It might be interesting to build an analyzer that recognizes punctuation marks and keeps them in the index as Tokens with a given type (say, "punctuation"). The advantage is that this information could be used by a SloppyPhraseScorer.phraseFreq() method to prevent a PhraseQuery from matching across a punctuation mark.
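To make the idea concrete, here is a minimal sketch in plain Java. It does not use Lucene classes: TypedToken is a hypothetical stand-in for Token, and tokenize() and matchesWithoutPunctuation() are names I made up for illustration. The slop handling is also simplified (ordered terms only, slop counted as the number of tokens in between), so it is not what SloppyPhraseScorer actually computes.

```java
import java.util.ArrayList;
import java.util.List;

public class PunctuationTokens {

    static final String WORD = "word";
    static final String PUNCTUATION = "punctuation";

    /** Hypothetical stand-in for Lucene's Token: text, offsets and a type label. */
    static final class TypedToken {
        final String text;
        final int start;
        final int end;
        final String type;

        TypedToken(String text, int start, int end, String type) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Emits lower-cased word tokens and one token per punctuation character. */
    static List<TypedToken> tokenize(String input) {
        List<TypedToken> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                int start = i;
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) {
                    i++;
                }
                tokens.add(new TypedToken(input.substring(start, i).toLowerCase(), start, i, WORD));
            } else if (".,;:!?".indexOf(c) >= 0) {
                tokens.add(new TypedToken(String.valueOf(c), i, i + 1, PUNCTUATION));
                i++;
            } else {
                i++; // whitespace and any other character is skipped
            }
        }
        return tokens;
    }

    /**
     * Returns true if `first` is followed by `second` with at most `slop`
     * tokens in between and no punctuation token between the two.
     */
    static boolean matchesWithoutPunctuation(List<TypedToken> tokens, String first,
                                             String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            TypedToken head = tokens.get(i);
            if (!WORD.equals(head.type) || !head.text.equals(first)) {
                continue;
            }
            for (int j = i + 1; j < tokens.size() && j - i - 1 <= slop; j++) {
                TypedToken t = tokens.get(j);
                if (PUNCTUATION.equals(t.type)) {
                    break; // a punctuation mark closes the slop window
                }
                if (t.text.equals(second)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<TypedToken> bad = tokenize("It is not personal. My computer hates me...");
        List<TypedToken> good = tokenize("A personal desktop computer");
        System.out.println(matchesWithoutPunctuation(bad, "personal", "computer", 3));
        System.out.println(matchesWithoutPunctuation(good, "personal", "computer", 3));
    }
}
```

Under these assumptions, a full stop between "personal" and "computer" rejects the match, while "personal desktop computer" is still accepted with a slop of 3.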
Since PhraseQueries are used for compound words (e.g. "personal computer") with a given slop value (say 3), it would be good not to match text such as "It is not personal. My computer hates me...". One solution would be to set the slop to zero, but that is not possible in my case: I use a module that generates compound terms with slop values in order to handle morphological variations (e.g. in French, "gestion de la casse" and "gestion des casses" are represented by "gestion casse"^3 and "gestion casses"^3).

This would involve creating a subclass of PhraseQuery, or modifying it by adding a boolean flag, and modifying the phraseFreq() method so that it checks that no Token with a punctuation type falls within the scope of the slop.

What do you think? Has anyone already tried something in that direction? Does it imply heavy changes?

Hugo: maybe you could store your stopwords as tokens with a different type?

----- Original Message -----
From: "hugo burm"
To:
Sent: Wednesday, February 13, 2002 5:32 PM
Subject: How does Lucene handle phrases containing words that are not indexed?

> How does Lucene handle phrases (literals) containing words that are not
> indexed (e.g. stopwords, one-letter words, numbers)? I did some tests
> (Lucene demo, my own 120000 XML documents, Cocoon search), and in all cases
> it looks like a search for the phrase "a specification" also finds
> documents that contain "the specification" (or "D. Washington"
> instead of "G. Washington").
>
> Of course you can change the indexing behaviour and make sure there are no
> stopwords, and that all one-letter words and numbers are indexed. But that
> seems a bad approach. A better approach: 1) find all indexed words in the
> phrase and from these words find all documents containing these words;
> 2) check the occurrence of the phrase by opening the original document.
> I am wondering: does Lucene perform step 2)? Of course, this step burns
> some CPU cycles.
>
> Hugo
>
> hugob@xs4all.nl
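For reference, the two-step approach from the quoted message could look roughly like the sketch below. It is plain Java with no Lucene calls. The stopword list, indexedWords() and search() are hypothetical names chosen for illustration, step 1 scans the documents directly instead of consulting an index, and step 2 is a naive case-insensitive substring test with no word-boundary check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PhraseVerification {

    // Assumed stopword list, for illustration only.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of"));

    /** Words that would survive indexing: stopwords and one-letter words are dropped. */
    static List<String> indexedWords(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (w.length() > 1 && !STOPWORDS.contains(w)) {
                words.add(w);
            }
        }
        return words;
    }

    /**
     * Step 1: keep documents that contain every indexed word of the phrase.
     * Step 2: keep only those whose original text contains the literal phrase.
     */
    static List<String> search(String phrase, List<String> documents) {
        List<String> required = indexedWords(phrase);
        String needle = phrase.toLowerCase();
        List<String> hits = new ArrayList<>();
        for (String doc : documents) {
            boolean candidate = new HashSet<>(indexedWords(doc)).containsAll(required);
            if (candidate && doc.toLowerCase().contains(needle)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "This is a specification draft.",
                "This is the specification draft.");
        System.out.println(search("a specification", docs));
    }
}
```

Both documents pass step 1, since only "specification" is indexed from the phrase; step 2 is what separates "a specification" from "the specification", at the cost of re-reading each candidate's text.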