From: "Brian Brown"
To: "Lucene Developers List"
Subject: Re: searching words starting with accent characters using UTF-8
Date: Sat, 29 Dec 2001 15:55:38 +0100

I am developing a French search engine using Lucene. To this end I used Gerhard Schwarz's German analyser as a starting point. This seems to work OK, the main difference being that I am using a lookup-table approach rather than stemming by calculation.

I find it is also necessary to adapt the QueryParser for accented characters. My approach is simply to add the individual accented characters to the #_TERM_CHAR and #_TERM_START_CHAR character sets.

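
As a rough illustration of the lookup-table idea mentioned above, here is a minimal sketch in plain Java. It is not taken from the analyser under discussion: the class name `LookupStemmer`, the method `stem`, and the three table entries are all assumptions made up for the example, and a real table would presumably be loaded from a word list.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of lookup-table stemming; not part of Lucene.
final class LookupStemmer {
    // Made-up entries for illustration only.
    private final Map<String, String> table = new HashMap<>();

    LookupStemmer() {
        table.put("chevaux", "cheval");
        table.put("travaux", "travail");
        table.put("mangeons", "manger");
    }

    // Lower-cases the term, then returns its table entry,
    // or the lower-cased term itself when the table has no entry.
    String stem(String term) {
        String key = term.toLowerCase(Locale.FRENCH);
        return table.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        LookupStemmer stemmer = new LookupStemmer();
        System.out.println(stemmer.stem("Chevaux")); // prints "cheval"
        System.out.println(stemmer.stem("maison"));  // prints "maison" (no entry, falls through)
    }
}
```

The trade-off against stemming by calculation is that the table only covers words someone has listed, but it handles irregular forms without any suffix rules.
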
My question is: what is the purpose of adding all the characters "\u0080"-"\uFFFE", which I find in the current source?

Brian Brown

----- Original Message -----
From: "Brian Goetz"
To: "Lucene Developers List"
Sent: Tuesday, December 11, 2001 12:44 PM
Subject: Re: searching words starting with accent characters using UTF-8

> > Thanks! That would be great!
>
> Be careful what you ask for, I foobared it up the last time... :)
>
> > Yes, this is a lot of features, and a lot of syntax. The query parser is
> > already complicated. Perhaps we should instead write a number of example
> > query parsers that do different things, and encourage folks to write their
> > own, with these as models. Unfortunately, I'm not sure many folks would do
> > that: instead they would ask why one parser doesn't have a feature that
> > another does. So I'm having a hard time seeing a non-kitchen-sink
> > alternative. Do you?
>
> I don't really object to a kitchen sink approach, but I prefer to have
> it done all at once rather than added incrementally.
>
> So far we have:
> - Prefix (currently *)
> - Fuzzy (currently ~)
> - Boost (currently ^nn)
> - AND, OR, NOT, &&, ||, !
> - Phrases ("foo bar")
>
> We want to add:
> - NEAR/phrase-with-slop
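
On the "\u0080"-"\uFFFE" question raised above, a minimal sketch in plain Java (not the JavaCC grammar itself) may help compare the two ways of admitting characters. The class `TermChars`, its method names, and the particular list of accented letters are assumptions for illustration. The sketch does not claim to explain why the range is in the source; it only shows that the accented Latin-1 letters already fall inside that range, as do characters from non-Latin scripts.

```java
// Hypothetical sketch; names and the explicit list are illustrative assumptions.
final class TermChars {
    // A few French accented letters, written as Unicode escapes so the
    // source does not depend on the file encoding:
    // a-grave, a-circumflex, c-cedilla, e-acute, e-grave, e-circumflex,
    // e-diaeresis, i-circumflex, i-diaeresis, o-circumflex, u-grave,
    // u-circumflex, u-diaeresis.
    static final String EXPLICIT =
        "\u00e0\u00e2\u00e7\u00e9\u00e8\u00ea\u00eb\u00ee\u00ef\u00f4\u00f9\u00fb\u00fc";

    // Explicit-list approach: accept only the characters named in the list.
    static boolean inExplicitSet(char c) {
        return EXPLICIT.indexOf(c) >= 0;
    }

    // Range approach: accept everything from U+0080 through U+FFFE.
    static boolean inRange(char c) {
        return c >= '\u0080' && c <= '\uFFFE';
    }

    public static void main(String[] args) {
        System.out.println(inExplicitSet('\u00e9')); // e-acute: true
        System.out.println(inRange('\u00e9'));       // e-acute: true
        System.out.println(inRange('e'));            // plain ASCII letter: false
        System.out.println(inExplicitSet('\u00f1')); // n-tilde, not in the list: false
        System.out.println(inRange('\u00f1'));       // n-tilde: true
    }
}
```

Under these assumptions, a grammar that already accepts the whole range would not need the accented letters added one at a time, whereas the explicit list has to be extended for every new language.
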