From: "Alex Murzaku"
To: "'Lucene Developers List'"
Subject: RE: language identifier, stemmers and analyzers
Date: Tue, 19 Nov 2002 11:47:26 -0500
Organization: LISSUS llc
Message-ID: <000001c28feb$5817da80$5ae3fea9@Lissus>
In-Reply-To: <0a8401c28d73$212fc750$4600a8c0@whale>

Putting everything in one index should be fine as long as you know which
analyzer created each term. This means that you need to store the
language ID of each indexed document, which in effect means building
virtual separate indices, one per language.

Once you index using different analyzers/stemmers, you also need to
establish your search strategy. The same analyzers should be applied in
the search process as well.
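
To make the point above concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class LanguageRouting, the methods stem and termsFor, and the toy suffix-stripping rules are all hypothetical stand-ins for real analyzers and stemmers. It only shows the shape of the idea: every term carries the language ID of the analysis that produced it, and documents and queries go through the same analysis for a given language.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: these names are illustrative and are not Lucene APIs.
class LanguageRouting {

    // Toy stand-in for a real per-language stemmer (for example a Snowball one).
    // It strips a single plural-like suffix and nothing more.
    static String stem(String lang, String word) {
        switch (lang) {
            case "en":
                return word.length() > 2 && word.endsWith("s")
                        ? word.substring(0, word.length() - 1) : word;
            case "nl":
                return word.length() > 3 && word.endsWith("en")
                        ? word.substring(0, word.length() - 2) : word;
            default:
                throw new IllegalArgumentException("no analyzer for language: " + lang);
        }
    }

    // Tokenize, lowercase, stem, and tag each term with the language that produced it.
    // The same method serves documents and queries, so both sides use the same analysis.
    static List<String> termsFor(String lang, String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(lang + ":" + stem(lang, token));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> indexed = termsFor("en", "Cats and dogs"); // indexing side
        List<String> query = termsFor("en", "dog");             // search side, same language ID
        System.out.println(indexed + " contains " + query + ": " + indexed.containsAll(query));
    }
}
```

Prefixing the term with the language ID is just one way to get the "virtual separate indices" effect inside a single index; storing the language in its own field and filtering on it would be another.
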
The problem with automatic analyzer selection is that queries are
usually short, so the language guesser will not be as effective on them.
You might use the language field and manual language selection for this.

--
Alex Murzaku
___________________________________________
alex(at)lissus.com  http://www.lissus.com

-----Original Message-----
From: maurits van wijland [mailto:m.vanwijland@quicknet.nl]
Sent: Saturday, November 16, 2002 8:22 AM
To: Lucene Developers List
Cc: Brad Wellington
Subject: Re: language identifier, stemmers and analyzers

Otis,

Thanks for the reply.

>
> 1. Ideally, yes, if you ask me. You get email in at least 2 languages
> - wouldn't it make sense to have it all indexed in a single email
> index?
>
> 2. I think it would be nice to have an Analyzer that can pick the
> correct Analyzer based on the language, but since language identifier
> can also be retrieved from Brad's code directly, one will always be
> able to opt for using custom logic in their application instead of
> using your language-aware Analyzer. So my opinion is that a
> specialized Analyzer that can pick the right Analyzer implementation
> based on the language of the input would be good, as it does not
> prevent developers from using Brad's code directly.

That makes sense. I first thought that the analyzer would be a problem,
because the queryparser should use the same analyzer! But I guess that
this special analyzer would initiate a language-specific analyzer to
stem the words accordingly.

And yes, Brad's code can be used directly. Of course. Brad has made a
terrific language identifier that is suitable for more uses than just
Lucene. It works like a charm and works with international character
standards.

I will put together a package with an analyzer and a language model (I
will include the language source files so anybody can rebuild the
model). Give me a couple of days, because I am currently swamped with
work, but I will soon post the result to the list.
>
> Is this something that can be included in Lucene core/sandbox?
>

This is for the code/sandbox yes.

regards,

Maurits.

> Otis
>
>
> --- maurits van wijland wrote:
> > Dear all,
> >
> > Brad Wellington has created a language identifier which can be used
> > in combination with the snowball stemmers donated to Lucene by Alex
> > Murzaku. I have currently built a solid language model for use with
> > the language identifier for the languages: Danish, Dutch, English,
> > Finnish, French, German, Italian, Norwegian, Portuguese, Spanish and
> > Swedish.
> >
> > The language identifier is based on a Naive Bayes classifier. Now,
> > this is all nice, but I have some integration questions, and I hope
> > you can help out.
> >
> > Basically, the process of indexing is:
> >   Create an analyzer
> >   Open an IndexWriter
> >   Pass it the analyzer
> >   Process a document
> >   Add document to Index
> >   Optimize writer
> >   Close writer
> >
> > Now, the language identifier can help automatically identify what
> > language a document is written in. Based on the suggestion of the
> > identifier, an appropriate analyzer can be selected.
> >
> > This is all great, but...
> >
> > 1. Do we index all the terms from various documents in various
> >    languages into 1 index?
> > 2. Do I build a specialised Analyzer that selects the stemmer based
> >    on the Language Identifier or leave that up to the custom indexing
> >    application?
> >
> > Your thoughts please...
> >
> > regards,
> >
> > Maurits
> >
>
> --
> To unsubscribe, e-mail:
> For additional commands, e-mail:
>
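
The original message in this thread says the language identifier is based on a Naive Bayes classifier, but the thread does not show how it works internally. The following is therefore only a generic sketch of that technique, not Brad Wellington's code: it assumes character trigrams as features, add-one smoothing, and a uniform prior over languages. The class name NaiveBayesLanguageGuesser and its methods are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a character-trigram Naive Bayes language guesser.
class NaiveBayesLanguageGuesser {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>(); // language -> trigram -> count
    private final Map<String, Integer> totals = new HashMap<>();              // language -> trigrams seen
    private final Set<String> vocabulary = new HashSet<>();                   // all trigrams seen in training

    // Split text into overlapping three-character windows, padded with spaces.
    static List<String> trigrams(String text) {
        String s = " " + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim() + " ";
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    // Add the trigram counts of a sample text to the model for one language.
    void train(String lang, String text) {
        Map<String, Integer> langCounts = counts.computeIfAbsent(lang, k -> new HashMap<>());
        List<String> grams = trigrams(text);
        for (String g : grams) {
            langCounts.merge(g, 1, Integer::sum);
            vocabulary.add(g);
        }
        totals.merge(lang, grams.size(), Integer::sum);
    }

    // Return the language with the highest log-likelihood, or null if there is nothing to score.
    String guess(String text) {
        List<String> grams = trigrams(text);
        if (grams.isEmpty()) {
            return null;
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String lang : counts.keySet()) {
            double denominator = totals.getOrDefault(lang, 0) + vocabulary.size();
            double score = 0.0; // uniform prior assumed, so the prior term is omitted
            for (String g : grams) {
                int n = counts.get(lang).getOrDefault(g, 0);
                score += Math.log((n + 1.0) / denominator); // add-one smoothing
            }
            if (score > bestScore) {
                bestScore = score;
                best = lang;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesLanguageGuesser guesser = new NaiveBayesLanguageGuesser();
        guesser.train("en", "the quick brown fox jumps over the lazy dog and the cat");
        guesser.train("nl", "de snelle bruine vos springt over de luie hond en de kat");
        System.out.println(guesser.guess("the dog"));
        System.out.println(guesser.guess("de hond"));
    }
}
```

The score is a sum over the trigrams of the input, so a very short input contributes only a handful of terms. That is one way to see why a guesser of this kind has less evidence to work with on a short query than on a full document.
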