Mailing-List: contact lucene-dev-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Reply-To: <lists@lissus.com>
From: "Alex Murzaku" <lists@lissus.com>
To: "'Lucene Developers List'" <lucene-dev@jakarta.apache.org>
Subject: RE: language identifier, stemmers and analyzers
Date: Tue, 19 Nov 2002 11:47:26 -0500
Organization: LISSUS llc
Message-ID: <000001c28feb$5817da80$5ae3fea9@Lissus>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: 7bit
Importance: Normal
In-Reply-To: <0a8401c28d73$212fc750$4600a8c0@whale>

Putting everything in one index should be fine as long as you know which
analyzer created each term. This means that you need to store the
language ID of each document indexed, which in effect means building
virtual separate indices, one per language, inside the single index.
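
To make the "virtual separate indices" idea concrete, here is a minimal
sketch in plain Java with no Lucene classes. It assumes each posting is
keyed by language ID plus term, so terms produced by different analyzers
never collide. The names (LanguageTaggedIndex, add, lookup, languageOf)
are hypothetical, and the lowercase-and-split step only stands in for a
real analyzer.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; this is not Lucene API.
final class LanguageTaggedIndex {
    // Key is "<lang>:<term>", so "nl:die" and "en:die" are distinct entries.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // The language ID stored for each indexed document.
    private final Map<Integer, String> docLanguage = new HashMap<>();

    void add(int docId, String lang, String text) {
        docLanguage.put(docId, lang);
        // Stand-in analyzer: lowercase, then split on runs of non-letters.
        for (String term : text.toLowerCase(Locale.ROOT).split("\\P{L}+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(lang + ":" + term, k -> new TreeSet<>())
                        .add(docId);
            }
        }
    }

    // Returns the IDs of documents in the given language containing the term.
    Set<Integer> lookup(String lang, String term) {
        return postings.getOrDefault(lang + ":" + term, Set.of());
    }

    String languageOf(int docId) {
        return docLanguage.get(docId);
    }
}
```

With this layout, a lookup for "die" in English and a lookup for "die" in
Dutch consult different keys even though both live in one map.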

Once you index using different analyzers/stemmers, you also need to
establish your search strategy. The same analyzers should be applied to
the search process as well. The problem with automatic analyzer
selection is that queries are usually short, so the language guesser
will be less effective on them. You might use the language field and
manual language selection for this.
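
A rough sketch of that search strategy, again in plain Java rather than
Lucene: one analyze method is used for both documents and queries, and
the language is passed in explicitly (for a query, from the user's manual
selection) rather than guessed. The class name and the toy suffix
strippers are assumptions made for illustration; they are not the
Snowball stemmers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; the "stemmers" are toy suffix strippers.
final class PerLanguageAnalysis {
    private static final Map<String, UnaryOperator<String>> STEMMERS = Map.of(
        "en", w -> w.endsWith("s") && w.length() > 3
                ? w.substring(0, w.length() - 1) : w,
        "nl", w -> w.endsWith("en") && w.length() > 4
                ? w.substring(0, w.length() - 2) : w);

    // Called at index time and at query time, so both sides yield the same terms.
    static List<String> analyze(String lang, String text) {
        UnaryOperator<String> stemmer = STEMMERS.get(lang);
        if (stemmer == null) {
            throw new IllegalArgumentException("no analyzer for language: " + lang);
        }
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\P{L}+")) {
            if (!token.isEmpty()) {
                terms.add(stemmer.apply(token));
            }
        }
        return terms;
    }
}
```

A query analyzed with the wrong language keeps or strips the wrong
suffixes, so its terms will not match what was indexed; that is the
reason to take the language from a stored field or from the user.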

-- 
Alex Murzaku
___________________________________________
 alex(at)lissus.com  http://www.lissus.com            

-----Original Message-----
From: maurits van wijland [mailto:m.vanwijland@quicknet.nl] 
Sent: Saturday, November 16, 2002 8:22 AM
To: Lucene Developers List
Cc: Brad Wellington
Subject: Re: language identifier, stemmers and analyzers


Otis,

Thanks for the reply.

>
> 1. Ideally, yes, if you ask me.  You get email in at least 2 languages
> - wouldn't it make sense to have it all indexed in a single email 
> index?

>
> 2. I think it would be nice to have an Analyzer that can pick the 
> correct Analyzer based on the language, but since language identifier 
> can also be retrieved from Brad's code directly, one will always be 
> able to opt for using custom logic in their application instead of 
> using your language-aware Analyzer. So my opinion is that a 
> specialized Analyzer that can pick the right Analyzer implementation 
> based on the language of the input would be good, as it does not 
> prevent developers from using Brad's code directly.
That makes sense. I first thought that the analyzer would be a problem,
because the QueryParser should use the same analyzer! But I guess that
this special analyzer would instantiate a language-specific analyzer to
stem the words accordingly.
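
As a sketch of how such a delegating analyzer could be shaped (plain
Java, not Lucene's Analyzer class): it asks a language identifier for a
language ID and hands the text to the analyzer registered for that ID,
falling back to a default when none is registered. All names here are
hypothetical, the identifier function only stands in for Brad's code,
and analyzers are reduced to text-to-text functions for brevity.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; not Lucene's Analyzer class.
final class LanguageAwareAnalyzer {
    private final Function<String, String> identifier;                 // text -> language ID
    private final Map<String, Function<String, String>> byLanguage;    // language ID -> analyzer
    private final Function<String, String> fallback;                   // used for unknown languages

    LanguageAwareAnalyzer(Function<String, String> identifier,
                          Map<String, Function<String, String>> byLanguage,
                          Function<String, String> fallback) {
        this.identifier = identifier;
        this.byLanguage = byLanguage;
        this.fallback = fallback;
    }

    // Exposed so the caller can store the language ID with the document.
    String languageOf(String text) {
        return identifier.apply(text);
    }

    // Delegates to the analyzer registered for the detected language.
    String analyze(String text) {
        return byLanguage.getOrDefault(languageOf(text), fallback).apply(text);
    }
}
```

Because the identifier is injected, an application can still call it
directly and apply its own logic instead of using this wrapper.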

And yes, Brad's code can be used directly, of course. Brad has made a
terrific language identifier that is suitable for uses other than
Lucene's. It works like a charm and works with international
character standards.

I will put together a package with an analyzer and a language model (I
will include the language source files so anybody can rebuild the model).
Give me a couple of days, because I am currently swamped with work, but
I will soon post the result to the list.

>
> Is this something that can be included in Lucene core/sandbox?
>
This is for the core/sandbox, yes.

regards,

Maurits.

> Otis
>
>
> --- maurits van wijland <m.vanwijland@quicknet.nl> wrote:
> > Dear all,
> >
> > Brad Wellington has created a language identifier which can be used 
> > in combination with
> > the Snowball stemmers donated to Lucene by Alex Murzaku. I have
> > currently built a solid language model for use with the language
> > identifier for the languages: Danish, Dutch, English, Finnish,
> > French, German, Italian, Norwegian, Portuguese, Spanish and Swedish.
> >
> > The language identifier is based on a Naive Bayes classifier. Now, 
> > this is all nice, but I have some integration questions, and I hope 
> > you can help
> > out.
> >
> > Basically, the process of indexing is:
> > Create an analyzer
> > Open an IndexWriter
> > Pass it the analyzer
> > Process a document
> > Add document to Index
> > Optimize writer
> > Close writer
> >
> > Now, the language identifier can help automatically identify what 
> > language a document is written in. Based on the suggestion of the 
> > identifier, an appropriate analyzer can be selected.
> >
> > This is all great, but...
> >
> > 1. Do we index all the terms from various documents in various 
> > languages into 1 index?
> > 2. Do I build a specialised Analyzer that selects the stemmer based
> > on the Language Identifier or leave that up to the custom indexing
> > application?
> >
> > Your thoughts please...
> >
> > regards,
> >
> > Maurits
>
>
>
>
> --
> To unsubscribe, e-mail:
<mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
<mailto:lucene-dev-help@jakarta.apache.org>
>

