lucene-dev mailing list archives

From maurits van wijland <m.vanwijl...@quicknet.nl>
Subject Re: language identifier, stemmers and analyzers
Date Sat, 16 Nov 2002 13:21:49 GMT
Otis,

Thanks for the reply.

>
> 1. Ideally, yes, if you ask me.  You get email in at least 2 languages
> - wouldn't it make sense to have it all indexed in a single email
> index?

>
> 2. I think it would be nice to have an Analyzer that can pick the
> correct Analyzer based on the language, but since language identifier
> can also be retrieved from Brad's code directly, one will always be
> able to opt for using custom logic in their application instead of
> using your language-aware Analyzer.
> So my opinion is that a specialized Analyzer that can pick the right
> Analyzer implementation based on the language of the input would be
> good, as it does not prevent developers from using Brad's code
> directly.
That makes sense. At first I thought the analyzer would be a problem,
because the QueryParser has to use the same analyzer! But I guess that
this special analyzer would instantiate a language-specific analyzer to
stem the words accordingly.
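
To make the delegation idea concrete, here is a minimal sketch in plain
Java. Everything in it is hypothetical: LanguageGuesser, TextAnalyzer,
SuffixStrippingAnalyzer and LanguageAwareAnalyzer are stand-in names, not
Lucene classes and not Brad's API, and the suffix stripping is only a
placeholder for a real stemmer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Tiny demo: the same wrapper instance serves documents in two languages.
class LanguageAwareAnalyzerDemo {
    public static void main(String[] args) {
        Map<String, TextAnalyzer> analyzers = new HashMap<>();
        analyzers.put("en", new SuffixStrippingAnalyzer("s"));
        analyzers.put("nl", new SuffixStrippingAnalyzer("en"));
        // Toy guesser (assumption): anything containing "het" is Dutch.
        LanguageGuesser guesser =
            text -> text.toLowerCase(Locale.ROOT).contains("het") ? "nl" : "en";
        TextAnalyzer analyzer =
            new LanguageAwareAnalyzer(guesser, analyzers, analyzers.get("en"));
        System.out.println(analyzer.analyze("The cats sleep"));
        System.out.println(analyzer.analyze("Het huis heeft boeken"));
    }
}

// Hypothetical stand-in for a language identifier: returns a code such as "en".
interface LanguageGuesser {
    String guess(String text);
}

// Hypothetical stand-in for an analyzer: turns text into index terms.
interface TextAnalyzer {
    List<String> analyze(String text);
}

// Placeholder "stemmer": lower-cases, splits on whitespace, strips one suffix.
final class SuffixStrippingAnalyzer implements TextAnalyzer {
    private final String suffix;

    SuffixStrippingAnalyzer(String suffix) {
        this.suffix = suffix;
    }

    @Override
    public List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty token for leading whitespace
            }
            // Only strip when a reasonably long stem remains.
            boolean strip = token.endsWith(suffix)
                && token.length() > suffix.length() + 2;
            terms.add(strip
                ? token.substring(0, token.length() - suffix.length())
                : token);
        }
        return terms;
    }
}

// The wrapper: guesses the language, then hands the text to the analyzer
// registered for that language, or to the fallback if none is registered.
final class LanguageAwareAnalyzer implements TextAnalyzer {
    private final LanguageGuesser guesser;
    private final Map<String, TextAnalyzer> byLanguage;
    private final TextAnalyzer fallback;

    LanguageAwareAnalyzer(LanguageGuesser guesser,
                          Map<String, TextAnalyzer> byLanguage,
                          TextAnalyzer fallback) {
        this.guesser = guesser;
        this.byLanguage = new HashMap<>(byLanguage);
        this.fallback = fallback;
    }

    @Override
    public List<String> analyze(String text) {
        String language = guesser.guess(text);
        return byLanguage.getOrDefault(language, fallback).analyze(text);
    }
}
```

Because the wrapper exposes the same interface as the analyzers it
delegates to, the indexing side and the query side can share one instance.
Query strings are usually much shorter than documents, so the guess on the
query side may be less reliable; an application could pass the language
explicitly there instead.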

And yes, Brad's code can be used directly, of course. Brad has made a
terrific language identifier that is suitable for uses well beyond
Lucene. It works like a charm and works with international character
standards.
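
For readers unfamiliar with the approach, the quoted message below
describes the identifier as a Naive Bayes classifier. The following toy
sketch is not Brad's code and makes several assumptions of its own:
character trigrams as features, add-one smoothing, a uniform prior over
languages, and the hypothetical class name ToyLanguageIdentifier. It only
illustrates the general technique.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy Naive Bayes language identifier over character trigrams (illustrative only).
final class ToyLanguageIdentifier {
    // language -> (trigram -> count seen in training text)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    // language -> total number of trigrams seen in training text
    private final Map<String, Integer> totals = new HashMap<>();
    // every distinct trigram seen in any language
    private final Set<String> vocabulary = new HashSet<>();

    void train(String language, String sample) {
        Map<String, Integer> langCounts =
            counts.computeIfAbsent(language, k -> new HashMap<>());
        for (String gram : trigrams(sample)) {
            langCounts.merge(gram, 1, Integer::sum);
            totals.merge(language, 1, Integer::sum);
            vocabulary.add(gram);
        }
    }

    // Returns the language with the highest log-likelihood, or null if untrained.
    String identify(String text) {
        List<String> grams = trigrams(text);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> entry : counts.entrySet()) {
            Map<String, Integer> langCounts = entry.getValue();
            // Add-one smoothing; the extra 1 is a bucket for unseen trigrams.
            double denominator =
                totals.getOrDefault(entry.getKey(), 0) + vocabulary.size() + 1;
            double score = 0.0; // uniform prior, so only likelihood terms matter
            for (String gram : grams) {
                score += Math.log((langCounts.getOrDefault(gram, 0) + 1) / denominator);
            }
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    // Lower-cases, collapses whitespace, pads with spaces, slides a 3-char window.
    private static List<String> trigrams(String text) {
        String s = " " + text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim() + " ";
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }
}
```

In practice the training text per language would be far larger than a
sentence or two, which is why a rebuildable language model with its
source files matters.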

I will put together a package with an analyzer and a language model (I
will include the language source files so anybody can rebuild the model).
Give me a couple of days, because I am currently swamped with
work, but I will post the result to the list soon.

>
> Is this something that can be included in Lucene core/sandbox?
>
This is for the core/sandbox, yes.

regards,

Maurits.

> Otis
>
>
> --- maurits van wijland <m.vanwijland@quicknet.nl> wrote:
> > Dear all,
> >
> > Brad Wellington has created a language identifier which can be used in
> > combination with the Snowball stemmers donated to Lucene by Alex Murzaku.
> > I have built a solid language model for use with the language identifier,
> > covering the languages Danish, Dutch, English, Finnish, French, German,
> > Italian, Norwegian, Portuguese, Spanish and Swedish.
> >
> > The language identifier is based on a Naive Bayes classifier. Now, this
> > is all nice, but I have some integration questions, and I hope you can
> > help out.
> >
> > Basically, the process of indexing is:
> > Create an analyzer
> > Open an IndexWriter
> > Pass it the analyzer
> > Process a document
> > Add document to Index
> > Optimize writer
> > Close writer
> >
> > Now, the language identifier can automatically identify what language
> > a document is written in. Based on the identifier's suggestion, an
> > appropriate analyzer can be selected.
> >
> > This is all great, but...
> >
> > 1. Do we index all the terms from various documents in various
> >    languages into 1 index?
> > 2. Do I build a specialised Analyzer that selects the stemmer based on
> >    the Language Identifier or leave that up to the custom indexing
> >    application?
> >
> > Your thoughts please...
> >
> > regards,
> >
> > Maurits


--
To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

