lucene-java-user mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: Language Detection for Analysis?
Date Thu, 06 Aug 2009 20:16:05 GMT
Thanks Robert for the explanation. I thought that you meant something
different, like doing stemming in some sophisticated manner by somehow
detecting the language. Doing these normalizations makes sense of course,
especially if the letters look similar.

Thanks again,

Shai
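
For readers following the thread, the script-level normalizations discussed here (folding the yeh variants to one form and removing harakat) can be sketched in plain Java. This is only an illustration using the standard library, not the code of ArabicNormalizationFilter or PersianNormalizationFilter. The class name `ScriptNormalizer`, the choice of Arabic yeh as the target form, and the range U+064B to U+0652 for the removed marks are assumptions of this sketch.

```java
// Illustrative sketch only; ScriptNormalizer is a hypothetical name,
// not a Lucene class.
final class ScriptNormalizer {
    static final char ARABIC_YEH = '\u064A';
    static final char FARSI_YEH = '\u06CC';
    static final char ALEF_MAKSURA = '\u0649';
    // Assumed range of marks to drop: FATHATAN (U+064B) through SUKUN (U+0652).
    static final char FIRST_MARK = '\u064B';
    static final char LAST_MARK = '\u0652';

    /** Folds yeh variants to Arabic yeh and removes the marks in the range above. */
    static String normalize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= FIRST_MARK && c <= LAST_MARK) {
                continue; // drop the vowel mark
            }
            if (c == FARSI_YEH || c == ALEF_MAKSURA) {
                out.append(ARABIC_YEH); // one canonical yeh for all languages
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The same word written with Farsi yeh and with alef maksura
        // should compare equal after normalization.
        String withFarsiYeh = "\u0639\u0644\u06CC";
        String withAlefMaksura = "\u0639\u0644\u0649";
        System.out.println(normalize(withFarsiYeh).equals(normalize(withAlefMaksura)));
    }
}
```

Because the mapping never looks at which language the text is in, it plays the same role that lowercasing plays for Latin-script languages.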

On Thu, Aug 6, 2009 at 11:10 PM, Robert Muir <rcmuir@gmail.com> wrote:

> Shai, I mean doing language-agnostic things that apply to all of these
> since they are based on the same writing system, like normalizing all
> yeh characters (arabic yeh, farsi yeh, alef maksura) to the same form,
> removing harakat, the kinds of things in ArabicNormalizationFilter and
> PersianNormalizationFilter.
>
> A parallel to this is applying "lowercase" to English, French, Dutch,
> etc. It's a good idea.
>
> At least in the Arabic case, you can see here the precision/recall
> tradeoffs of doing just normalization, as I mentioned, versus stemming:
> http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf
> The benefit you see from stemming assumes you could detect the
> language 100% accurately, since applying Arabic stemming as is will be
> terrible on average for Persian.
>
> so I would definitely start with ArabicTokenizer +
> ArabicNormalizationFilter + PersianNormalizationFilter.
>
> I think you could also adjust the source code. For example, I would
> probably do very light stemming, at least keeping the leading و prefix
> for all these languages.
> Selectively applying some of the Persian "stopwords", such as the ها plural,
> would probably be OK across all of these as well.
>
> so I really have to wonder if the more complex approach at the end of
> the day would give you better results on average than doing
> normalization and maybe very light stemming/stopwords...
>
>
> Hope this helps,
> Robert
>
> On Thu, Aug 6, 2009 at 4:05 PM, Shai Erera<serera@gmail.com> wrote:
> > Robert - can you elaborate on what you mean by "just treat it at the
> > script level"?
> >
> > On Thu, Aug 6, 2009 at 10:55 PM, Robert Muir <rcmuir@gmail.com> wrote:
> >
> >> Bradford, there is an arabic analyzer in trunk. for farsi there is
> >> currently a patch available:
> >> http://issues.apache.org/jira/browse/LUCENE-1628
> >>
> >> one option is not to detect languages at all.
> >> it could be hard for short queries due to the languages you mentioned
> >> borrowing from each other.
> >> but you do not want to apply things like stemming to the wrong language.
> >>
> >> instead, you could use ArabicTokenizer + ArabicNormalizationFilter +
> >> PersianNormalizationFilter and just treat it at the script level.
> >>
> >> On Thu, Aug 6, 2009 at 3:46 PM, Bradford
> >> Stephens<bradfordstephens@gmail.com> wrote:
> >> > Hey there,
> >> >
> >> > We're trying to add foreign language support into our new search
> >> > engine -- languages like Arabic, Farsi, and Urdu (that don't work with
> >> > standard analyzers). But our data source doesn't tell us which
> >> > languages we're actually collecting -- we just get blocks of text. Has
> >> > anyone here worked on language detection so we can figure out what
> >> > analyzers to use? Are there commercial solutions?
> >> >
> >> > Much appreciated!
> >> >
> >> > --
> >> > http://www.roadtofailure.com -- The Fringes of Scalability, Social
> >> > Media, and Computer Science
> >> >
> >>
> >>
> >>
> >> --
> >> Robert Muir
> >> rcmuir@gmail.com
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
>
>
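
As a note on the "treat it at the script level" suggestion above: choosing an analysis chain by writing system, rather than by detected language, needs only a check of which Unicode script the letters belong to. The sketch below uses the JDK's `Character.UnicodeScript`. The class name `ScriptRouter`, its method names, and the 0.5 threshold are assumptions made for illustration and do not come from the thread or from Lucene.

```java
// Illustrative sketch only; ScriptRouter and its threshold are hypothetical.
final class ScriptRouter {
    /** Returns the fraction of letters in the text that are in the Arabic script. */
    static double arabicScriptRatio(String text) {
        int letters = 0;
        int arabic = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // ignore digits, spaces, punctuation, combining marks
            }
            letters++;
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC) {
                arabic++;
            }
        }
        return letters == 0 ? 0.0 : (double) arabic / letters;
    }

    /** Assumed rule: use the Arabic-script chain when most letters are Arabic script. */
    static boolean useArabicScriptChain(String text) {
        return arabicScriptRatio(text) > 0.5;
    }

    public static void main(String[] args) {
        System.out.println(useArabicScriptChain("\u0633\u0644\u0627\u0645"));
        System.out.println(useArabicScriptChain("hello"));
    }
}
```

Arabic, Farsi, and Urdu text all pass this check in the same way, so short queries with borrowed words do not need to be assigned to a single language before analysis.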
