lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Muir <rcm...@gmail.com>
Subject Re: Language Detection for Analysis?
Date Thu, 06 Aug 2009 20:10:33 GMT
Shai, I mean doing language-agnostic things that apply to all of these
since they are based on the same writing system, like normalizing all
yeh characters (arabic yeh, farsi yeh, alef maksura) to the same form,
removing harakat, the kinds of things in ArabicNormalizationFilter and
PersianNormalizationFilter.

A parallel to this is doing "lowercase" to english, french, dutch,
etc. Its a good idea.

at least in the arabic case you can see here the precision/recall
tradeoffs of doing just normalization as I mentioned versus stemming :
http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf
the benefit you see from stemming would assume you could language
detect 100% accurately, since applying arabic stemming as is will be
terrible on average for persian

so I would definitely start with ArabicTokenizer +
ArabicNormalizationFilter + PersianNormalizationFilter.

i think you could also adjust the source code, for example I would
probably very light stemming at least keeping leading و prefix for all
these languages at least.
selectively applying some of the persian "stopwords" such as ها plural
would probably be ok across all of these as well.

so I really have to wonder if the more complex approach at the end of
the day would give you better results on average than doing
normalization and maybe very light stemming/stopwords...


Hope this helps,
Robert

On Thu, Aug 6, 2009 at 4:05 PM, Shai Erera<serera@gmail.com> wrote:
> Robert - can you elaborate on what you mean by "just treat it at the script
> level"?
>
> On Thu, Aug 6, 2009 at 10:55 PM, Robert Muir <rcmuir@gmail.com> wrote:
>
>> Bradford, there is an arabic analyzer in trunk. for farsi there is
>> currently a patch available:
>> http://issues.apache.org/jira/browse/LUCENE-1628
>>
>> one option is not to detect languages at all.
>> it could be hard for short queries due to the languages you mentioned
>> borrowing from each other.
>> but you do not want to apply things like stemming to the wrong language.
>>
>> instead, you could use ArabicTokenizer + ArabicNormalizationFilter +
>> PersianNormalizationFilter and just treat it at the script level.
>>
>> On Thu, Aug 6, 2009 at 3:46 PM, Bradford
>> Stephens<bradfordstephens@gmail.com> wrote:
>> > Hey there,
>> >
>> > We're trying to add foreign language support into our new search
>> > engine -- languages like Arabic, Farsi, and Urdu (that don't work with
>> > standard analyzers). But our data source doesn't tell us which
>> > languages we're actually collecting -- we just get blocks of text. Has
>> > anyone here worked on language detection so we can figure out what
>> > analyzers to use? Are there commercial solutions?
>> >
>> > Much appreciated!
>> >
>> > --
>> > http://www.roadtofailure.com -- The Fringes of Scalability, Social
>> > Media, and Computer Science
>> >
>>
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>



-- 
Robert Muir
rcmuir@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message