Date: Thu, 6 Aug 2009 16:10:33 -0400
Subject: Re: Language Detection for Analysis?
From: Robert Muir <rcmuir@gmail.com>
To: java-user@lucene.apache.org

Shai, I mean doing language-agnostic things that apply to all of these
since they are based on the same writing system, like normalizing all
yeh characters (Arabic yeh, Farsi yeh, alef maksura) to the same form and
removing harakat, the kinds of things in ArabicNormalizationFilter and
PersianNormalizationFilter.

A parallel to this is applying "lowercase" to English, French, Dutch,
etc. It's a good idea.
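
To make the script-level idea concrete, here is a minimal sketch in plain Java. It is not the Lucene implementation: the class name ScriptNormalizer and the method normalize are hypothetical, and it covers only the two steps mentioned above (folding Farsi yeh and alef maksura to Arabic yeh, and dropping harakat in the range U+064B to U+0652). The real filters likely handle more cases.

```java
// Hypothetical illustration only; not the code of ArabicNormalizationFilter
// or PersianNormalizationFilter.
final class ScriptNormalizer {
    static final char ARABIC_YEH = '\u064A';
    static final char FARSI_YEH = '\u06CC';
    static final char ALEF_MAKSURA = '\u0649';
    // Harakat covered by this sketch: fathatan (U+064B) through sukun (U+0652).
    static final char FIRST_HARAKA = '\u064B';
    static final char LAST_HARAKA = '\u0652';

    // Returns the input with yeh variants folded to one form and harakat removed.
    static String normalize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= FIRST_HARAKA && c <= LAST_HARAKA) {
                continue; // drop the diacritic
            }
            if (c == FARSI_YEH || c == ALEF_MAKSURA) {
                c = ARABIC_YEH; // fold to a single yeh form
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The same word spelled with alef maksura and with Farsi yeh
        // should compare equal after normalization.
        String withAlefMaksura = "\u0639\u0644\u0649";
        String withFarsiYeh = "\u0639\u0644\u06CC";
        System.out.println(normalize(withAlefMaksura).equals(normalize(withFarsiYeh)));
    }
}
```

No language detection is needed for this step, because the same folding is applied to every token written in the script.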

At least in the Arabic case, you can see here the precision/recall
tradeoffs of doing just normalization, as I mentioned, versus stemming:
http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf
The benefit you see from stemming would assume you could detect the
language 100% accurately, since applying Arabic stemming as is will be
terrible on average for Persian.

so I would definitely start with ArabicTokenizer +
ArabicNormalizationFilter + PersianNormalizationFilter.

I think you could also adjust the source code. For example, I would
probably do very light stemming, at least keeping the leading و prefix for all
these languages.
Selectively applying some of the Persian "stopwords" such as the ها plural
would probably be ok across all of these as well.
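
One possible reading of the light stemming/stopword idea, sketched in plain Java rather than as a Lucene filter. Everything here is an assumption for illustration: the class LightScriptFilter, the minimum length guard, and the interpretation that "keeping" the و prefix means keeping the rule that strips a leading و from longer tokens. The stopword set contains only the ها plural marker mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not taken from the Lucene analyzers.
final class LightScriptFilter {
    // Script-wide stopwords; only the plural marker from the text above.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("\u0647\u0627"));
    static final char WAW = '\u0648';
    // Assumed guard so short words that merely start with waw are left alone.
    static final int MIN_LENGTH_TO_STRIP = 4;

    // Strips one leading waw when the token is long enough.
    static String stripLeadingWaw(String token) {
        if (token.length() >= MIN_LENGTH_TO_STRIP && token.charAt(0) == WAW) {
            return token.substring(1);
        }
        return token;
    }

    // Drops stopword tokens and lightly stems the rest.
    static List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (STOPWORDS.contains(token)) {
                continue;
            }
            out.add(stripLeadingWaw(token));
        }
        return out;
    }
}
```

The length guard is the part most worth tuning: set too low, it will mangle short words in any of these languages that legitimately begin with و.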

so I really have to wonder if the more complex approach at the end of
the day would give you better results on average than doing
normalization and maybe very light stemming/stopwords...


Hope this helps,
Robert

On Thu, Aug 6, 2009 at 4:05 PM, Shai Erera<serera@gmail.com> wrote:
> Robert - can you elaborate on what you mean by "just treat it at the script
> level"?
>
> On Thu, Aug 6, 2009 at 10:55 PM, Robert Muir <rcmuir@gmail.com> wrote:
>
>> Bradford, there is an arabic analyzer in trunk. for farsi there is
>> currently a patch available:
>> http://issues.apache.org/jira/browse/LUCENE-1628
>>
>> one option is not to detect languages at all.
>> it could be hard for short queries due to the languages you mentioned
>> borrowing from each other.
>> but you do not want to apply things like stemming to the wrong language.
>>
>> instead, you could use ArabicTokenizer + ArabicNormalizationFilter +
>> PersianNormalizationFilter and just treat it at the script level.
>>
>> On Thu, Aug 6, 2009 at 3:46 PM, Bradford
>> Stephens<bradfordstephens@gmail.com> wrote:
>> > Hey there,
>> >
>> > We're trying to add foreign language support into our new search
>> > engine -- languages like Arabic, Farsi, and Urdu (that don't work with
>> > standard analyzers). But our data source doesn't tell us which
>> > languages we're actually collecting -- we just get blocks of text. Has
>> > anyone here worked on language detection so we can figure out what
>> > analyzers to use? Are there commercial solutions?
>> >
>> > Much appreciated!
>> >
>> > --
>> > http://www.roadtofailure.com -- The Fringes of Scalability, Social
>> > Media, and Computer Science
>> >
>>
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>


-- 
Robert Muir
rcmuir@gmail.com
