lucene-java-user mailing list archives

From "Nader S. Henein" <...@bayt.net>
Subject RE: Internationalization - Arabic Language Support
Date Sat, 29 Jun 2002 08:29:45 GMT
I'm indexing Arabic in my index, and to make it searchable I had to switch
character sets (not fun). The problem lies in the weak standards surrounding
Arabic character sets: between ISO 8859-6, win-1256 and UTF-8 you can have
three different representations of the exact same thing. UTF-8 stores Arabic
in numeric form (the code that represents each letter), and the Lucene
analyzer isn't too friendly with numbers, especially if you use a stemmer.
The other two encodings differ from each other, but both come back to the
same result: Lucene views them as if they were European character sets and
tries to apply the same rules to them.

So take care when you're indexing Arabic. I only figured it out when I
started experimenting with different Unix charset settings while encoding,
because I have an Oracle DB that spits out the XML files on a Solaris OS and
then Lucene picks them up for encoding. Since my core application isn't in
Java, I have to contend with two web servers: the main application
(AOLserver) and the search application (Lucene on Resin).

When trying to figure out encoding issues, you need to convert everything to
its simplest form and compare and contrast it as it passes through your
application.
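
One way to do that comparison is to dump the raw bytes of a known word under
each charset and check what comes back when the bytes are decoded. The sketch
below is only an illustration: the class and method names are made up for this
example, and it assumes the JVM ships the ISO-8859-6 and windows-1256 charsets
(a full JDK normally does).

```java
import java.nio.charset.Charset;

// Hypothetical helper for comparing how one Arabic word is represented
// in the three encodings mentioned above.
class ArabicEncodingCheck {

    // Encode a Java (Unicode) string into bytes using the named charset.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    // Decode bytes back into a Java string using the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // Render bytes as space-separated hex so they can be compared by eye.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The word "Arabic" in Arabic: ain, reh, beh, yeh.
        String word = "\u0639\u0631\u0628\u064A";
        String[] charsets = {"ISO-8859-6", "windows-1256", "UTF-8"};
        for (String name : charsets) {
            byte[] bytes = encode(word, name);
            // Decoding with the same charset should give the original word back.
            boolean roundTrips = decode(bytes, name).equals(word);
            System.out.println(name + ": " + hex(bytes) + " (round-trips: " + roundTrips + ")");
        }
        // Decoding with the wrong charset silently produces different letters.
        String wrong = decode(encode(word, "windows-1256"), "ISO-8859-6");
        System.out.println("windows-1256 bytes read as ISO-8859-6 match: " + wrong.equals(word));
    }
}
```

If the bytes are decoded with the right charset at the point where the XML is
read, the analyzer should see the same Unicode string whichever of the three
encodings the file was written in.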

Nader

-----Original Message-----
From: W. Eliot Kimber [mailto:eliot@isogen.com]
Sent: Friday, June 28, 2002 6:59 PM
To: Lucene Users List
Subject: Re: Internationalization - Arabic Language Support


Peter Carlson wrote:

> The biggest part that is usually changed per language is the analyzer. This
> is the part of Lucene which transforms and breaks up a string into distinct
> terms.

I have only the smallest understanding of Arabic as a language, but I
have done some work to implement back-of-the-book indexing of Arabic
(and other languages) for XSL/XSLT. Based on that experience, I think
that the main challenges in implementing an Arabic analyzer would be:

1. Understanding the stemming rules for Arabic. Our research into Arabic
collation revealed that the rules for how Arabic words are formed are not
nearly as simple as in English and other Western languages. At this
point we haven't stepped up to trying to implement (or find an
implementation for) Arabic stemming for collation (words are collated
first by their roots, which are not necessarily at the start of the
words, so simple lexical collation won't work for Arabic and I'm
assuming that full-text indexing by word roots would have the same
problem). So I don't know more than that the problem is hard, even for
native speakers of Arabic.

2. Handling different letter forms in queries--Semitic languages often
have different forms for the same abstract character for different
positions in a word: initial forms, final forms, and base forms. These
different forms have different Unicode code points (although initial and
final forms are identified as such in the Unicode database). Often a
word will be stored with the base forms but the presented word will be
transformed to use the appropriate initial or final form. This means,
for example, that cutting and pasting a word from, say, a PDF document
into a query might require rationalization of variant forms to base
forms before performing the search (assuming that the analyzer also
reduces all letters to their base forms for indexing).

3. Right-to-left entry of queries and presentation of results. Mixing
right-to-left data with left-to-right data can get pretty tricky at the
user interface level (it's not an issue at the data storage level, where
all characters are stored in order of occurrence regardless of
presentation direction). Good support for bidirectional input and
presentation is hit and miss at best. For example, we could not figure
out how to get Internet Explorer to correctly present mixed English and
Arabic where there were lots of special characters (as opposed to simple
flowed prose, which seems to work OK).  I would expect Arabic localized
Web browsers to handle input OK, but it might be hard to find GUI
toolkits that do it well.
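
For the letter-form issue in point 2, one possible approach (an assumption on
my part, not something tested against Lucene) is to fold presentation forms to
base letters with Unicode compatibility normalization before both indexing and
querying. The sketch below uses java.text.Normalizer from the JDK; the class
and method names are invented for illustration. Note that NFKC also folds
other compatibility characters (ligatures, full-width forms), which may or may
not be wanted.

```java
import java.text.Normalizer;

// Hypothetical helper: reduce Arabic presentation forms to base letters.
class ArabicFormFolding {

    // NFKC replaces compatibility characters, including the initial, medial
    // and final forms in the Arabic Presentation Forms blocks, with their
    // base code points.
    static String toBaseForms(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // k-t-b as it might arrive from a PDF: KAF initial form (U+FEDB),
        // TEH medial form (U+FE98), BEH final form (U+FE90).
        String pasted = "\uFEDB\uFE98\uFE90";
        // The same word stored with base letters: U+0643, U+062A, U+0628.
        String stored = "\u0643\u062A\u0628";

        System.out.println("equal before folding: " + pasted.equals(stored));
        System.out.println("equal after folding:  " + toBaseForms(pasted).equals(stored));
    }
}
```

The same folding would have to be applied on the indexing side and on the
query side, otherwise the terms still would not match.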

IBM's ICU4J package, a collection of national language support utilities
and libraries, might offer some solutions to this problem but I have not
yet investigated its support for Arabic and similar languages (we used
it for its Thai word breaker, which would be needed to implement a Thai
analyzer for Lucene).
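
Separately from ICU4J, the JDK itself has a java.text.Bidi class that can be
used to see how mixed-direction text from point 3 breaks into directional
runs. This is only a sketch for inspecting the data, not a fix for the
presentation problems described above; the class and method names are
hypothetical.

```java
import java.text.Bidi;

// Hypothetical helper: inspect the directional runs of a string that is
// stored in logical (typing) order.
class BidiRuns {

    // Analyze the text, taking the base direction from its first strongly
    // directional character and defaulting to left-to-right.
    static Bidi analyze(String logicalText) {
        return new Bidi(logicalText, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
    }

    // List each run with its offsets and direction (odd level = right-to-left).
    static String describeRuns(String logicalText) {
        Bidi bidi = analyze(logicalText);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bidi.getRunCount(); i++) {
            String direction = (bidi.getRunLevel(i) % 2 == 1) ? "RTL" : "LTR";
            sb.append("[").append(bidi.getRunStart(i)).append(",")
              .append(bidi.getRunLimit(i)).append(") ").append(direction).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // English words around an Arabic word, all stored in order of occurrence.
        String text = "Lucene \u0628\u062D\u062B index";
        System.out.println("mixed directions: " + analyze(text).isMixed());
        System.out.print(describeRuns(text));
    }
}
```

The stored string never changes; only the run levels tell the display layer
which stretches to lay out right-to-left.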

Cheers,

Eliot
--
W. Eliot Kimber, eliot@isogen.com
Consultant, ISOGEN International

1016 La Posada Dr., Suite 240
Austin, TX  78752 Phone: 512.656.4139

--
To unsubscribe, e-mail:
<mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail:
<mailto:lucene-user-help@jakarta.apache.org>



