lucene-java-user mailing list archives

From "Safarnejad, Ali (AFIS)" <Ali.Safarne...@fao.org>
Subject RE: Aramorph Analyzer
Date Thu, 16 Dec 2004 10:59:23 GMT
Actually, one thing worth mentioning about search: when searching for whole
phrases, if the phrase contains any ambiguous words, the search fails to find
the document, even if the phrase was copied and pasted from the original
document.
So for example, I have a document containing this phrase: الأجهـــزة الرياسية
للمنظمة
The first two words each have only one stem, but the last word has two stems:
munaZ~im and munaZ~am.
So the entire search query becomes: "Al>jh___zp riyAsiy~ munaZ~im munaZ~am",
which fails to find any matching documents,
whereas a search for "Al>jh___zp riyAsiy~" succeeds.
Even placing the accent over the ZAH (ظ) does not disambiguate the search.
Has anyone found a workaround for this?
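
One possible explanation, which is an assumption and not something confirmed
in this thread: at index time the two stems of the ambiguous word may be
stored at the same token position, while the phrase query lists them as two
consecutive terms, so the query asks for four positions where the document
has only three. The sketch below is plain Java with no Lucene calls; the class
and method names are hypothetical and only model that position arithmetic.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of phrase matching over token positions (not Lucene code).
class PhrasePositionSketch {

    // Flat layout: every stem gets its own position, one after another.
    static List<Set<String>> flat(String[][] stemsPerWord) {
        List<Set<String>> positions = new ArrayList<>();
        for (String[] stems : stemsPerWord) {
            for (String stem : stems) {
                Set<String> slot = new HashSet<>();
                slot.add(stem);
                positions.add(slot);
            }
        }
        return positions;
    }

    // Stacked layout: all alternative stems of one word share one position.
    static List<Set<String>> stacked(String[][] stemsPerWord) {
        List<Set<String>> positions = new ArrayList<>();
        for (String[] stems : stemsPerWord) {
            Set<String> slot = new HashSet<>();
            for (String stem : stems) {
                slot.add(stem);
            }
            positions.add(slot);
        }
        return positions;
    }

    // Exact phrase match: at some offset, each query slot must share
    // at least one term with the document slot at the same relative position.
    static boolean matches(List<Set<String>> doc, List<Set<String>> query) {
        for (int start = 0; start + query.size() <= doc.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < query.size() && ok; i++) {
                Set<String> common = new HashSet<>(query.get(i));
                common.retainAll(doc.get(start + i));
                ok = !common.isEmpty();
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stems taken from the example phrase in this message.
        String[][] phrase = {
            {"Al>jh___zp"}, {"riyAsiy~"}, {"munaZ~im", "munaZ~am"}
        };
        String[][] firstTwoWords = { {"Al>jh___zp"}, {"riyAsiy~"} };

        // Assumption: the index stacks alternative stems at one position.
        List<Set<String>> doc = stacked(phrase);

        System.out.println("four-term flat query: " + matches(doc, flat(phrase)));
        System.out.println("two-term flat query:  " + matches(doc, flat(firstTwoWords)));
        System.out.println("stacked query:        " + matches(doc, stacked(phrase)));
    }
}
```

If that assumption holds, a workaround would be a phrase query that accepts
either stem at the third position. Lucene's MultiPhraseQuery is described as
allowing several terms at one position, though whether it is available
depends on the release in use.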

ali


-----Original Message-----
From: Safarnejad, Ali (AFIS) 
Sent: 16 December 2004 10:23
To: Lucene Users List
Subject: Aramorph Analyzer


I wanted to share some results from trying out the Aramorph Arabic Analyzer
with Lucene.  I experimented with a set of 100 web documents in Windows-1256
encoding.  The indexing took just over 200 seconds, although I had to
increase the heap size to 500 MB, or I would get OutOfMemory exceptions
halfway through the documents.  The 200 seconds includes the time to make the
URL connection and tidy the documents to extract the text.

Has anyone done similar experiments with a larger set of Arabic documents?
I'm interested in hearing from anyone else who has used Aramorph with Lucene.

Thanks,
Ali

