Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Reply-To: dev@lucene.apache.org
Message-ID: <16963969.64741289164088150.JavaMail.jira@thor>
Date: Sun, 7 Nov 2010 16:08:08 -0500 (EST)
From: "Steven Rowe (JIRA)"
To: dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on
In-Reply-To: <7896301.57801289076022468.JavaMail.jira@thor>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

    [ https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929396#action_12929396 ]

Steven Rowe commented on LUCENE-2745:
-------------------------------------

{quote}
this is how the whole analyzer works, more examples in the tests... I can give you more refs later, when I have better bandwidth... but it's specific to this language. we shouldn't split on it in general... also often a real space is used instead, so this approach is the simplest for the language
{quote}

AFAICT, ArabicLetterTokenizer just adds non-spacing marks to the list of acceptable token characters, so they won't be used to split words. However, ZWNJ (U+200C) has the "Cf" -- Format -- general category, *not* the "Mn" general category (non-spacing marks), so as far as I can tell, the current Lucene ArabicLetterTokenizer (and hence ArabicAnalyzer) splits on ZWNJ.

None of the tests in TestArabicLetterTokenizer or TestArabicAnalyzer contains ZWNJ (U+200C).

Maybe what I'm not understanding is "this approach" in your quote above. Can you describe "this approach"?

When you wrote "we split on this and the affixes are in the stoplist", did you mean that ArabicLetterTokenizer *intentionally* breaks Persian words at ZWNJ? And then throws away the affixes that result? Hunh????

> ArabicAnalyzer - the ability to recognise email addresses host names and so on
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-2745
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2745
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2
>         Environment: All
>            Reporter: M Alexander
>
> The ArabicAnalyzer does not recognise email addresses, hostnames and so on. For example,
> adam@hotmail.com
> will be tokenised to [adam] [hotmail] [com]
> It would be great if the ArabicAnalyzer could tokenise this to [adam@hotmail.com]. The same applies to hostnames and so on.
> Can this be resolved? I hope so
> Thanks
> MAA

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
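
For reference, the general-category claim in the comment above (ZWNJ is "Cf", not "Mn") can be checked directly with java.lang.Character. The sketch below is only an illustration: the class name ZwnjCategoryCheck and its isTokenChar/tokenize helpers are hypothetical, and isTokenChar is an assumed approximation of the rule described in the comment (letters plus non-spacing marks), not a copy of the Lucene source.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: approximates the token-character rule described in the comment
// (letter OR non-spacing mark). Names here are hypothetical, not Lucene API.
final class ZwnjCategoryCheck {

    static final int ZWNJ = 0x200C;   // ZERO WIDTH NON-JOINER
    static final int FATHA = 0x064E;  // ARABIC FATHA, a non-spacing mark

    /** Assumed rule: a code point is part of a token if it is a letter or an Mn mark. */
    static boolean isTokenChar(int codePoint) {
        return Character.isLetter(codePoint)
            || Character.getType(codePoint) == Character.NON_SPACING_MARK;
    }

    /** Splits text into maximal runs of token characters under the assumed rule. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isTokenChar(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A non-token character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // ZWNJ is in general category Cf (FORMAT), so the assumed rule rejects it.
        System.out.println("ZWNJ is Cf: " + (Character.getType(ZWNJ) == Character.FORMAT));
        System.out.println("ZWNJ is a token char: " + isTokenChar(ZWNJ));
        System.out.println("FATHA is a token char: " + isTokenChar(FATHA));

        // A Persian word written with ZWNJ between prefix and stem.
        String word = "\u0645\u06CC\u200C\u062E\u0648\u0627\u0647\u0645";
        System.out.println("token count: " + tokenize(word).size());
    }
}
```

Under this assumed rule the Persian word is split into two tokens at the ZWNJ, which is the behaviour the comment is asking about. Whether that split is intentional is exactly the open question in the thread; the sketch does not settle it.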
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org