From: "Steven Rowe (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on
Date: Sun, 7 Nov 2010 15:16:06 -0500 (EST)
Message-ID: <155782.64611289160966236.JavaMail.jira@thor>
In-Reply-To: <7896301.57801289076022468.JavaMail.jira@thor>
Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm

    [ https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929392#action_12929392 ]

Steven Rowe commented on LUCENE-2745:
-------------------------------------

bq. steven, check out the link at the bottom of that article.

Yup, did that.

bq. especially the top... it explains the use in the language, particularly to block cursive joining for prefixes, suffixes, compounds. we split on this and the affixes are in the stoplist

Um, like I said, Persian uses ZWNJs as display hints, not as word separators. According to the [ICU web demo|http://demo.icu-project.org/icu-bin/ubrowse?go=200C], ZWNJs have the \p{Word_Break:Extend} property, so the Lucene UAX#29-based tokenizers will *not* split on this char.

What am I not getting?

> ArabicAnalyzer - the ability to recognise email addresses host names and so on
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-2745
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2745
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2
>         Environment: All
>            Reporter: M Alexander
>
> The ArabicAnalyzer does not recognise email addresses, hostnames and so on. For example,
> adam@hotmail.com
> will be tokenised to [adam] [hotmail] [com]
> It would be great if the ArabicAnalyzer could tokenise this to [adam@hotmail.com]. The same applies to hostnames and so on.
> Can this be resolved? I hope so
> Thanks
> MAA
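
To make the ZWNJ point in the comment above concrete, here is a rough sketch in Python. It is not the Lucene tokenizer and not a UAX#29 implementation. It only approximates Word_Break=Extend/Format by Unicode general category (Mn, Me, Mc, Cf), which is an assumption made for illustration, and the function names are hypothetical. It contrasts a tokenizer that keeps ZWNJ (U+200C) inside the token with one that splits on it, using a Persian word written with a ZWNJ between prefix and stem.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER, general category Cf


def is_word_char(ch):
    # Letters and digits start or continue a token.
    return unicodedata.category(ch)[0] in ("L", "N")


def is_attaching(ch):
    # Assumption: marks and format characters stand in for
    # Word_Break=Extend/Format. The real property comes from the
    # Unicode WordBreakProperty data, not from general category.
    return unicodedata.category(ch) in ("Mn", "Me", "Mc", "Cf")


def tokenize_keep_extend(text):
    # Attaching characters never open a token, but they stay glued
    # to a token that is already in progress.
    tokens, current = [], []
    for ch in text:
        if is_word_char(ch) or (current and is_attaching(ch)):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def tokenize_split_on_zwnj(text):
    # The alternative behaviour: treat ZWNJ as a word separator.
    return tokenize_keep_extend(text.replace(ZWNJ, " "))


# Persian word with a ZWNJ between the prefix and the stem.
word = "\u0645\u06cc" + ZWNJ + "\u062e\u0648\u0627\u0647\u0645"

kept = tokenize_keep_extend(word)     # one token, ZWNJ preserved inside it
split = tokenize_split_on_zwnj(word)  # two tokens: prefix and stem
print(len(kept), len(split))
```

With the first function the prefix stays attached to the stem, which is the behaviour described for the UAX#29-based tokenizers. With the second, the prefix becomes a separate token that a stoplist could then remove, which is the behaviour described in the quoted reply.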