lucene-dev mailing list archives

From "Steven Rowe (JIRA)" <>
Subject [jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on
Date Sun, 07 Nov 2010 21:26:06 GMT


Steven Rowe commented on LUCENE-2745:

bq. Hunh????

Okay, I think I get it now.  

I did a search for U+200C in the whole Lucene project, and I found TestPersianAnalyzer.

Apparently, Robert, when you said "the whole analyzer" and "this approach" you meant PersianAnalyzer,
rather than ArabicAnalyzer.  Sorry for the confusion.

What do you think the approach should be for Persian?  Maybe a StandardTokenizer clone that
excludes ZWNJ from the \p{Word_Break:Extend} class that gets added to every rule?  I'll see
if there is some way to compose a PersianTokenizer.jflex (using the %include directive maybe?)
using StandardTokenizerImpl.jflex, so that we don't end up with code duplication.
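
As a rough illustration of the idea above (not the actual StandardTokenizer grammar, and not
a full UAX#29 implementation), here is a minimal plain-Java sketch. The class name
ZwnjSplitSketch, the isExtendLike helper and the zwnjBreaks flag are all hypothetical names
made up for this example. It only shows the behavioural difference between treating ZWNJ
(U+200C) as an Extend-like character that stays inside a token and excluding it so that it
acts as a token boundary.

```java
import java.util.ArrayList;
import java.util.List;

public class ZwnjSplitSketch {
    static final int ZWNJ = 0x200C; // ZERO WIDTH NON-JOINER
    static final int ZWJ = 0x200D;  // ZERO WIDTH JOINER

    // Crude stand-in for \p{Word_Break:Extend}: combining marks plus ZWNJ/ZWJ.
    // This is an assumption made for the sketch, not the real Unicode property.
    static boolean isExtendLike(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.ENCLOSING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || cp == ZWNJ
            || cp == ZWJ;
    }

    // Splits text into runs of letters/digits. Extend-like characters are kept
    // inside a token they follow. When zwnjBreaks is true, ZWNJ is removed from
    // the Extend-like set and ends the current token instead.
    static List<String> tokenize(String text, boolean zwnjBreaks) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean breaks = zwnjBreaks && cp == ZWNJ;
            boolean keep = !breaks
                && (Character.isLetterOrDigit(cp)
                    || (current.length() > 0 && isExtendLike(cp)));
            if (keep) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Persian plural "books": four letters, ZWNJ, two letters.
        String word = "\u06A9\u062A\u0627\u0628\u200C\u0647\u0627";
        System.out.println(tokenize(word, false).size()); // prints 1
        System.out.println(tokenize(word, true).size());  // prints 2
    }
}
```

In a JFlex grammar the equivalent change would live in the character class definitions
rather than in a flag; whether that can be composed from StandardTokenizerImpl.jflex via
%include is the open question raised above.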

> ArabicAnalyzer - the ability to recognise email addresses host names and so on
> ------------------------------------------------------------------------------
>                 Key: LUCENE-2745
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2
>         Environment: All
>            Reporter: M Alexander
> The ArabicAnalyzer does not recognise email addresses, hostnames and so on. For example,
> will be tokenised to [adam] [hotmail] [com]
> It would be great if the ArabicAnalyzer could tokenise this to []. The
> same applies to hostnames and so on.
> Can this be resolved? I hope so
> Thanks

