lucene-dev mailing list archives

From "Steven Rowe (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard
Date Fri, 14 May 2010 21:58:47 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867697#action_12867697 ]

Steven Rowe commented on LUCENE-2167:
-------------------------------------

I think UAX29Tokenizer should remain as-is, except that some valid letter chars (Lao/Myanmar,
I think) are being dropped rather than returned as singletons, as CJ chars are now.  I need
to augment the tests and make sure that valid word/number chars are not being dropped.  Also,
I want to add full-width numeric chars to the {NumericEx} macro.
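
A minimal sketch of the kind of test check described above, assuming tokens are available as [start, end) character offsets into the input. The class DroppedCharCheck and the method droppedWordChars are hypothetical names used only for illustration; they are not part of the patch or of Lucene, and Character.isLetterOrDigit is only a rough stand-in for "valid word/number char" (it does not cover combining marks, for example).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the LUCENE-2167 patch.
public class DroppedCharCheck {

    /**
     * Returns the letter/digit code points of input that fall outside every
     * token span. Each span is {start, end} with end exclusive (assumed shape).
     */
    public static List<Integer> droppedWordChars(String input, int[][] tokenOffsets) {
        boolean[] covered = new boolean[input.length()];
        for (int[] span : tokenOffsets) {
            for (int i = span[0]; i < span[1]; i++) {
                covered[i] = true;
            }
        }
        List<Integer> dropped = new ArrayList<>();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (!covered[i] && Character.isLetterOrDigit(cp)) {
                dropped.add(cp);
            }
            i += Character.charCount(cp); // step over surrogate pairs as one unit
        }
        return dropped;
    }

    public static void main(String[] args) {
        // "abc", a Lao letter (U+0E81), and two full-width digits (U+FF11, U+FF12).
        String input = "abc \u0E81 \uFF11\uFF12";
        // Simulated tokenizer output that skips the Lao letter.
        int[][] offsets = { {0, 3}, {6, 8} };
        for (int cp : droppedWordChars(input, offsets)) {
            System.out.println("dropped: U+" + Integer.toHexString(cp).toUpperCase());
        }
        // Full-width digits are ordinary decimal digits as far as java.lang.Character is concerned.
        System.out.println(Character.isDigit('\uFF11'));
    }
}
```

A check like this only reports what was skipped; deciding whether a skipped char should become a singleton token is still up to the grammar.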

A separate replacement StandardTokenizer class should have standards-based email and URL
tokenization.  The current StandardTokenizer gets part of the way there, but it doesn't support
some valid email addresses, and while it recognizes host/domain names, it doesn't recognize
full URLs.  I want to get this done before anything in this issue is committed.
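
To illustrate the host-name versus full-URL distinction, here is a small sketch that uses the JDK's java.net.URI parser. This is not how a JFlex grammar would recognize URLs and is not taken from the patch; the class UrlVsHost and the method isFullUrl are hypothetical, and "full URL" is assumed here to mean "has both a scheme and a host".

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration, not part of the LUCENE-2167 patch.
public class UrlVsHost {

    /** True when text parses as a URI that carries both a scheme and a host. */
    public static boolean isFullUrl(String text) {
        try {
            URI uri = new URI(text);
            return uri.getScheme() != null && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false; // not parseable as a URI at all
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://lucene.apache.org/java/docs/index.html", // scheme + host + path
            "lucene.apache.org",                             // bare host name, no scheme
            "dev-help@lucene.apache.org"                     // email address, not a URL
        };
        for (String s : samples) {
            System.out.println(s + " -> full URL? " + isFullUrl(s));
        }
    }
}
```

A bare host name parses as a relative path with no scheme, which is why only the first sample counts as a full URL under this assumption.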

Then (after this issue is committed), in separate issues, we can add EnglishTokenizer (for
things like acronyms and maybe removing possessives, which the current StandardFilter does),
and then, as needed, other language-specific tokenizers.

bq. I still want to rip StandardTokenizer out of lucene core and into modules. I think that's
not too far away and it's probably better to do this afterwards, but we can do it before that
time if you want; doesn't matter to me.

I'll finish the UAX29Tokenizer fixes this weekend, but I think it'll take me a week or so
to get the URL/email tokenization in place.

> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the standard as much
> as we can with jflex. Then its name would actually make sense.
> Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer,
> as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff can stay
> with that EuropeanTokenizer, and it could be used by the european analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

