lucene-dev mailing list archives

From "Steven Rowe (JIRA)"
Subject [jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard
Date Mon, 10 May 2010 16:50:18 GMT


Steven Rowe commented on LUCENE-2167:

bq. One other thing, Robert: what do you think of adding URL tokenization?

I think I would lean towards not doing this, only because of how complex a URL can be these
days. It also starts to get a little ambiguous and will likely interfere with other rules
(generating a lot of false positives).

I have written standards-based URL tokenization routines in the past.  I agree it's very complex,
but I know it's doable.

Do you have some examples of false positives?  I'd like to add tests for them.

bq. I guess I don't care much either way, if it's strict and standards-based, it probably won't
cause any harm.  But if you start allowing things like http urls without the http:// being
present, it's gonna cause some problems.

Yup, I would only accept strictly correct URLs.
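
To make "strictly correct" concrete, here is a minimal sketch of the kind of check discussed above: a candidate counts as a URL only when it carries an explicit http or https scheme and a host. This is not the JFlex grammar from the LUCENE-2167 patch. The class name StrictUrlCheck and the method isStrictUrl are hypothetical, and java.net.URI stands in for a real standards-based grammar. The sketch makes no attempt to handle internationalized domain names.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Lucene or the LUCENE-2167 patch.
final class StrictUrlCheck {

    // Returns true only for candidates with an explicit http/https scheme and a host.
    static boolean isStrictUrl(String candidate) {
        final URI uri;
        try {
            // The URI constructor rejects illegal characters such as unescaped spaces.
            uri = new URI(candidate);
        } catch (URISyntaxException e) {
            return false;
        }
        String scheme = uri.getScheme();
        if (scheme == null) {
            // Scheme-less text such as "www.example.com" is never treated as a URL.
            return false;
        }
        scheme = scheme.toLowerCase(Locale.ROOT);
        if (!scheme.equals("http") && !scheme.equals("https")) {
            return false;
        }
        // getHost() is null for opaque URIs such as "http:foo".
        String host = uri.getHost();
        return host != null && !host.isEmpty();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://lucene.apache.org/java/docs/",
            "www.example.com",
            "e.g."
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isStrictUrl(sample));
        }
    }
}
```

Requiring the scheme is what keeps ordinary dotted text such as "e.g." or "www.example.com" from matching, which is the class of false positive raised above.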

Now that international TLDs are a reality, it would be cool to be able to identify them.

> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>                 Key: LUCENE-2167
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch,
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
> It would be really nice for StandardTokenizer to adhere straight to the standard as much
> as we can with JFlex. Then its name would actually make sense.
> Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer,
> as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the English/Euro-centric stuff like the acronym/company/apostrophe stuff can stay
> with that EuropeanTokenizer, and it could be used by the European analyzers.

