lucene-dev mailing list archives

From "Steven Rowe (JIRA)" <>
Subject [jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard
Date Mon, 10 May 2010 18:54:30 GMT


Steven Rowe commented on LUCENE-2167:

Good point, Marvin - indexing URLs makes no sense without query support for them.  (Is this
a stupid can of worms for me to have opened?)  I have used Lucene tokenizers for purposes
other than retrieval (e.g. term vectors as input to other processes), and I suspect I'm not
alone.  The ability to extract URLs would be very nice.

Ideally, URL analysis would produce both the full URL as a single token and, as overlapping
tokens, its components: the hostname, path components, etc.  However, I don't think it's a
good idea for the tokenizer itself to output overlapping tokens - I suspect this would break
more than a few things.

A filter that breaks URL-type tokens into their components, and then either adds them as
overlapping tokens or replaces the full URL with the components, should be easy to write, though.
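To make the idea concrete, here is a minimal standalone sketch of the component extraction such a filter would perform.  It uses only java.net.URI rather than the Lucene TokenFilter API, and the class and method names are illustrative, not part of any patch on this issue; in a real filter each component would be emitted at the same position as the full URL (positionIncrement = 0) to make them overlapping tokens.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Illustrative helper: break a URL into the component tokens a
// hypothetical URL-decomposing TokenFilter might emit alongside
// (or instead of) the full URL.
public class UrlComponents {
    public static List<String> split(String url) {
        List<String> parts = new ArrayList<>();
        URI uri = URI.create(url);
        if (uri.getHost() != null) {
            parts.add(uri.getHost());              // hostname
        }
        String path = uri.getPath();
        if (path != null) {
            for (String seg : path.split("/")) {   // path components
                if (!seg.isEmpty()) {
                    parts.add(seg);
                }
            }
        }
        if (uri.getQuery() != null) {
            parts.add(uri.getQuery());             // query string
        }
        return parts;
    }

    public static void main(String[] args) {
        // Prints: [lucene.apache.org, java, docs, index.html]
        System.out.println(split("http://lucene.apache.org/java/docs/index.html"));
    }
}
```

Wrapping this in an actual TokenFilter would then just be a matter of buffering the components and setting the position increment attribute to zero for each one after the first.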

> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>                 Key: LUCENE-2167
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch,
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
> It would be really nice for StandardTokenizer to adhere straight to the standard as much
as we can with jflex. Then its name would actually make sense.
> Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer,
as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff can stay
with that EuropeanTokenizer, and it could be used by the european analyzers.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

