lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard
Date Sun, 07 Nov 2010 19:19:06 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929385#action_12929385
] 

Robert Muir commented on LUCENE-2167:
-------------------------------------

{quote}
You've convinced me, though I don't think this idea has been around long enough to qualify
as intuitive.
{quote}

Well, obviously I don't have hard references for this, but from my interaction with my own
users, most of them don't even think of double quotes as doing phrases, nor are they
technical enough to know what a phrase is or what that means for a search... they just
think of it as more exact.

{quote}
I think if we remove EMAIL/HOSTNAME recognition, we need to have an alternative that provides
the same thing. So we would have UAX#29 tokenizer as default; a UAX29+EMAIL+HOSTNAME tokenizer
as the equivalent to the pre-3.1 StandardTokenizer; and a UAX29+URL+EMAIL tokenizer (current
StandardTokenizer). Or maybe the last two could be combined: a UAX29+URL+EMAIL tokenizer that
provides a configurable feature to not output URLs, but instead HOSTNAMEs and URL component
tokens?
{quote}
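
The configurable variant floated in the quote above could look roughly like the following
sketch. It is only an illustration under stated assumptions: UrlTokenSketch, tokens, and the
keepWholeUrl flag are hypothetical names, not part of Lucene or of any patch on this issue,
and it uses java.net.URI on an already-isolated URL string rather than a jflex grammar.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Lucene API.
public class UrlTokenSketch {

    /**
     * Returns the tokens for one URL string.
     * keepWholeUrl = true  : emit the URL as a single token.
     * keepWholeUrl = false : emit the hostname plus each non-empty path segment.
     */
    public static List<String> tokens(String url, boolean keepWholeUrl) {
        List<String> out = new ArrayList<>();
        if (keepWholeUrl) {
            out.add(url);
            return out;
        }
        URI uri = URI.create(url);
        if (uri.getHost() != null) {
            out.add(uri.getHost());
        }
        String path = uri.getPath();
        if (path != null) {
            for (String segment : path.split("/")) {
                if (!segment.isEmpty()) {
                    out.add(segment);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String url = "https://issues.apache.org/jira/browse/LUCENE-2167";
        System.out.println(tokens(url, true));
        System.out.println(tokens(url, false));
    }
}
```

A single flag keeps the two behaviors in one code path, which is the trade-off the quote
raises: one tokenizer with an option versus two separate tokenizers.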

Well, like I said, I'm not particularly picky, especially since someone can always use
ClassicTokenizer to get the old behavior, which no one could ever agree on, and there were
constantly issues about it not recognizing my company's name, etc.

To some extent, I like UAX#29 because there's someone else making and standardizing the
decisions, validating that it's not going to annoy users of major languages, and making sure
it works well by default. It's not going to be the most full-featured tokenizer, but there's
little chance it will be really annoying: I think this is great for "defaults".

As for all the other "bonus" stuff, we can always make options, especially if it's some
pluggable thing (sorry, I'm not sure how this could work in jflex) where you could have
options as to what you want to do.

But again, I think UAX#29 itself is more than sufficient by default, and even hostname
recognition etc. is pretty dangerous *by default* (again, my example: searching partial
hostnames should be flexible for the end user and not baked in, by letting them use quotes).


> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1, 4.0
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch,
LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-lucene-buildhelper-maven-plugin.patch,
LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.patch,
LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch,
LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch,
LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch,
LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the standard as much as we can with jflex. Then its name would actually make sense.
> Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the English/Euro-centric stuff like the acronym/company/apostrophe stuff can stay with that EuropeanTokenizer, and it could be used by the European analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



