lucene-dev mailing list archives

From "Steven Rowe (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard
Date Sun, 07 Nov 2010 18:33:05 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929382#action_12929382 ]

Steven Rowe commented on LUCENE-2167:
-------------------------------------

bq. So i think its just intuitive and becoming rather universal to put quotes around things
to get a "more exact search".

You've convinced me, though I don't think this idea has been around long enough to qualify
as intuitive.

bq. hostnames are just an example, why do we recognize them and not filenames?

Although following precedent is important (principle of least surprise), we have to be able
to revisit these decisions.  My philosophy tends toward kitchen-sinkness, while allowing people
to ignore the stuff they don't want (today).  So, yeah, I think we *should* (be able to) recognize
filenames, at least as part of a URL-decomposing filter:

{noformat}http://www.example.com/path/file%20name.html?param=value#fragment{noformat}
=> 
{noformat}
http://www.example.com/path/file%20name.html?param=value#fragment <URL>
www.example.com <HOSTNAME>
example.com <HOSTNAME>
example <HOSTNAME>
com <HOSTNAME>
path <URL_PATH_ELEMENT>
file name.html <URL_FILENAME>
file name <URL_FILENAME>
file <URL_FILENAME>
name <URL_FILENAME>
html <URL_FILENAME>
param <URL_PARAMETER>
value <URL_PARAMETER_VALUE>
fragment <URL_FRAGMENT>
{noformat}

Output of each token type could be optional in a URL decomposition filter.  The URL decomposition
filter could serve as a place to handle Punycode, too.
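
A minimal sketch of the decomposition listed above, in plain Java rather than as a Lucene TokenFilter. The class name UrlDecomposer, the method decompose, and the exact hostname and filename splitting rules are assumptions inferred from the single example; none of this comes from the LUCENE-2167 patches, and it handles neither Punycode nor per-type output switches.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not a Lucene API. Requires Java 10+. */
public class UrlDecomposer {

    /** Returns "text <TYPE>" strings for one absolute URL, in the order of the example. */
    public static List<String> decompose(String url) {
        List<String> out = new ArrayList<>();
        URI uri = URI.create(url);
        out.add(url + " <URL>");

        String host = uri.getHost();
        if (host != null) {
            // Full host, then each shorter suffix that still has at least two labels.
            String suffix = host;
            out.add(suffix + " <HOSTNAME>");
            while (suffix.indexOf('.') != suffix.lastIndexOf('.')) {
                suffix = suffix.substring(suffix.indexOf('.') + 1);
                out.add(suffix + " <HOSTNAME>");
            }
            // Assumed rule: finally emit the individual labels of the shortest suffix.
            if (suffix.contains(".")) {
                for (String label : suffix.split("\\.")) {
                    out.add(label + " <HOSTNAME>");
                }
            }
        }

        String path = uri.getPath(); // percent-decoded, e.g. "/path/file name.html"
        if (path != null) {
            String[] segments = path.split("/");
            for (int i = 0; i < segments.length; i++) {
                if (segments[i].isEmpty()) {
                    continue;
                }
                // The last segment counts as a filename unless the path ends in '/'.
                boolean isFile = i == segments.length - 1 && !path.endsWith("/");
                if (isFile) {
                    addFilename(segments[i], out);
                } else {
                    out.add(segments[i] + " <URL_PATH_ELEMENT>");
                }
            }
        }

        String query = uri.getRawQuery();
        if (query != null) {
            for (String pair : query.split("&")) {
                if (pair.isEmpty()) {
                    continue;
                }
                int eq = pair.indexOf('=');
                String key = eq >= 0 ? pair.substring(0, eq) : pair;
                out.add(decode(key) + " <URL_PARAMETER>");
                if (eq >= 0 && eq < pair.length() - 1) {
                    out.add(decode(pair.substring(eq + 1)) + " <URL_PARAMETER_VALUE>");
                }
            }
        }

        if (uri.getFragment() != null) {
            out.add(uri.getFragment() + " <URL_FRAGMENT>");
        }
        return out;
    }

    /** Emits the filename, its stem, the stem's words (if more than one), and the extension. */
    private static void addFilename(String name, List<String> out) {
        out.add(name + " <URL_FILENAME>");
        int dot = name.lastIndexOf('.');
        String stem = dot > 0 ? name.substring(0, dot) : name;
        if (dot > 0) {
            out.add(stem + " <URL_FILENAME>");
        }
        String[] words = stem.trim().split("\\s+");
        if (words.length > 1) {
            for (String word : words) {
                out.add(word + " <URL_FILENAME>");
            }
        }
        if (dot > 0 && dot < name.length() - 1) {
            out.add(name.substring(dot + 1) + " <URL_FILENAME>");
        }
    }

    private static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/path/file%20name.html?param=value#fragment";
        decompose(url).forEach(System.out::println);
    }
}
```

Making each token type optional would amount to filtering this list by type before emitting it; a real filter would also need to set offsets and position increments, which this sketch ignores.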

bq. i'm not too picky how we solve the problem, but i think UAX#29 is a great default... its
used everywhere else...

I think if we remove EMAIL/HOSTNAME recognition, we need to have an alternative that provides
the same thing.  So we would have UAX#29 tokenizer as default; a UAX29+EMAIL+HOSTNAME tokenizer
as the equivalent to the pre-3.1 StandardTokenizer; and a UAX29+URL+EMAIL tokenizer (current
StandardTokenizer).  Or maybe the last two could be combined: a UAX29+URL+EMAIL tokenizer
that provides a configurable feature to not output URLs, but instead HOSTNAMEs and URL component
tokens?

> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1, 4.0
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch,
LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-lucene-buildhelper-maven-plugin.patch,
LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.patch,
LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch,
LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch,
LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch,
LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the standard as much as we can with jflex. Then its name would actually make sense.
> Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff can stay with that EuropeanTokenizer, and it could be used by the european analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

