lucene-dev mailing list archives

From "Peter Karich (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-2059) Allow customizing how WordDelimiterFilter tokenizes text.
Date Wed, 25 Aug 2010 19:16:16 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902588#action_12902588 ]

Peter Karich commented on SOLR-2059:
------------------------------------

Robert,

thanks for this work! I have a different application for this patch: in a Twitter search,
# and @ shouldn't be removed. Instead, I think I will handle them like ALPHA.
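
For illustration, a mapping along these lines might do it. This is only a sketch in the
format from the issue description: it assumes the \uXXXX escape shown there for ','
also works for '#' (U+0023), since a bare '#' at the start of a line begins a comment.
{noformat}
# Keep Twitter hashtag and mention markers attached to the word
\u0023 => ALPHA
@ => ALPHA
{noformat}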

Would you mind updating the patch for the latest version of trunk? I got a problem with
WordDelimiterIterator at line 254 when using https://svn.apache.org/repos/asf/lucene/dev/trunk/solr
and a "file is missing" problem (line 37) with http://svn.apache.org/repos/asf/solr

> Allow customizing how WordDelimiterFilter tokenizes text.
> ---------------------------------------------------------
>
>                 Key: SOLR-2059
>                 URL: https://issues.apache.org/jira/browse/SOLR-2059
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2059.patch
>
>
> By default, WordDelimiterFilter assigns 'types' to each character (computed from Unicode Properties).
> Based on these types and the options provided, it splits and concatenates text.
> In some circumstances, you might need to tweak the behavior of how this works.
> It seems the filter already had this in mind, since you can pass in a custom byte[] type table.
> But it's not exposed in the factory.
> I think you should be able to customize the defaults with a configuration file:
> {noformat}
> # A customized type mapping for WordDelimiterFilterFactory
> # the allowable types are: LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM
> # 
> # the default for any character without a mapping is always computed from 
> # Unicode character properties
> # Map the $, %, '.', and ',' characters to DIGIT 
> # This might be useful for financial data.
> $ => DIGIT
> % => DIGIT
> . => DIGIT
> \u002C => DIGIT
> {noformat}
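
To make the mapping format quoted above concrete, here is a minimal sketch in Java of how
such a file could be read. It is not the code from SOLR-2059.patch: the class
TypeMappingSketch, the CharType enum, and the parse and unescape helpers are hypothetical
names invented for this illustration, and the result is a plain map rather than the
byte[] type table that the filter accepts.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Solr or the patch.
class TypeMappingSketch {

    // The type names listed in the example configuration file.
    enum CharType { LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM }

    // Parses lines of the form "<char> => <TYPE>", skipping blank lines and '#' comments.
    static Map<Character, CharType> parse(List<String> lines) {
        Map<Character, CharType> table = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // comment or blank line
            }
            int arrow = line.indexOf("=>");
            if (arrow < 0) {
                throw new IllegalArgumentException("Missing '=>' in: " + raw);
            }
            String lhs = line.substring(0, arrow).trim();
            String rhs = line.substring(arrow + 2).trim();
            // valueOf throws IllegalArgumentException for an unknown type name.
            table.put(unescape(lhs), CharType.valueOf(rhs));
        }
        return table;
    }

    // Accepts a single literal character or a backslash-u escape with four hex digits.
    static char unescape(String s) {
        if (s.length() == 1) {
            return s.charAt(0);
        }
        if (s.length() == 6 && s.startsWith("\\u")) {
            return (char) Integer.parseInt(s.substring(2), 16);
        }
        throw new IllegalArgumentException("Cannot read character: " + s);
    }

    public static void main(String[] args) {
        List<String> config = List.of(
            "# financial data example from the issue description",
            "$ => DIGIT",
            "% => DIGIT",
            ". => DIGIT",
            "\\u002C => DIGIT");
        System.out.println(parse(config));
    }
}
```

Characters without an entry are simply absent from the map; per the description, their
type would still be computed from Unicode character properties.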

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org