lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Created: (SOLR-2059) Allow customizing how WordDelimiterFilter tokenizes text.
Date Thu, 19 Aug 2010 13:50:16 GMT
Allow customizing how WordDelimiterFilter tokenizes text.
---------------------------------------------------------

                 Key: SOLR-2059
                 URL: https://issues.apache.org/jira/browse/SOLR-2059
             Project: Solr
          Issue Type: New Feature
          Components: Schema and Analysis
            Reporter: Robert Muir
            Priority: Minor
             Fix For: 3.1, 4.0
         Attachments: SOLR-2059.patch

By default, WordDelimiterFilter assigns 'types' to each character (computed from Unicode character properties).
Based on these types and the options provided, it splits and concatenates text.

In some circumstances, you might need to tweak this behavior.
It seems the filter already had this in mind, since you can pass in a custom byte[] type table,
but it's not exposed in the factory.

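For context, the table in question is just a byte[] indexed by character, where each entry is a type bitmask. Here is a minimal, self-contained sketch of building one; the constant values and the defaultType helper are illustrative placeholders, not the filter's actual internals:
{noformat}
// Illustrative sketch only: a character-type table is a byte[] indexed by
// char, with one bit per type. Constant values here are placeholders and
// may differ from WordDelimiterFilter's real ones.
public class TypeTableSketch {
  static final byte LOWER = 0x01;
  static final byte UPPER = 0x02;
  static final byte DIGIT = 0x04;
  static final byte SUBWORD_DELIM = 0x08;
  static final byte ALPHA = (byte) (LOWER | UPPER);

  // Default entry computed from Unicode character properties, mirroring
  // what the filter does for characters without an explicit mapping.
  static byte defaultType(char ch) {
    if (Character.isLowerCase(ch)) return LOWER;
    if (Character.isUpperCase(ch)) return UPPER;
    if (Character.isLetter(ch))    return ALPHA;
    if (Character.isDigit(ch))     return DIGIT;
    return SUBWORD_DELIM;
  }

  public static void main(String[] args) {
    // Build a table covering the BMP, then override the characters we want
    // treated as digits (the same overrides as the proposed config file).
    byte[] typeTable = new byte[65536];
    for (int c = 0; c < typeTable.length; c++) {
      typeTable[c] = defaultType((char) c);
    }
    for (char c : new char[] {'$', '%', '.', ','}) {
      typeTable[c] = DIGIT;
    }
    // A byte[] like this is what the filter's existing type-table
    // constructor argument accepts; the factory just doesn't let you
    // supply one yet.
    System.out.println("type of '$': " + typeTable['$']);
  }
}
{noformat}
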
I think you should be able to customize the defaults with a configuration file:
{noformat}
# A customized type mapping for WordDelimiterFilterFactory
# the allowable types are: LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM
# 
# the default for any character without a mapping is always computed from 
# Unicode character properties

# Map the '$', '%', '.', and ',' characters to DIGIT
# This might be useful for financial data.
$ => DIGIT
% => DIGIT
. => DIGIT
\u002C => DIGIT
{noformat}
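
The factory could then load this mapping file via a new attribute in schema.xml. A rough sketch of what that might look like, with the attribute name ("types") and the filename chosen purely for illustration:
{noformat}
<fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- "types" is a hypothetical attribute pointing at the mapping file above -->
    <filter class="solr.WordDelimiterFilterFactory"
            types="wdfftypes.txt"
            generateWordParts="1" generateNumberParts="1" catenateNumbers="1"/>
  </analyzer>
</fieldType>
{noformat}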


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



