lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (SOLR-2059) Allow customizing how WordDelimiterFilter tokenizes text.
Date Wed, 25 Aug 2010 19:27:17 GMT


Robert Muir commented on SOLR-2059:

Hi Peter:

That's a great example. My own use case was actually not the one in the example either; I was just
trying to give a good general example.

What do you think of the file format? Is it OK for describing these categories?
The format/parser is simply borrowed from MappingCharFilterFactory; it seemed unambiguous
and is already in use.

As far as applying the patch, you need to apply it to,

This is because it has to modify a file in modules, too.

> Allow customizing how WordDelimiterFilter tokenizes text.
> ---------------------------------------------------------
>                 Key: SOLR-2059
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1, 4.0
>         Attachments: SOLR-2059.patch
> By default, WordDelimiterFilter assigns 'types' to each character (computed from Unicode character properties).
> Based on these types and the options provided, it splits and concatenates text.
> In some circumstances, you might need to tweak the behavior of how this works.
> It seems the filter already had this in mind, since you can pass in a custom byte[] type
> But it's not exposed in the factory.
> I think you should be able to customize the defaults with a configuration file:
> {noformat}
> # A customized type mapping for WordDelimiterFilterFactory
> # the allowable types are: LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM
> # 
> # the default for any character without a mapping is always computed from 
> # Unicode character properties
> # Map the $, %, '.', and ',' characters to DIGIT 
> # This might be useful for financial data.
> $ => DIGIT
> % => DIGIT
> . => DIGIT
> \u002C => DIGIT
> {noformat}
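
To make the proposed file format concrete, here is a minimal sketch of how such a type-mapping file could be read. It is not the code from SOLR-2059.patch: the class name TypeMappingSketch, its methods, and the numeric type codes are assumptions made for illustration (the codes are only assumed to mirror the constants in WordDelimiterFilter). It skips blank lines and '#' comments, accepts rules of the form "char => TYPE", and understands \uXXXX escapes such as \u002C.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; names and type codes are illustrative assumptions.
class TypeMappingSketch {
    // Assumed to mirror the type constants used by WordDelimiterFilter.
    static final byte LOWER = 0x01;
    static final byte UPPER = 0x02;
    static final byte ALPHA = 0x03;
    static final byte DIGIT = 0x04;
    static final byte ALPHANUM = 0x07;
    static final byte SUBWORD_DELIM = 0x08;

    // One rule per line: <char or \\uXXXX escape> => <TYPE>
    private static final Pattern RULE = Pattern.compile("(\\S+)\\s*=>\\s*(\\S+)");

    // Parses the lines of a mapping file into a character -> type-code map.
    static Map<Character, Byte> parse(List<String> lines) {
        Map<Character, Byte> types = new HashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank line or comment
            }
            Matcher m = RULE.matcher(line);
            if (!m.matches()) {
                throw new IllegalArgumentException("Invalid mapping rule: " + raw);
            }
            types.put(parseChar(m.group(1)), parseType(m.group(2)));
        }
        return types;
    }

    // Accepts a single literal character or a \\uXXXX escape.
    private static char parseChar(String s) {
        if (s.length() == 1) {
            return s.charAt(0);
        }
        if (s.length() == 6 && s.startsWith("\\u")) {
            return (char) Integer.parseInt(s.substring(2), 16);
        }
        throw new IllegalArgumentException("Expected one character or \\uXXXX: " + s);
    }

    // Maps a type name from the file to its numeric code.
    private static byte parseType(String name) {
        switch (name) {
            case "LOWER": return LOWER;
            case "UPPER": return UPPER;
            case "ALPHA": return ALPHA;
            case "DIGIT": return DIGIT;
            case "ALPHANUM": return ALPHANUM;
            case "SUBWORD_DELIM": return SUBWORD_DELIM;
            default: throw new IllegalArgumentException("Unknown type: " + name);
        }
    }

    public static void main(String[] args) {
        List<String> file = Arrays.asList(
            "# Map the $, %, '.', and ',' characters to DIGIT",
            "$ => DIGIT",
            "% => DIGIT",
            ". => DIGIT",
            "\\u002C => DIGIT");
        System.out.println(parse(file));
    }
}
```

Any character absent from the resulting map would keep its default type computed from Unicode character properties, as the comments in the sample file describe.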
