lucene-solr-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-1710) convert worddelimiterfilter to new tokenstream API
Date Sat, 09 Jan 2010 00:40:54 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798268#action_12798268 ]

Robert Muir commented on SOLR-1710:
-----------------------------------

Chris, yeah, it's supposed to be similar to http://java.sun.com/j2se/1.4.2/docs/api/java/text/BreakIterator.html#next%28%29
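
For readers unfamiliar with the BreakIterator.next() contract referenced above, here is a minimal sketch using the JDK's own java.text.BreakIterator. The class name BreakIteratorSketch and the helper words() are made up for illustration and are not part of the patch; the filtering of non-alphanumeric spans is likewise just an assumption to keep the output readable.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: this class is hypothetical and not part of SOLR-1710.
class BreakIteratorSketch {

    // Walks the boundaries reported by next() and collects the spans
    // between them, skipping spans with no letter or digit (e.g. spaces).
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        // next() returns the following boundary, or DONE when exhausted.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String span = text.substring(start, end);
            if (span.chars().anyMatch(Character::isLetterOrDigit)) {
                out.add(span);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("hello world")); // prints [hello, world]
    }
}
```

The point is only the iteration pattern: each call to next() advances to the following boundary, and the caller reads the span between the previous boundary and the new one.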

I started by mimicking this API somewhat; I guess a future improvement would be if somehow
this truly was a real BreakIterator.
Then, say, you could create a RuleBasedBreakIterator or DictionaryBasedBreakIterator (which
are fast compiled DFAs) and customize how words are delimited.
Currently, you can only do this by customizing the charTypeTable, which cannot take any
context into account, so it's rather limited.
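
To make the "no context" limitation concrete, here is a rough sketch of a table-driven subword iterator with a BreakIterator-like next(). It is not the WordDelimiterIterator from the patch: the class name, the four type constants, the ASCII-only table, and the break rule are all assumptions chosen for illustration. Each character is classified on its own through the table, so the only way to customize splitting is to change a character's type everywhere it occurs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and rules are assumptions, not the patch's code.
class SubwordIteratorSketch {
    static final int DONE = -1;
    static final byte LOWER = 1, UPPER = 2, DIGIT = 4, DELIM = 8;

    // One type per ASCII character, looked up with no surrounding context.
    static final byte[] TYPE_TABLE = new byte[128];
    static {
        for (int c = 0; c < 128; c++) {
            if (Character.isLowerCase(c)) TYPE_TABLE[c] = LOWER;
            else if (Character.isUpperCase(c)) TYPE_TABLE[c] = UPPER;
            else if (Character.isDigit(c)) TYPE_TABLE[c] = DIGIT;
            else TYPE_TABLE[c] = DELIM;
        }
    }

    private final char[] text;
    int current; // start of the current subword
    int end;     // end of the current subword (exclusive)

    SubwordIteratorSketch(String s) { text = s.toCharArray(); }

    // Assumption for brevity: anything outside ASCII counts as a lowercase letter.
    static int type(char c) { return c < 128 ? TYPE_TABLE[c] : LOWER; }

    // A break occurs on any type change except UPPER followed by LOWER ("Po").
    static boolean isBreak(int last, int cur) {
        if (last == cur) return false;
        return !(last == UPPER && cur == LOWER);
    }

    // Advances to the next subword; returns its end offset, or DONE.
    int next() {
        current = end;
        while (current < text.length && type(text[current]) == DELIM) current++;
        if (current >= text.length) { end = current; return DONE; }
        int last = type(text[current]);
        end = current + 1;
        while (end < text.length) {
            int t = type(text[end]);
            if (isBreak(last, t)) break;
            last = t;
            end++;
        }
        return end;
    }

    static List<String> split(String s) {
        SubwordIteratorSketch it = new SubwordIteratorSketch(s);
        List<String> out = new ArrayList<>();
        while (it.next() != DONE) out.add(s.substring(it.current, it.end));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("PowerShot500-wi")); // prints [Power, Shot, 500, wi]
    }
}
```

A rule- or dictionary-based iterator could instead decide a boundary from the characters around it, which a per-character table like this cannot express.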

All of the above is really just theoretical and not anything we should worry about. For practical
purposes I mimicked the BreakIterator API (but diverged somewhat), just because I am used to working
with it and found it was one way to separate a lot of the logic.


> convert worddelimiterfilter to new tokenstream API
> --------------------------------------------------
>
>                 Key: SOLR-1710
>                 URL: https://issues.apache.org/jira/browse/SOLR-1710
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>            Reporter: Robert Muir
>         Attachments: SOLR-1710.patch, SOLR-1710.patch
>
>
> This one was a doozy, attached is a patch to convert it to the new tokenstream API.
> Some of the logic was split into WordDelimiterIterator (exposes a BreakIterator-like api for iterating subwords)
> the filter is much more efficient now, no cloning.
> before applying the patch, copy the existing WordDelimiterFilter to OriginalWordDelimiterFilter
> the patch includes a testcase (TestWordDelimiterBWComp) which generates random strings from various subword combinations.
> For each random string, it compares output against the existing WordDelimiterFilter for all 512 combinations of boolean parameters.
> NOTE: due to bugs found (SOLR-1706), this currently only tests 256 of these combinations. The bugs discovered in SOLR-1706 are fixed here.
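
As a side note on the "512 combinations" figure: 512 is 2^9, i.e. every on/off setting of nine boolean parameters. The sketch below shows one common way to enumerate such combinations with a bitmask. It is not taken from TestWordDelimiterBWComp; the class and method names are hypothetical, and the flags are left unnamed on purpose.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the test case from the patch.
class FlagCombinationsSketch {

    // Returns all 2^n assignments of n boolean flags; bit i of the
    // loop counter decides the value of flag i.
    static List<boolean[]> combinations(int n) {
        List<boolean[]> out = new ArrayList<>();
        for (int mask = 0; mask < (1 << n); mask++) {
            boolean[] flags = new boolean[n];
            for (int bit = 0; bit < n; bit++) {
                flags[bit] = (mask & (1 << bit)) != 0;
            }
            out.add(flags);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(combinations(9).size()); // prints 512
    }
}
```

Holding any one flag fixed would leave 2^8 = 256 assignments, though the message does not say how the test arrives at its 256.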

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

