lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Shawn Heisey (JIRA)" <>
Subject [jira] [Commented] (LUCENE-4578) ICUTokenizer - per-script RBBI customization
Date Wed, 28 Nov 2012 21:57:59 GMT


Shawn Heisey commented on LUCENE-4578:

The IRC conversation:

13:51 < sarowe> btw, icutokenizer is customizable - you can find the default
                grammar at lucene/analysis/icu/src/data/uax29/Default.rbbi
13:51 < sarowe> the dafault grammar is used when the text is not thai, lao, etc.
13:51 < elyograg> i'm using solr, is any of that exposed there?
13:52 < sarowe> you'd have to rebuild Default.brk and then repackage lucene-icu
                jar, then include your rebuilt jar in solr classpath
13:52 < sarowe> that's if you don't want to rename it
13:53 < sarowe> more better would be to clone ICUTokenizer sources, tweak
                Default.rbbi, build and include a diff jar with Solr classpath
13:53 < hoss> sarowe: i don't really know anythign about ICU< but are these
              ".brk" files loaded at run time? could we make the factory take
              one as an arg?
13:53 < elyograg> I'll have to look into that.  I'd probably make a new class.
13:53 < sarowe> no
13:53 < sarowe> :)
13:53 < sarowe> I meant no to including the .brk file at runtime
13:54 < sarowe> they get compiled into some binary blob
13:54 < sarowe> that gets used by ICU
13:55 < hoss> ok .. if users can create their own *.rbbi file, and then use
              that to generate a *.brk file, and then use that to generate a
              binary blob file, could we make the ICU Factory configurable
              about which blob file to load?
13:55 < sarowe> 'ant genrbbi' under lucene/analysis/icu/ does the binary blob
                building, using the ICU RBBI rule compiler
13:55 < sarowe> hoss: hmm, potentially, yes
13:56 < sarowe> the .brk file is the binary blob, BTW
13:57 < hoss> sarowe: how long does the binary blob conversaion take?  any
              reason it couldn't be done on startup?
13:57 < hoss> (since it's aparently java code)
13:58 < sarowe> hoss: I think it's pretty fast - single digit seconds
13:59 < hoss> so maybe it could all be done by the factory?
13:59 < hoss> i have no idea if that makes sense .. just spit-balling
13:59 < sarowe> not sure how valuable runtime generation would be though
13:59 < hoss> sure ... i'm just asking because anytime the answer is "you need
              to rebuild a stock jar" i start wondering if/how we can make that
              more configurable
13:59 < sarowe> I guess one benefit is that although people would have to know
                the RBBI syntax, they wouldn't have to know how to drive the
                Lucene build
14:00 < hoss> right ... but if it's better/easier to just tell them "here's
              this little java tool, run it" that's fine too
14:00 < hoss> i'm just asking about the flexibility
14:00 < sarowe> right, the current build doesn't have any way to substitute
14:00 < hoss> if we can eliminate a tool and make it all magic .. people like
14:01 < sarowe> I worry about syntax errors in a runtime Solr though
14:01 < sarowe> this likely should be offline
14:01 < sarowe> I mean the .brk compilation
14:02 < hoss> fair enough ... i don't know enough about these files to know how
              likely it is people will fuck up on editing them
14:02 < sarowe> it's a syntax that only exists for that tool AFAIK
14:02 < hoss> like i said: if there's a tool people can use to generate the
              blob files, and then point at them from factory config, thta
              alone sounds like a huge win over "recompile lucene"
14:02 < sarowe> and misplaced curly brackets, e.g., is common enough in any
14:03 < sarowe> I agree
14:04 < sarowe> the JFlex-based grammars, by contrast, are much more tightly
                bound with the java - recompilation is unavoidable
14:05 < hoss> i know ... it sucks ... everytime i try to think about how to
              make a really nicely confiugrable parser the whole jflex/javacc
              compile to javacode thing depressess the hell out of me
14:06 < sarowe> I think the best we can do there is introduce configuration
                knobs into the generated java
14:07 < sarowe> for example, the tokenize-punctuation proposal could be
                implemented as a boolean option on the parser(s)
14:08 < sarowe> so that by default the current behavior happens, but if people
                ask for it, they could get punctuation tokens
14:08 < sarowe> (actually, whitespace should be tokenized too, not just
                punctuation - I mean all the stuff that currently gets tossed)
14:10 < hoss> yeah ... just like autoGeneratePhraseQueries ... i'd really like
              to make whole syntax options more conifgurable though (eg:
              "should quotes be special syntax?", "what should ':' do?", all
              the things edismax currently does i na really hacky way, etc...)
14:10 < hoss> that kind of stuff is nearly impossible unless you can built up
              the grammer from sub-pieces at runtime
14:21 < rmuir> sarowe: sorry just catching up... i dont understand this talk of
               binary blogs or whatever
14:22 < rmuir> you just need this factory to support
               ICUTOkenizer(TokenizerConfig) ctor ?
14:22 < rmuir> there is no need for binary blobs or jar rebuilding
14:22 < rmuir> instead some just way to declare that ICUTokenizerConfig via xml.
14:23 < sarowe> rmuir: I'm looking at DefaultICUTokenizerConfig right now
14:23 < sarowe> that would be a pretty complex XML format
14:23 < rmuir> maybe...
14:24 < rmuir> maybe it should just be some way to override the default rather
               than redefine completely
14:24 < sarowe> and if people want to change rules, they still need to involve
                the RBBI compiler
14:24 < rmuir> no they dont
14:24 < rmuir>
14:24 < rmuir> the factory just calls that
14:25 < sarowe> that's not how it's currently done for icutokenizer though
14:25 < sarowe> in lucene I mean
14:25 < rmuir> yeah it is
14:25 < rmuir> look at icutokenizerconfig
14:25 < rmuir>   /** Return a breakiterator capable of processing a given
               script. */
14:25 < rmuir>   public abstract BreakIterator getBreakIterator(int script);
14:25 < rmuir> i dont understand whats confusing about this :)
14:25 < rmuir> if you make your own config, you can return one you made with
14:26 < rmuir> the *default* implementation is just optimized via these .brks
               so it loads faster.
14:26 < sarowe> aha, so as i said, the current impl doesn't do this
14:26 < rmuir> you said the icutokenizer
14:26 < rmuir> it has no impl
14:26 < rmuir> it has a default config really
14:26 < rmuir> the tokenizer doesnt know about any of this
14:26 < sarowe> ok, I get it
14:27 < rmuir> anyway there is a TODO in the factory about this
14:27 < rmuir> would be nice if someone has ideas on what could be a useful
14:28 < sarowe> cool
14:28 < rmuir> i dont think it needs to support all the possible customizations
               (e.g. typing things differently and so on)
14:28 < rmuir> just something simple
14:28 < rmuir> maybe some way to "@override" the default one
14:28 < rmuir> like you give it a list of script/textfile pairs
14:32 < rmuir> if you use that RBBI(String) ctor it fully syntax checks and
               stuff also
14:32 < sarowe> cool
14:32 < rmuir> actually if you look at the compiler i have to redundantly call
               that just for that reason :)
14:32 < rmuir> the general approach is you do this once, and then .clone()
               whatever you make
14:33 < rmuir> like collators
14:35 < sarowe> what's redundant?  I don't see that
14:36 < rmuir> in the compiler?
14:36 < rmuir>       /*
14:36 < rmuir>        * if there is a syntax error, compileRules() may succeed.
               the way to
14:36 < rmuir>        * check is to try to instantiate from the string.
               additionally if the
14:36 < rmuir>        * rules are invalid, you can get a useful syntax error.
14:36 < rmuir>        */
14:36 < rmuir>       try {
14:36 < rmuir>         new RuleBasedBreakIterator(rules);
14:36 < rmuir>       } catch (IllegalArgumentException e) {
14:36 < rmuir>         /*
14:36 < rmuir>          * do this intentionally, so you don't get a massive
               stack trace
14:36 < rmuir>          * instead, get a useful syntax error!
14:36 < rmuir>          */
14:36 < rmuir>         System.err.println(e.getMessage());
14:36 < rmuir>         System.exit(1);
14:36 < rmuir>       }
14:36 < sarowe> oh!
14:36 < rmuir> thats why i said its actually the opposite of what you might
14:36 < sarowe> I was missing the fact that the compiler is in lucene and not
                in ICU
14:37 < rmuir> compileRules misses the checking
14:37 < rmuir> but the string-ctor checks
14:37 < rmuir> (at least at the time i wrote this thing thats how it worked)
14:37 < sarowe> cool
14:38 < rmuir> anyway, would be cool if we could enhance the factory
14:38 < rmuir> as the rules are documented and what not pretty well
14:38 < sarowe> agreed
14:39 < sarowe> and if people came up with useful alternatives, those could be
                bundled and chosen from
14:39 < sarowe> i mean useful default.rbbi alternatives
14:40 < sarowe> but this could also be used to cover other scripts not
                currently covered
14:40 < rmuir> i think its actually less useful to customize that?
14:40 < sarowe> oh?
14:40 < rmuir> i think its more useful to say 'i want to change latin script
               without having to worry about fucking up chinese'
14:40 < sarowe> what do you think people will want to customize
14:40 < rmuir> like
14:40 < rmuir> it doesnt give you any additional 'power'
14:40 < rmuir> it just makes writing the rules easier?
14:41 < rmuir> so thats why i suggested script/text file pairs
14:41 < sarowe> aha, so you're advocating leaving the default as-is
14:41 < sarowe> and interposing script-specific additions
14:41 < rmuir> well ive customized it this way before, and i'm describing how i
               did it
14:41 < sarowe> :) cool, ok
14:41 < rmuir> because the idea is: i dont want to have to deal with
               researching the impacts on tons of other languages or dealing
               with the categories and shit
14:41 < elyograg> my customizations would involve stopping it from doing
                  something that it currently does -- tokenize on punctuation.
14:42 < rmuir> right, but you shouldnt change the default to accomplish that
14:42 < rmuir> because such a thing would make horribly long strings if you got
               some tibetan text
14:43 < rmuir> you should just just say 'i want to do this for latin script'
14:43 < elyograg> much of the discussion here is going right over my head, but
                  you're right, I would only want to make this happen for
                  latin.  punctuation for other languages is outside my
                  understanding. :)
14:44 < rmuir> right so thats the idea of a script/text file pair
14:44 < rmuir> you dont have to deal with that stuff. its not more expressive,
               just makes it simpler
14:46 < rmuir> what you want to do is easy if the factory adds supprot for that
14:46 < rmuir> its like the hebrew example, it has two customizations (just a
               set add/subtraction if i remember)
14:46 < rmuir> so it wont split on " and '
14:49 < sarowe> so elyograg's customization could be a copy/paste of the
                Default.rbbi, and just add the punctuation he wants to keep to
                the $MidLetter definition
14:50 < rmuir> yeah. like if we do this there should be some example
               default.rbbi sitting there so this is easy to do
14:50 < sarowe> (assuming numbers aren't involved)
14:51 < rmuir> i personaly try to prefer the tweaking the way you suggest, as
               just customizing the grammar that way.
14:51 < rmuir> but you can also do something simpler too (e.g. something that
               looks more like a regexp, since it only worries about latin...)
14:52 < sarowe> a range of examples showing both these strategies would be good
14:52 < rmuir> look at the provided ones :)
14:52 < sarowe> you mean Hebrew, Khmer, Lao, and Myanmar I assume
14:52 < rmuir> yes
14:53 < rmuir> i guess those arent great for examples for users :)
14:53 < sarowe> right, but naive users won't easily understand those
14:53 < sarowe> right
> ICUTokenizer - per-script RBBI customization
> --------------------------------------------
>                 Key: LUCENE-4578
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 4.0
>            Reporter: Shawn Heisey
>             Fix For: 4.1, 5.0
> Initially this started out as an idea for a configuration knob on ICUTokenizer that would
allow me to tell it not to tokenize on punctuation.  Through IRC discussion on #lucene, it
sorta ballooned.  The committers had a long discussion about it that I don't really understand,
so I'll be including it in the comments.
> I am a Solr user, so I would also need the ability to access the configuration from there,
likely either in schema.xml or solrconfig.xml.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message