lucene-dev mailing list archives

From "Shawn Heisey (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4578) ICUTokenizer - per-script RBBI customization
Date Wed, 28 Nov 2012 21:57:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505945#comment-13505945 ]

Shawn Heisey commented on LUCENE-4578:
--------------------------------------

The IRC conversation:

{code}
13:51 < sarowe> btw, icutokenizer is customizable - you can find the default
                grammar at lucene/analysis/icu/src/data/uax29/Default.rbbi
13:51 < sarowe> the default grammar is used when the text is not thai, lao, etc.
13:51 < elyograg> i'm using solr, is any of that exposed there?
13:52 < sarowe> you'd have to rebuild Default.brk and then repackage lucene-icu
                jar, then include your rebuilt jar in solr classpath
13:52 < sarowe> that's if you don't want to rename it
13:53 < sarowe> more better would be to clone ICUTokenizer sources, tweak
                Default.rbbi, build and include a diff jar with Solr classpath
13:53 < hoss> sarowe: i don't really know anything about ICU, but are these
              ".brk" files loaded at run time? could we make the factory take
              one as an arg?
13:53 < elyograg> I'll have to look into that.  I'd probably make a new class.
13:53 < sarowe> no
13:53 < sarowe> :)
13:53 < sarowe> I meant no to including the .brk file at runtime
13:54 < sarowe> they get compiled into some binary blob
13:54 < sarowe> that gets used by ICU
13:55 < hoss> ok .. if users can create their own *.rbbi file, and then use
              that to generate a *.brk file, and then use that to generate a
              binary blob file, could we make the ICU Factory configurable
              about which blob file to load?
13:55 < sarowe> 'ant genrbbi' under lucene/analysis/icu/ does the binary blob
                building, using the ICU RBBI rule compiler
13:55 < sarowe> hoss: hmm, potentially, yes
13:56 < sarowe> the .brk file is the binary blob, BTW
13:57 < hoss> sarowe: how long does the binary blob conversion take?  any
              reason it couldn't be done on startup?
13:57 < hoss> (since it's apparently java code)
13:58 < sarowe> hoss: I think it's pretty fast - single digit seconds
13:59 < hoss> so maybe it could all be done by the factory?
13:59 < hoss> i have no idea if that makes sense .. just spit-balling
13:59 < sarowe> not sure how valuable runtime generation would be though
13:59 < hoss> sure ... i'm just asking because anytime the answer is "you need
              to rebuild a stock jar" i start wondering if/how we can make that
              more configurable
13:59 < sarowe> I guess one benefit is that although people would have to know
                the RBBI syntax, they wouldn't have to know how to drive the
                Lucene build
14:00 < hoss> right ... but if it's better/easier to just tell them "here's
              this little java tool, run it" that's fine too
14:00 < hoss> i'm just asking about the flexibility
14:00 < sarowe> right, the current build doesn't have any way to substitute
                alternatives
14:00 < hoss> if we can eliminate a tool and make it all magic .. people like
              magic
14:01 < sarowe> I worry about syntax errors in a runtime Solr though
14:01 < sarowe> this likely should be offline
14:01 < sarowe> I mean the .brk compilation
14:02 < hoss> fair enough ... i don't know enough about these files to know how
              likely it is people will fuck up on editing them
14:02 < sarowe> it's a syntax that only exists for that tool AFAIK
14:02 < hoss> like i said: if there's a tool people can use to generate the
              blob files, and then point at them from factory config, that
              alone sounds like a huge win over "recompile lucene"
14:02 < sarowe> and misplaced curly brackets, e.g., is common enough in any
                language
14:03 < sarowe> I agree
14:04 < sarowe> the JFlex-based grammars, by contrast, are much more tightly
                bound with the java - recompilation is unavoidable
14:05 < hoss> i know ... it sucks ... every time i try to think about how to
              make a really nicely configurable parser the whole jflex/javacc
              compile to javacode thing depresses the hell out of me
14:06 < sarowe> I think the best we can do there is introduce configuration
                knobs into the generated java
14:07 < sarowe> for example, the tokenize-punctuation proposal could be
                implemented as a boolean option on the parser(s)
14:08 < sarowe> so that by default the current behavior happens, but if people
                ask for it, they could get punctuation tokens
14:08 < sarowe> (actually, whitespace should be tokenized too, not just
                punctuation - I mean all the stuff that currently gets tossed)
14:10 < hoss> yeah ... just like autoGeneratePhraseQueries ... i'd really like
              to make whole syntax options more configurable though (eg:
              "should quotes be special syntax?", "what should ':' do?", all
              the things edismax currently does in a really hacky way, etc...)
14:10 < hoss> that kind of stuff is nearly impossible unless you can build up
              the grammar from sub-pieces at runtime
14:21 < rmuir> sarowe: sorry just catching up... i dont understand this talk of
               binary blobs or whatever
14:22 < rmuir> you just need this factory to support
               ICUTokenizer(TokenizerConfig) ctor ?
14:22 < rmuir> there is no need for binary blobs or jar rebuilding
14:22 < rmuir> instead just some way to declare that ICUTokenizerConfig via xml.
14:23 < sarowe> rmuir: I'm looking at DefaultICUTokenizerConfig right now
14:23 < sarowe> that would be a pretty complex XML format
14:23 < rmuir> maybe...
14:24 < rmuir> maybe it should just be some way to override the default rather
               than redefine completely
14:24 < sarowe> and if people want to change rules, they still need to involve
                the RBBI compiler
14:24 < rmuir> no they dont
14:24 < rmuir> http://icu-project.org/apiref/icu4j/com/ibm/icu/text/RuleBasedBreakIterator.html#RuleBasedBreakIterator%28java.lang.String%29
14:24 < rmuir> the factory just calls that
14:25 < sarowe> that's not how it's currently done for icutokenizer though
14:25 < sarowe> in lucene I mean
14:25 < rmuir> yeah it is
14:25 < rmuir> look at icutokenizerconfig
14:25 < rmuir>   /** Return a breakiterator capable of processing a given
               script. */
14:25 < rmuir>   public abstract BreakIterator getBreakIterator(int script);
14:25 < rmuir> i dont understand whats confusing about this :)
14:25 < rmuir> if you make your own config, you can return one you made with
               RBBI(String)
14:26 < rmuir> the *default* implementation is just optimized via these .brks
               so it loads faster.
14:26 < sarowe> aha, so as i said, the current impl doesn't do this
14:26 < rmuir> you said the icutokenizer
14:26 < rmuir> it has no impl
14:26 < rmuir> it has a default config really
14:26 < rmuir> the tokenizer doesnt know about any of this
14:26 < sarowe> ok, I get it
14:27 < rmuir> anyway there is a TODO in the factory about this
14:27 < rmuir> would be nice if someone has ideas on what could be a useful
               format
14:28 < sarowe> cool
14:28 < rmuir> i dont think it needs to support all the possible customizations
               (e.g. typing things differently and so on)
14:28 < rmuir> just something simple
14:28 < rmuir> maybe some way to "@override" the default one
14:28 < rmuir> like you give it a list of script/textfile pairs
14:32 < rmuir> if you use that RBBI(String) ctor it fully syntax checks and
               stuff also
14:32 < sarowe> cool
14:32 < rmuir> actually if you look at the compiler i have to redundantly call
               that just for that reason :)
14:32 < rmuir> the general approach is you do this once, and then .clone()
               whatever you make
14:33 < rmuir> like collators
14:35 < sarowe> what's redundant?  I don't see that
14:36 < rmuir> in the compiler?
14:36 < rmuir>       /*
14:36 < rmuir>        * if there is a syntax error, compileRules() may succeed.
               the way to
14:36 < rmuir>        * check is to try to instantiate from the string.
               additionally if the
14:36 < rmuir>        * rules are invalid, you can get a useful syntax error.
14:36 < rmuir>        */
14:36 < rmuir>       try {
14:36 < rmuir>         new RuleBasedBreakIterator(rules);
14:36 < rmuir>       } catch (IllegalArgumentException e) {
14:36 < rmuir>         /*
14:36 < rmuir>          * do this intentionally, so you don't get a massive
               stack trace
14:36 < rmuir>          * instead, get a useful syntax error!
14:36 < rmuir>          */
14:36 < rmuir>         System.err.println(e.getMessage());
14:36 < rmuir>         System.exit(1);
14:36 < rmuir>       }
14:36 < sarowe> oh!
14:36 < rmuir> thats why i said its actually the opposite of what you might
               think
14:36 < sarowe> I was missing the fact that the compiler is in lucene and not
                in ICU
14:37 < rmuir> compileRules misses the checking
14:37 < rmuir> but the string-ctor checks
14:37 < rmuir> (at least at the time i wrote this thing thats how it worked)
14:37 < sarowe> cool
14:38 < rmuir> anyway, would be cool if we could enhance the factory
14:38 < rmuir> as the rules are documented and what not pretty well
14:38 < sarowe> agreed
14:39 < sarowe> and if people came up with useful alternatives, those could be
                bundled and chosen from
14:39 < sarowe> i mean useful default.rbbi alternatives
14:40 < sarowe> but this could also be used to cover other scripts not
                currently covered
14:40 < rmuir> i think its actually less useful to customize that?
14:40 < sarowe> oh?
14:40 < rmuir> i think its more useful to say 'i want to change latin script
               without having to worry about fucking up chinese'
14:40 < sarowe> what do you think people will want to customize
14:40 < rmuir> like
14:40 < rmuir> it doesnt give you any additional 'power'
14:40 < rmuir> it just makes writing the rules easier?
14:41 < rmuir> so thats why i suggested script/text file pairs
14:41 < sarowe> aha, so you're advocating leaving the default as-is
14:41 < sarowe> and interposing script-specific additions
14:41 < rmuir> well ive customized it this way before, and i'm describing how i
               did it
14:41 < sarowe> :) cool, ok
14:41 < rmuir> because the idea is: i dont want to have to deal with
               researching the impacts on tons of other languages or dealing
               with the categories and shit
14:41 < elyograg> my customizations would involve stopping it from doing
                  something that it currently does -- tokenize on punctuation.
14:42 < rmuir> right, but you shouldnt change the default to accomplish that
14:42 < rmuir> because such a thing would make horribly long strings if you got
               some tibetan text
14:43 < rmuir> you should just say 'i want to do this for latin script'
14:43 < elyograg> much of the discussion here is going right over my head, but
                  you're right, I would only want to make this happen for
                  latin.  punctuation for other languages is outside my
                  understanding. :)
14:44 < rmuir> right so thats the idea of a script/text file pair
14:44 < rmuir> you dont have to deal with that stuff. its not more expressive,
               just makes it simpler
14:46 < rmuir> what you want to do is easy if the factory adds support for that
14:46 < rmuir> its like the hebrew example, it has two customizations (just a
               set add/subtraction if i remember)
14:46 < rmuir> so it wont split on " and '
14:49 < sarowe> so elyograg's customization could be a copy/paste of the
                Default.rbbi, and just add the punctuation he wants to keep to
                the $MidLetter definition
14:50 < rmuir> yeah. like if we do this there should be some example
               default.rbbi sitting there so this is easy to do
14:50 < sarowe> (assuming numbers aren't involved)
14:51 < rmuir> i personally try to prefer the tweaking the way you suggest, as
               just customizing the grammar that way.
14:51 < rmuir> but you can also do something simpler too (e.g. something that
               looks more like a regexp, since it only worries about latin...)
14:52 < sarowe> a range of examples showing both these strategies would be good
14:52 < rmuir> look at the provided ones :)
14:52 < sarowe> you mean Hebrew, Khmer, Lao, and Myanmar I assume
14:52 < rmuir> yes
14:53 < rmuir> i guess those arent great for examples for users :)
14:53 < sarowe> right, but naive users won't easily understand those
14:53 < sarowe> right
{code}
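
The "script/text file pairs" idea from the log can be sketched roughly as below. This is only an illustration under stated assumptions, not the Lucene or ICU implementation: it swaps in the JDK's java.text.BreakIterator and a plain whitespace splitter in place of ICU's RuleBasedBreakIterator(String), so that it runs without ICU4J or Lucene on the classpath. The class PerScriptSegmenterDemo and its methods are hypothetical names. What it tries to show is the shape of the design: the default stays untouched, an override is registered for one script only (Latin here), and the default iterator is built once and cloned per use.

```java
import java.lang.Character.UnicodeScript;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: per-script overrides layered on an untouched default.
class PerScriptSegmenterDemo {

    // Built once; every use works on a clone ("do this once, then .clone()").
    private static final BreakIterator DEFAULT_PROTOTYPE =
            BreakIterator.getWordInstance(Locale.ROOT);

    // Hypothetical override table: script -> segmenter for that script only.
    private final Map<UnicodeScript, Function<String, List<String>>> overrides =
            new EnumMap<>(UnicodeScript.class);

    // Register a replacement segmenter for a single script.
    void override(UnicodeScript script, Function<String, List<String>> segmenter) {
        overrides.put(script, segmenter);
    }

    // Use the override for this script if one exists, else the default.
    List<String> tokenize(UnicodeScript script, String text) {
        return overrides
                .getOrDefault(script, PerScriptSegmenterDemo::defaultSegment)
                .apply(text);
    }

    // Default behavior: JDK word boundaries, dropping pieces that contain
    // no letter or digit (whitespace and punctuation are tossed).
    static List<String> defaultSegment(String text) {
        BreakIterator it = (BreakIterator) DEFAULT_PROTOTYPE.clone();
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Stand-in for a custom Latin rule set: split on whitespace only,
    // so punctuation stays attached to the surrounding token.
    static List<String> whitespaceSegment(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        PerScriptSegmenterDemo config = new PerScriptSegmenterDemo();
        System.out.println(config.tokenize(UnicodeScript.LATIN, "wi-fi C++ rocks"));
        // Override Latin only; every other script keeps the default.
        config.override(UnicodeScript.LATIN, PerScriptSegmenterDemo::whitespaceSegment);
        System.out.println(config.tokenize(UnicodeScript.LATIN, "wi-fi C++ rocks"));
    }
}
```

In the setup discussed in the log, the override for a script would presumably be a break iterator built from a rules text file via the RuleBasedBreakIterator(String) constructor and returned from a custom ICUTokenizerConfig.getBreakIterator(int script); the exact class and method signatures depend on the Lucene and ICU4J versions in use.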
                
> ICUTokenizer - per-script RBBI customization
> --------------------------------------------
>
>                 Key: LUCENE-4578
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4578
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 4.0
>            Reporter: Shawn Heisey
>             Fix For: 4.1, 5.0
>
>
> Initially this started out as an idea for a configuration knob on ICUTokenizer that would
> allow me to tell it not to tokenize on punctuation.  Through IRC discussion on #lucene, it
> sorta ballooned.  The committers had a long discussion about it that I don't really understand,
> so I'll be including it in the comments.
> I am a Solr user, so I would also need the ability to access the configuration from there,
> likely either in schema.xml or solrconfig.xml.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

