lucene-dev mailing list archives

From "Christian Moen (Commented) (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3935) Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
Date Thu, 29 Mar 2012 15:50:28 GMT


Christian Moen commented on LUCENE-3935:


Robert has done a great job making the binary version of {{matrix.def}} tiny with fancy encoding
of data.  Very impressive!

I've attached a patch and verified that its segmentation (surface forms only) matches exactly
that of the two-dimensional array version on approx. 100,000 Wikipedia articles, XML
markup and all, totaling 880MB of data.

Profiling tells me we get a 13% increase in performance on {{ConnectionCosts.get()}} after
the change.  The method is called very, very frequently during indexing, and its total CPU contribution
is ~7-8% _after the change_, so the net improvement here is not more than a couple of percent.
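
As a rough check on that estimate, the arithmetic can be sketched as below. This is only an illustration: the class and method names are hypothetical, and it assumes "13% increase in performance" means the method took 1.13 times as long before the change as it does after it.

```java
/** Hypothetical helper for estimating the overall gain from a per-method speedup. */
final class NetSpeedupEstimate {

    /**
     * @param shareAfter fraction of total runtime spent in the method AFTER the change (e.g. 0.075)
     * @param speedup    assumed ratio of the method's time before the change to its time after (e.g. 1.13)
     * @return fraction of the ORIGINAL total runtime saved by the change
     */
    static double savedFractionOfOriginalTotal(double shareAfter, double speedup) {
        // Normalize the total runtime after the change to 1.0.
        double methodAfter = shareAfter;
        double methodBefore = methodAfter * speedup;
        // Everything outside the method is assumed unchanged.
        double totalBefore = (1.0 - methodAfter) + methodBefore;
        return (methodBefore - methodAfter) / totalBefore;
    }
}
```

Under these assumptions, a post-change share of 7-8% and a 1.13x speedup give a saving of roughly 1% of the original runtime, consistent with "not more than a couple of percent".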

I was expecting more than a 13% increase in this method's performance after the change, but
this number looks correct to me.  Would be great to get your feedback if this is in line with
expectations, Dawid and Robert.

Do we still want to apply this?

> Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
> -------------------------------------------------------------------
>                 Key: LUCENE-3935
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.6, 4.0
>            Reporter: Christian Moen
>         Attachments: LUCENE-3935.patch
> I've been profiling Kuromoji, and not very surprisingly, the method
{{ConnectionCosts.get(int forwardId, int backwardId)}}, which looks up costs in the Viterbi,
is called many, many times and contributes more processing time than I had expected.
> This method is currently backed by a {{short[][]}}.  The data structure stored here
is a two-dimensional array with both dimensions fixed at 1316 elements.
 (The data is {{matrix.def}} in MeCab-IPADIC.)
> We can rewrite this to use a single one-dimensional array instead, and we will at least
save one bounds check and a pointer dereference, and we should also get much better cache utilization
since this structure is likely to be in very local CPU cache.
> I think this will be a nice optimization.  Working on it... 
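
The rewrite described in the issue can be illustrated with a minimal sketch. This is not the attached patch: the class name {{FlatConnectionCosts}}, the constructor, and the row-major layout with {{forwardId}} as the row index are all assumptions made for illustration.

```java
/**
 * Hypothetical sketch: a square cost matrix stored in one flat short[]
 * instead of a short[][]. Not the code from LUCENE-3935.patch.
 */
final class FlatConnectionCosts {
    private final int size;      // 1316 for MeCab-IPADIC, per the issue description
    private final short[] costs; // assumed row-major: costs[forwardId * size + backwardId]

    FlatConnectionCosts(short[][] matrix) {
        this.size = matrix.length;
        this.costs = new short[size * size];
        for (int i = 0; i < size; i++) {
            if (matrix[i].length != size) {
                throw new IllegalArgumentException("matrix must be square");
            }
            // Copy each row into its slot in the flat array.
            System.arraycopy(matrix[i], 0, costs, i * size, size);
        }
    }

    /** One array access and one bounds check, instead of two of each. */
    int get(int forwardId, int backwardId) {
        return costs[forwardId * size + backwardId];
    }
}
```

One trade-off of this layout: an out-of-range {{backwardId}} can silently land in a neighboring row rather than throwing {{ArrayIndexOutOfBoundsException}}, so callers need to pass valid ids.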
