lucene-dev mailing list archives

From eks dev <eks...@yahoo.co.uk>
Subject Re: [jira] Commented: (LUCENE-871) ISOLatin1AccentFilter a bit slow
Date Tue, 21 Aug 2007 09:49:03 GMT
Just for completeness, here is one more approach (I think the speed-up to expect is, at
best, barely measurable in the big picture).

I have had very good experience with a simple Bloom filter that "approximately hashes" the
characters that appear as cases in the switch statement.
If the Bloom filter contains the current char, we go and execute the switch; if not, we simply
move on. Even with a fairly high number of false positives, it is faster in the average case.
This depends heavily on the number of chars in the switch() statement, but if that number is
large we can widen the filter from int to long in order to reduce the number of false positives.

I have not tried this approach on this concrete example, but I have on a very similar one.


Something along these lines:

// Builds a 32-bit filter: one bit per distinct value of the low five bits of each char.
private static int buildFilter(final char[] s, final int len) {
    int bFilter = 0;
    for (int i = 0; i < len; i++) {
        bFilter |= 1 << (s[i] & 0x1f);
    }
    return bFilter;
}


and then you need to check:

char c = ...; // the character to check
if ((bFilter & (1 << (c & 0x1f))) == 0) {
    // c is definitely not one of the switch cases, so skip the switch
}
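
To make the idea concrete, here is a minimal sketch of the guard in front of a switch, using
the 64-bit variant mentioned above (a long filter and a 0x3f mask instead of int and 0x1f).
The class name, the five-character ACCENTED set and the stripAccent() helper are made up for
illustration; this is not the ISOLatin1AccentFilter code.

```java
// Sketch only: all names here are illustrative assumptions, not Lucene code.
final class BloomGuardSketch {
    // Hypothetical subset of the characters the switch handles.
    static final char[] ACCENTED = { '\u00C0', '\u00C9', '\u00E0', '\u00E9', '\u00FC' };

    // 64-bit filter built once from the switch cases.
    static final long FILTER = buildFilter(ACCENTED);

    // Sets one bit per distinct value of the low six bits of each char.
    static long buildFilter(final char[] s) {
        long filter = 0L;
        for (char c : s) {
            filter |= 1L << (c & 0x3f);
        }
        return filter;
    }

    // False means "definitely not a switch case"; true means "maybe".
    static boolean mightNeedSwitch(final char c) {
        return (FILTER & (1L << (c & 0x3f))) != 0;
    }

    static char stripAccent(final char c) {
        if (!mightNeedSwitch(c)) {
            return c; // common case: one mask and one test, no switch
        }
        switch (c) {
            case '\u00C0': return 'A';
            case '\u00C9': return 'E';
            case '\u00E0': return 'a';
            case '\u00E9': return 'e';
            case '\u00FC': return 'u';
            default: return c; // false positive: filter said "maybe", switch says no
        }
    }

    static String stripAccents(final String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            out.append(stripAccent(in.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("cl\u00E9s"));
    }
}
```

Note that '@' (0x40) shares its low six bits with 0xC0, so it passes the filter and falls
through to the default case: a false positive costs one extra switch, never a wrong result.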


----- Original Message ----
From: Dawid Weiss (JIRA) <jira@apache.org>
To: java-dev@lucene.apache.org
Sent: Tuesday, 21 August, 2007 10:51:31 AM
Subject: [jira] Commented: (LUCENE-871) ISOLatin1AccentFilter a bit slow


    [ https://issues.apache.org/jira/browse/LUCENE-871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521361
] 

Dawid Weiss commented on LUCENE-871:
------------------------------------

I was a bit curious about it, so I decided to write a table-lookup version. It does come out
faster, but only by a small margin (especially on "server", hotspot JVMs). 

Timings are in milliseconds, the round consisted of 100000 repetitions of parsing the test
string "Des mot clés À LA CHAÎNE À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ
Ò Ó Ô Õ Ö Ø Œ Þ Ù Ú Û Ü Ý Ÿ à á â ã ä å æ ç è é ê ë ì í î ï
ð ñ ò ó ô õ ö ø œ ß þ ù ú û ü ý ÿ". Note it is biased since most characters
do have accents, which will not be the case in real life, I guess... but still:

// SUN JVM build 1.6.0-b105, -server mode
Round (old): 1922
Round (old): 1688
Round (old): 1656
Round (old): 1687
Round (old): 1641
Round (old): 1703
Round (old): 1672
Round (old): 1672
Round (old): 1687
Round (old): 1719
Round (new): 1719
Round (new): 1609
Round (new): 1609
Round (new): 1594
Round (new): 1625
Round (new): 1578
Round (new): 1625
Round (new): 1594
Round (new): 1625
Round (new): 1656

// SUN JVM, 1.6.0, interpreted (-client)

Round (old): 2391
Round (old): 2453
Round (old): 2359
Round (old): 2375
Round (old): 2391
Round (old): 2359
Round (old): 2156
Round (old): 2532
Round (old): 2422
Round (old): 2359
Round (new): 1969
Round (new): 1906
Round (new): 1922
Round (new): 1937
Round (new): 1985
Round (new): 1922
Round (new): 1906
Round (new): 1937
Round (new): 1985
Round (new): 1922

// IBM JVM 1.5.0 (don't know why it's so sluggish, really).

Round (old): 7906
Round (old): 7188
Round (old): 7625
Round (old): 7312
Round (old): 7266
Round (old): 7141
Round (old): 7015
Round (old): 5641
Round (old): 5578
Round (old): 5672
Round (new): 4656
Round (new): 4406
Round (new): 4516
Round (new): 4516
Round (new): 4375
Round (new): 4375
Round (new): 4343
Round (new): 4297
Round (new): 4344
Round (new): 4266

// IBM 1.5.0, -server (note the speed improvement when the old version is hotspot-optimized).

Round (old): 5922
Round (old): 5078
Round (old): 5078
Round (old): 5062
Round (old): 4985
Round (old): 4875
Round (old): 4953
Round (old): 4641
Round (old): 3640
Round (old): 3735
Round (new): 3750
Round (new): 3781
Round (new): 3656
Round (new): 3516
Round (new): 3515
Round (new): 3594
Round (new): 3547
Round (new): 3562
Round (new): 3532
Round (new): 3531

So... it does come out a bit faster. Whether it makes sense to spend 130 KB of memory on
this improvement... I don't know, really. I'll upload the table-lookup version for your reference.
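
For readers without the attachment at hand, a table-lookup variant could look roughly like
the sketch below. This is a guess at the general shape, not the uploaded patch: the class
name and the handful of mappings are assumptions, and it only covers one-to-one replacements
(characters that expand to two letters would need extra handling). A table with one char per
UTF-16 code unit takes 65536 * 2 bytes = 128 KiB, which is in line with the figure above.

```java
// Sketch only: illustrative names and mappings, not the code attached to LUCENE-871.
final class TableLookupSketch {
    // One entry per UTF-16 code unit: 65536 entries of 2 bytes each.
    static final char[] TABLE = new char[Character.MAX_VALUE + 1];

    static {
        // Start with the identity mapping so unaccented chars pass through unchanged.
        for (int i = 0; i < TABLE.length; i++) {
            TABLE[i] = (char) i;
        }
        // Hypothetical subset of single-char replacements.
        TABLE['\u00C0'] = 'A';
        TABLE['\u00C9'] = 'E';
        TABLE['\u00E0'] = 'a';
        TABLE['\u00E9'] = 'e';
        TABLE['\u00FC'] = 'u';
    }

    // Replaces every char by its table entry: one array read per char, no branching.
    static String stripAccents(final String in) {
        char[] out = in.toCharArray();
        for (int i = 0; i < out.length; i++) {
            out[i] = TABLE[out[i]];
        }
        return new String(out);
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("cl\u00E9s"));
    }
}
```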

> ISOLatin1AccentFilter a bit slow
> --------------------------------
>
>                 Key: LUCENE-871
>                 URL: https://issues.apache.org/jira/browse/LUCENE-871
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.0.1, 2.1, 2.2
>            Reporter: Ian Boston
>            Assignee: Michael McCandless
>             Fix For: 2.3
>
>         Attachments: fasterisoremove1.patch, fasterisoremove2.patch, ISOLatin1AccentFilter.java.patch,
>                      LUCENE-871.take4.patch
>
>
> The ISOLatin1AccentFilter is a bit slow, giving 300+ ms responses when used in a highlighter
> for output responses.
> Patch to follow

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org





