lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mark Miller (JIRA)" <>
Subject [jira] Commented: (LUCENE-937) Make CachingTokenFilter faster
Date Fri, 22 Jun 2007 12:50:27 GMT


Mark Miller commented on LUCENE-937:

I am testing on Java 1.5

I tested with LinkedList using get, LinkedList using iterator, ArrayList() (defaults to 10),
ArrayList(16), ArrayList(30), ArrayList(60)

For a handful of tokens, all methods are about the same speed. (3-10 tokens) LinkedList, ArrayList(16)
and ArrayList are a smidgen faster than ArrayList(30-60) though.

At 15-40 tokens all of the methods are still about the same speed, though ArrayList(30) may
be a hair faster (than even ArrayList(10))

At 50-300 tokens (weighted towards 50), you see a 47% increase in speed using ArrayList(16
or 30) and ArrayList(60), the other methods are about the same speed.

At 150-900 tokens (weighted towards 150), you see a 100% increase in speed using ArrayList(30)
-- ArrayList(60) is no better,
the other methods take twice as long (even ArrayList(10)).

I used Reuters data, shortening it and stiching it together for the various sizes.

The first test was the average speed of about 21,000 runs on 21,000 different docs.

The second test was the average speed of about 21,000 runs on 21,000 different docs.

The third test was the average speed of about 15,000 runs on 15,000 different docs.

The fourth test was the average speed of about 5,000 runs on 5,000 different docs.

I don't completely understand why, but ArrayList(30) or ArrayList(16) blow everything else
out of the water. ArrayList(16) is a smidgen faster at very low
token counts, while ArrayList(30) is a smidgen faster above 50 or so tokens. The differences
are almost negligible though.

ArrayList(30) or ArrayList(16) are approx the same speed as LinkedList (get and iterator are
always approx the same speed)
 at low numbers and get better and better VERY quickly as the number of tokens goes up.

I'd recommend ArrayList(16) as the best choice for this class. It is no worse than LinkedList
on very small documents, and much better than LinkedList on small to large documents.

A friend was recently telling me that ArrayList defaulted to 16, but it does not -- it defaults
to 10. He must have been confused and known that it *should* default to 16. 16 is a much better
default number than 10 due to the exponential growth when doubling the array on resize. It
may take a bit more memory, but it gets a LOT more speed.

- Mark

> Make CachingTokenFilter faster
> ------------------------------
>                 Key: LUCENE-937
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: CachingTokenFilter.patch
> The wrong data structure was used for the CachingTokenFilter. It should be an ArrayList
rather than a LinkedList. There is a noticeable difference in speed.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message