cocoon-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Carsten Ziegeler" <>
Subject RE: [RT]: Calculating the cache key
Date Tue, 21 May 2002 14:15:27 GMT

Berin Loritsch wrote:
> <snip/>
> > > 
> > > 'h'*31^(39-1) + 't'*31^(39-2) .... + 'l'
> > > 
> > > I'll let you do the math yourself.
> > 
> > Yes, but what if my strings are, let's say 1000 characters 
> > long? And looking at the code for example in the xslt 
> > transformer or some other places this is a realistic value.
> My point exactly.  The longer the string, the longer it takes
> to compute the hash.  The longer it takes to compare the strings
> exactly (string.equals()).  You get the picture.
> Now, out of that 1000 characters, how many actual dirivations
> do you get?
> Extending the length of the key merely to ensure unique hash
> values is a hack.  Eventually you will get collisions.  The
> real nead is for a _quick_ validation, and a unique key.
No, I meant it the other way round: the string or the information
from which the hash is build can be longer than 1000 characters.
So hashing this will take some time and you will get collisions.
This can either be solved by not using long as the result of
the hash operation or by reducing the information to be hashed.
But I fear, both seem not to be an option.

> <snip/>
> > The current caching algorithm uses this compound key with the 
> > exception that the longs are used to build a string:
> > 
> > "FileGenerator:fe01b342;XSLTTransformer:fb021431;XSLTTransform
> > er:fb02542e;
> > HTMLSerializer:1"
> Yes, but all that information can be expressed in the integers
> themselves.  The strings are useful for debugging purposes, but
> the longer they get the more they hurt performance, and the more
> they invalidate the value of caching.
> The Integer to String conversion is also expensive.  We are doing
> things in strings that would be better done with numbers.
All information used in the caching is at first represented as strings:
the name of the component, the URI, the parameter values etc.

> > 
> > This string can than simply be used for comparissions etc.
> It is quicker to match 4 ints than 1000 chars.
> > 
> > The string approach was used as the algorithm does not know 
> > how many components the key will have beforehand. So the 
> > array for holding all 
> > the components has to be build dynamically, which decreases 
> > performance a little bit.
> Hense the compound key approach.  A new entry in the compound
> key for each component.  The component can keep track of what
> the actual entries for it mean.  The compound key does not have
> to be limited to only 4 entries.  Just like a string but faster.
> > 
> > Of course, I see your point, that comparing a (hashed) long 
> > value is much faster than comparing strings. 
> > 
> > But to build the hash value a lot of string operations are 
> > used anyway and the hashing takes time. And all this has to 
> > be done to compare two 
> > keys.
> > So is it really still faster than simply concatenating 
> > strings and comparing them?
> > 
> > I really don't know - perhaps we could write a performance 
> > test case and see what's faster.
> By using the compound key (a small list of longs or ints), the
> Components themselves have a unique number--combined with the
> mapping of the values themselves to the integer.
> Let's say we have the FileGenerator (most likely to have the
> highest number of unique resources).
> It is quicker to match
> "hello.xml"
> than it is to match
> "FileGenerator:fb0234ad"
Yes, exactly. This is the base idea of my RT. If we are using
the hashed long to build a string, than simply using this
string (hello.xml) should be faster.

> It might even be quicker to store the filenames in a list, and
> use the index of the entry as part of the unique key.
> But testing will bear this out.
Yupp, any volunteers :)


To unsubscribe, e-mail:
For additional commands, email:

View raw message