cocoon-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Geoff Howard <ghow...@crosswalk.com>
Subject RE: [RT]: Calculating the cache key
Date Tue, 21 May 2002 15:10:45 GMT
The performance of key comparison is especially important for high traffic
sites (like ours).  An idea I've had to decrease this importance is:

An *option* to have the pipeline do asynchron cache validation - if a cached
item exists, use it *then* check it's validity (or cause it to be checked)
and remove it from the store if it was expired.  This is attractive to us
because we'll have many steps that will need to be checked (many of our
pages will involve 5 or 6 or more generators aggregated and included) and
the performance hit of generating and comparing all those keys is scary.
This integrates well with the external cache validity concept that Marcelo
Ochoa/DBPrism has been working on.
 
Geoff Howard

-----Original Message-----
From: Carsten Ziegeler [mailto:cziegeler@s-und-n.de]
Sent: Tuesday, May 21, 2002 9:16 AM
To: cocoon-dev@xml.apache.org
Subject: RE: [RT]: Calculating the cache key



Berin Loritsch wrote:
> 
> <snip/>
> 
> I'm wondering about the whole hash algorithm.  Concatenating
> strings can get expensive, especially if it is done in multiple
> sections mutliple times.  The fact is that hash values as a long
> provide a virtually limitless number of possibilities for cache
> values in a site.  With 2^32 (or over 4 billion unique entries)
> you are not likely to ever need to repeat.
> 
> Here is the algorythm for a string hash:
> 
> s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
> 
> 
> Your possible hash values are limited by the values that can
> exist in a string.  The shorter your string the fewer possible
> values.  The fact that they use 31 as a multiplier is only so
> that they can spread the hash values better (multiples of 2
> are even more limiting).
> 
> Using the formula, you can translate this:
> 
> "http://localhost:8080/cocoon/hello.html"
> 
> into this:
> 
> 'h'*31^(39-1) + 't'*31^(39-2) .... + 'l'
> 
> I'll let you do the math yourself.

Yes, but what if my strings are, let's say 1000 characters long?
And looking at the code for example in the xslt transformer or
some other places this is a realistic value.

> 
> It also means that computing the hash key takes a proportionate
> amount of time based on the length of the string.  What makes it
> worse is the fact that the unique part of the string is only after
> the "http://localhost:8080/cocoon/" section.  We are doing more
> work for less gain.
> 
> All a hashCode is supposed to buy us is a quick way to get to the
> information we need.  A way to get us to the "bucket" if you will,
> which if there are collisions we need to iterate through the entries
> in the bucket until we find our match.  This takes even longer when
> we are dealing with strings.
> 
> What we need is a way of either generating a match based on a GUID
> (Globally Unique IDentifier), or create a compound key that has
> a group of id's generated from each component.  The compound key
> approach has a drawback in that in order to match a key, you have to
> iterate through all the parts of the key.  However you can easily
> generate a hash value from the parts, and there are fewer parts.
> 
> Consider a compound key that has an int for each part.  That integer
> is made up of the top half marking the component ID (there will be less
> than 65,536 unique types of components), and the unique entry within
> that component (another 65,536 entries).  We can shift the balance
> as necessary, but keep up.  A compound key has one part for each
> component that is part of the pipeline.  Something like this:
> 
> p[0]=fe01b342 { FileGenerator, resource="hello.xml" }
> p[1]=fb021431 { XSLTTransformer, resource="layout.xsl" }
> p[2]=fb02542e { XSLTTransformer, resource="skin.xsl" }
> p[3]=fc010000 { HtmlSerializer }
> 
> This is a compound key with only four entries (much quicker to iterate
> through the entries than a really long string).  The Hash algorithm is
> an exercise I can let people far more intelligent than I to come up
> with.
> Basically it should be driven by the four entries.
> 
> Perhaps something along these lines:
> 
> s[0]%n*2^(32/n) + s[1]%n*s^(32/n) .. s[n-1]%n*s^(32/n)
> 
> I'm no genious, but I think it is better than working with strings if
> we really don't have to.
> 
The current caching algorithm uses this compound key with the exception
that the longs are used to build a string:

"FileGenerator:fe01b342;XSLTTransformer:fb021431;XSLTTransformer:fb02542e;
HTMLSerializer:1"

This string can than simply be used for comparissions etc.

The string approach was used as the algorithm does not know how many
components the key will have beforehand. So the array for holding all 
the components has to be build dynamically, which decreases performance
a little bit.

Of course, I see your point, that comparing a (hashed) long value is much
faster than comparing strings. 

But to build the hash value a lot of string operations are used anyway
and the hashing takes time. And all this has to be done to compare two 
keys.
So is it really still faster than simply concatenating strings and
comparing them?

I really don't know - perhaps we could write a performance test case
and see what's faster.

Carsten

---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org


Mime
View raw message