cocoon-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Vadim Gritsenko" <vadim.gritse...@verizon.net>
Subject RE: [RT]: Calculating the cache key
Date Wed, 22 May 2002 12:39:22 GMT
> From: Geoff Howard [mailto:ghoward@crosswalk.com]
> 
> The performance of key comparison is especially important for high
traffic
> sites (like ours).  An idea I've had to decrease this importance is:
> 
> An *option* to have the pipeline do asynchron cache validation - if a
cached
> item exists, use it *then* check it's validity (or cause it to be
checked)

This (asynchron cache validation) won't help you speed up cache key
manipulations. To get cached object validity, you must first create (and
calculate hash value of) the key, only then you can kick off
asynchronous validation and serve content from the cache (if present).


> and remove it from the store if it was expired.  This is attractive to
us
> because we'll have many steps that will need to be checked (many of
our
> pages will involve 5 or 6 or more generators aggregated and included)
and
> the performance hit of generating and comparing all those keys is
scary.

You are right thinking that key generation in this scenario takes lots
of time. If you comment out toString() method in the
AggregatedCacheValidity, you could get as much as 50% speed improvement
for cached results - which means that yes, string manipulations take
significant amount of time. 

Berin's point is valid, fast cache will requires use of compound integer
or long keys.


Vadim


> This integrates well with the external cache validity concept that
Marcelo
> Ochoa/DBPrism has been working on.
> 
> Geoff Howard
> 
> -----Original Message-----
> From: Carsten Ziegeler [mailto:cziegeler@s-und-n.de]
> Sent: Tuesday, May 21, 2002 9:16 AM
> To: cocoon-dev@xml.apache.org
> Subject: RE: [RT]: Calculating the cache key
> 
> 
> 
> Berin Loritsch wrote:
> >
> > <snip/>
> >
> > I'm wondering about the whole hash algorithm.  Concatenating
> > strings can get expensive, especially if it is done in multiple
> > sections mutliple times.  The fact is that hash values as a long
> > provide a virtually limitless number of possibilities for cache
> > values in a site.  With 2^32 (or over 4 billion unique entries)
> > you are not likely to ever need to repeat.
> >
> > Here is the algorythm for a string hash:
> >
> > s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
> >
> >
> > Your possible hash values are limited by the values that can
> > exist in a string.  The shorter your string the fewer possible
> > values.  The fact that they use 31 as a multiplier is only so
> > that they can spread the hash values better (multiples of 2
> > are even more limiting).
> >
> > Using the formula, you can translate this:
> >
> > "http://localhost:8080/cocoon/hello.html"
> >
> > into this:
> >
> > 'h'*31^(39-1) + 't'*31^(39-2) .... + 'l'
> >
> > I'll let you do the math yourself.
> 
> Yes, but what if my strings are, let's say 1000 characters long?
> And looking at the code for example in the xslt transformer or
> some other places this is a realistic value.
> 
> >
> > It also means that computing the hash key takes a proportionate
> > amount of time based on the length of the string.  What makes it
> > worse is the fact that the unique part of the string is only after
> > the "http://localhost:8080/cocoon/" section.  We are doing more
> > work for less gain.
> >
> > All a hashCode is supposed to buy us is a quick way to get to the
> > information we need.  A way to get us to the "bucket" if you will,
> > which if there are collisions we need to iterate through the entries
> > in the bucket until we find our match.  This takes even longer when
> > we are dealing with strings.
> >
> > What we need is a way of either generating a match based on a GUID
> > (Globally Unique IDentifier), or create a compound key that has
> > a group of id's generated from each component.  The compound key
> > approach has a drawback in that in order to match a key, you have to
> > iterate through all the parts of the key.  However you can easily
> > generate a hash value from the parts, and there are fewer parts.
> >
> > Consider a compound key that has an int for each part.  That integer
> > is made up of the top half marking the component ID (there will be
less
> > than 65,536 unique types of components), and the unique entry within
> > that component (another 65,536 entries).  We can shift the balance
> > as necessary, but keep up.  A compound key has one part for each
> > component that is part of the pipeline.  Something like this:
> >
> > p[0]=fe01b342 { FileGenerator, resource="hello.xml" }
> > p[1]=fb021431 { XSLTTransformer, resource="layout.xsl" }
> > p[2]=fb02542e { XSLTTransformer, resource="skin.xsl" }
> > p[3]=fc010000 { HtmlSerializer }
> >
> > This is a compound key with only four entries (much quicker to
iterate
> > through the entries than a really long string).  The Hash algorithm
is
> > an exercise I can let people far more intelligent than I to come up
> > with.
> > Basically it should be driven by the four entries.
> >
> > Perhaps something along these lines:
> >
> > s[0]%n*2^(32/n) + s[1]%n*s^(32/n) .. s[n-1]%n*s^(32/n)
> >
> > I'm no genious, but I think it is better than working with strings
if
> > we really don't have to.
> >
> The current caching algorithm uses this compound key with the
exception
> that the longs are used to build a string:
> 
>
"FileGenerator:fe01b342;XSLTTransformer:fb021431;XSLTTransformer:fb02542
e;
> HTMLSerializer:1"
> 
> This string can than simply be used for comparissions etc.
> 
> The string approach was used as the algorithm does not know how many
> components the key will have beforehand. So the array for holding all
> the components has to be build dynamically, which decreases
performance
> a little bit.
> 
> Of course, I see your point, that comparing a (hashed) long value is
much
> faster than comparing strings.
> 
> But to build the hash value a lot of string operations are used anyway
> and the hashing takes time. And all this has to be done to compare two
> keys.
> So is it really still faster than simply concatenating strings and
> comparing them?
> 
> I really don't know - perhaps we could write a performance test case
> and see what's faster.
> 
> Carsten


---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org


Mime
View raw message