> From: Carsten Ziegeler [mailto:cziegeler@s-und-n.de] > > Hi, > > with the new source resolving (from excalibur) we can > implement a better caching mechanism (I mentioned this > several times in the last weeks). > > I'm just wondering, how the cache key should be generated. > > The current solution (Interface Cacheable) uses a long which > is in 99,9% a hash value of the URI (with some additional > information sometimes). This key is unique within the space > of the component > (e.g. file generator, xslt transformer etc). > So the current implementation is nearly always the same: > - Creating a string buffer > - Appending all information > - Hashing the string buffer I'm wondering about the whole hash algorithm. Concatenating strings can get expensive, especially if it is done in multiple sections mutliple times. The fact is that hash values as a long provide a virtually limitless number of possibilities for cache values in a site. With 2^32 (or over 4 billion unique entries) you are not likely to ever need to repeat. Here is the algorythm for a string hash: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] Your possible hash values are limited by the values that can exist in a string. The shorter your string the fewer possible values. The fact that they use 31 as a multiplier is only so that they can spread the hash values better (multiples of 2 are even more limiting). Using the formula, you can translate this: "http://localhost:8080/cocoon/hello.html" into this: 'h'*31^(39-1) + 't'*31^(39-2) .... + 'l' I'll let you do the math yourself. It also means that computing the hash key takes a proportionate amount of time based on the length of the string. What makes it worse is the fact that the unique part of the string is only after the "http://localhost:8080/cocoon/" section. We are doing more work for less gain. All a hashCode is supposed to buy us is a quick way to get to the information we need. A way to get us to the "bucket" if you will, which if there are collisions we need to iterate through the entries in the bucket until we find our match. This takes even longer when we are dealing with strings. What we need is a way of either generating a match based on a GUID (Globally Unique IDentifier), or create a compound key that has a group of id's generated from each component. The compound key approach has a drawback in that in order to match a key, you have to iterate through all the parts of the key. However you can easily generate a hash value from the parts, and there are fewer parts. Consider a compound key that has an int for each part. That integer is made up of the top half marking the component ID (there will be less than 65,536 unique types of components), and the unique entry within that component (another 65,536 entries). We can shift the balance as necessary, but keep up. A compound key has one part for each component that is part of the pipeline. Something like this: p[0]=fe01b342 { FileGenerator, resource="hello.xml" } p[1]=fb021431 { XSLTTransformer, resource="layout.xsl" } p[2]=fb02542e { XSLTTransformer, resource="skin.xsl" } p[3]=fc010000 { HtmlSerializer } This is a compound key with only four entries (much quicker to iterate through the entries than a really long string). The Hash algorithm is an exercise I can let people far more intelligent than I to come up with. Basically it should be driven by the four entries. Perhaps something along these lines: s[0]%n*2^(32/n) + s[1]%n*s^(32/n) .. s[n-1]%n*s^(32/n) I'm no genious, but I think it is better than working with strings if we really don't have to. --------------------------------------------------------------------- To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org For additional commands, email: cocoon-dev-help@xml.apache.org