Mailing-List: contact cocoon-dev-help@xml.apache.org; run by ezmlm
Precedence: bulk
Reply-To: cocoon-dev@xml.apache.org
Reply-To: <bloritsch@apache.org>
From: "Berin Loritsch" <bloritsch@apache.org>
To: <cocoon-dev@xml.apache.org>
Subject: RE: [RT]: Calculating the cache key
Date: Tue, 21 May 2002 09:33:49 -0400
Message-ID: <001701c200cc$2a160610$ac00a8c0@Gabriel>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: 7bit
In-Reply-To: <GMEBIBHGAOFGJCDPJANDCEBOEGAA.cziegeler@s-und-n.de>
Importance: Normal

> From: Carsten Ziegeler [mailto:cziegeler@s-und-n.de] 
> 
> Berin Loritsch wrote:

<snip/>

> > 
> > 'h'*31^(39-1) + 't'*31^(39-2) .... + 'l'
> > 
> > I'll let you do the math yourself.
> 
> Yes, but what if my strings are, let's say 1000 characters 
> long? And looking at the code for example in the xslt 
> transformer or some other places this is a realistic value.

My point exactly.  The longer the string, the longer it takes
to compute the hash.  The longer it takes to compare the strings
exactly (string.equals()).  You get the picture.

Now, out of that 1000 characters, how many actual dirivations
do you get?

Extending the length of the key merely to ensure unique hash
values is a hack.  Eventually you will get collisions.  The
real nead is for a _quick_ validation, and a unique key.


<snip/>

> The current caching algorithm uses this compound key with the 
> exception that the longs are used to build a string:
> 
> "FileGenerator:fe01b342;XSLTTransformer:fb021431;XSLTTransform
> er:fb02542e;
> HTMLSerializer:1"

Yes, but all that information can be expressed in the integers
themselves.  The strings are useful for debugging purposes, but
the longer they get the more they hurt performance, and the more
they invalidate the value of caching.

The Integer to String conversion is also expensive.  We are doing
things in strings that would be better done with numbers.

> 
> This string can than simply be used for comparissions etc.

It is quicker to match 4 ints than 1000 chars.

> 
> The string approach was used as the algorithm does not know 
> how many components the key will have beforehand. So the 
> array for holding all 
> the components has to be build dynamically, which decreases 
> performance a little bit.

Hense the compound key approach.  A new entry in the compound
key for each component.  The component can keep track of what
the actual entries for it mean.  The compound key does not have
to be limited to only 4 entries.  Just like a string but faster.

> 
> Of course, I see your point, that comparing a (hashed) long 
> value is much faster than comparing strings. 
> 
> But to build the hash value a lot of string operations are 
> used anyway and the hashing takes time. And all this has to 
> be done to compare two 
> keys.
> So is it really still faster than simply concatenating 
> strings and comparing them?
> 
> I really don't know - perhaps we could write a performance 
> test case and see what's faster.

By using the compound key (a small list of longs or ints), the
Components themselves have a unique number--combined with the
mapping of the values themselves to the integer.

Let's say we have the FileGenerator (most likely to have the
highest number of unique resources).

It is quicker to match

"hello.xml"

than it is to match

"FileGenerator:fb0234ad"

It might even be quicker to store the filenames in a list, and
use the index of the entry as part of the unique key.

But testing will bear this out.


---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org