cocoon-users mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject Re: Aha! got it! 64k limit(was: new version of the sql logicsheet under development)
Date Thu, 31 Aug 2000 14:36:17 GMT
Robin Green wrote:
> 
> Stefano Mazzocchi <stefano@apache.org> wrote:
> >>Robin wrote:
> > > > Even if you're in C2 and using SAX, the overhead of SAX->DOM->SAX is
> > > > probably less than SAX->filesystem->SAX. Just pass in an array or
> > > > Vector of literal XML fragments as DOM objects (Elements,
> > > > DocumentFragments, TextNodes, and/or Attributes) to a one-time
> > > > initialization method in the XSPPage class, which then stores them
> > > > as a field (array or Vector). This is thread safe because it is
> > > > only called once, before first execution. Then the
> > > > populateDocument method can use <xsp:expr>-type code to insert
> > > > these DOM objects directly into the output (cloning them first).
> >
> >Sorry but I don't get it.
> 
> I'll have to write some demonstration code. At some point...
> 
> > > Where would we store such a DOM pool? Wouldn't we need to serialize it
> > > somehow?
> >
> >Exactly.
> 
> In the instance of the generated XSPPage class!

Ok, starting to get it. Doesn't this create memory problems? Keep in
mind we could have tons of XSP pages and DOMs are not known to be
exactly memory efficient.

> > > Sounds cool!
> >
> >No, it doesn't (at least to me)
> >
> >The whole purpose of compiled server pages is to reduce request time
> >interpretation overhead.
> >
> >What you are doing is translating XSP into something much more
> >digestible (I grant you that) and probably much faster than other
> >approaches, but still require interpretation.
> >
> >I would propose to compile the XML into a serialized form of SAX events,
> >store pointers to the file and run them from the XSP code.
> 
> But your "serialized form" also requires interpretation!! You have
> filesystem and OS call overhead, I don't.

I do have filesystem and OS overhead if I use the disk to keep this
information... but if I cache it in memory, I have the exact same
overhead and memory usage as you have, and I still win: you have to
crawl the DOM tree to produce SAX events, I don't, since the SAX events
are already precompiled in that event flow.

> >Or another approach is to just serialize strings and use
> >randomAccessStreams to get the string out...
> 
> But strings are not the problem. Remember strings are never inlined directly
> into bytecode. See the JVM spec.

Ok, got it.

> >hmmm, this would solve the
> >new String() encoding slowness since we could compile the strings
> >directly into Unicode or something...
> 
> That's true - useful to bear in mind for serialization generally, even if we
> don't use this approach for solving the 64k problem.

Correct, but by having a pre-encoded and precompiled SAX event stream
(either in memory or on disk) I'm sure we can solve the 64Kb method
problem, the Unicode encoding cost of new String() [note that SAX uses
char arrays exactly for that], the DOM memory overhead and the DOM
crawling time, all at the expense of parsing token-separated value
tables (or, even better, binary streams with precalculated array lengths).
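
To make the binary-stream idea a bit more concrete, here is a minimal
sketch in Java. Everything in it is an assumption made for illustration:
the SaxCode class, its opcodes and the byte layout are hypothetical, not
a defined Cocoon format, and attributes and namespaces are left out to
keep it short. It records events as opcode bytes followed by
length-prefixed data, then replays them into a SAX2 ContentHandler from
a single loop, with no DOM in between.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical "saxcode" format (an assumption, not a Cocoon format):
// one opcode byte per event, followed by length-prefixed data.
final class SaxCode {
    static final int END = 0;
    static final int START_ELEMENT = 1;
    static final int END_ELEMENT = 2;
    static final int CHARACTERS = 3;

    /** Records a sequence of events into a byte array. */
    static final class Writer {
        private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        private final DataOutputStream out = new DataOutputStream(bytes);

        Writer startElement(String name) throws IOException {
            out.writeByte(START_ELEMENT);
            out.writeUTF(name);
            return this;
        }

        Writer endElement(String name) throws IOException {
            out.writeByte(END_ELEMENT);
            out.writeUTF(name);
            return this;
        }

        Writer characters(String text) throws IOException {
            out.writeByte(CHARACTERS);
            out.writeInt(text.length()); // precalculated array length
            out.writeChars(text);        // raw UTF-16 code units, 2 bytes each
            return this;
        }

        /** Appends the end marker and returns the encoded stream. Call once. */
        byte[] finish() throws IOException {
            out.writeByte(END);
            out.flush();
            return bytes.toByteArray();
        }
    }

    /** Replays an encoded stream into a handler from one tight loop. */
    static void replay(byte[] code, ContentHandler handler)
            throws IOException, SAXException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(code));
        AttributesImpl noAttributes = new AttributesImpl(); // attributes omitted in this sketch
        handler.startDocument();
        for (int op = in.readByte(); op != END; op = in.readByte()) {
            switch (op) {
                case START_ELEMENT: {
                    String name = in.readUTF();
                    handler.startElement("", name, name, noAttributes);
                    break;
                }
                case END_ELEMENT: {
                    String name = in.readUTF();
                    handler.endElement("", name, name);
                    break;
                }
                case CHARACTERS: {
                    int length = in.readInt();
                    char[] buffer = new char[length];
                    for (int i = 0; i < length; i++) {
                        buffer[i] = in.readChar();
                    }
                    handler.characters(buffer, 0, length);
                    break;
                }
                default:
                    throw new IOException("unknown opcode " + op);
            }
        }
        handler.endDocument();
    }
}
```

Character data goes straight from the stream into a char[] of known
length, so replay never builds a String for it. Element names still go
through readUTF in this sketch; a real format would probably keep them
in a name table instead.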

NOTE: it might turn out that iterating over a stream read is just as
fast as having all the SAX event methods inlined. This is because such
a tight loop clearly becomes a hotspot, while an inlined list of method
calls is executed only once per request, so a hotspot JVM runs it
slower.

But I agree we need some code to test the figures.

> Is there a canonical unicode encoding?

Yes, there is. This is why I think we should store all this compiled XML
into a pure binary stream and define our own format to "serialize it"
like this.

> However, in any case... you don't know how your JVM stores strings (JLS
> specifies that JVM implementations can store strings however they want) so
> how do you know which encoding will be best? Measure timings, I suppose -
> but the benefits may vary from VM to VM. E.g. Sun 1.3 uses "compressed
> strings" or something, allegedly.

Good point, but there is only one Unicode, no matter what.

> >Yeah, careful: we are doing this for C2 (I don't care about fixing this
> >bug for C1, there is no point in wasting time doing so)
> 
> Well maybe some of us (Uli) would like to see it fixed soon, in C1! :)

Yeah, well, don't count on my help then :)

> > > We've hit the 64k limit mostly because of the way we inline strings,
> 
> No. Inlined strings are not the problem, I reiterate yet again.
> 
> If I look at Uli's code and find that inlined strings ARE the problem, I
> will eat my hat. ;)

Ok, I trust you.

> > > but if we separate data from code (and if we use Ulrich's common-case
> > > "synthetic" methods), we should be able to raise this limit
> > > considerably.
> >
> >Agreed, and I think my proposed solution could be both relatively easy
> >to implement and good enough performance-wise.
> 
> I have to disagree. Sticking it in a DOM vector would be both easy enough
> and avoid the needless performance hit of reading and writing to disc.

You have to be able to write on disk anyway, otherwise you'd have to
keep everything in memory always or recompile the XML at need and cache
that. Yes, the second alternative could be a solution, but it's clearly
much faster to read it from disk directly than read a file thru a parser
and serialize the SAX events in memory.

> Remember, we wouldn't even _need_ to write .class files to disc if it
> weren't for the fact that javac insists on it - custom ClassLoaders can
> quite happily load from byte[]s.

If you have 10Gb of RAM I would entirely agree... in real life, however,
this is not always the case, and if I can avoid recompiling after my
cache was flushed, well, I'll do it.
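
For reference, the byte[] loading mentioned above needs very little
code. A minimal sketch, assuming a hypothetical ByteArrayClassLoader
(the name and the single-class design are made up for this example and
are not part of Cocoon); the only real API it relies on is
ClassLoader.defineClass:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical loader that defines exactly one class from an in-memory
// byte[] instead of reading a .class file from disk.
final class ByteArrayClassLoader extends ClassLoader {
    private final String className;
    private final byte[] bytecode;

    ByteArrayClassLoader(ClassLoader parent, String className, byte[] bytecode) {
        super(parent);
        this.className = className;
        this.bytecode = bytecode;
    }

    // Called by loadClass() only after the parent loader failed to find the class.
    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        if (!name.equals(className)) {
            throw new ClassNotFoundException(name);
        }
        return defineClass(name, bytecode, 0, bytecode.length);
    }

    // Helper for the example: drains a stream into a byte[].
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

The catch discussed above still applies: the byte[] lives only as long
as the JVM does, so after a restart or a cache flush the bytecode has to
come from somewhere, either the disk or a fresh compilation.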

> Okay, I suddenly see a problem with my approach, which is maybe what you
> were getting at - have to reprocess page every time you restart the server,
> if you don't store the serialized XML on disc. But C1 halfway does this
> reprocessing anyway - it reparses the page every time in ProducerFromFile,
> which is very wasteful!

Now you start getting it: of course, it's a waste. This is the very
reason why XSP was rewritten from scratch for C2, because in case you
didn't know, this doesn't happen in C2!!!


So, while I agree that caching everything in memory increases speed
(this is a no-brainer), you can't assume you have enough memory to store
everything forever. No way. And since disk space is almost free compared
to RAM (and will be for years to come), I say that we compile both the
code and the XML content into "binary executable forms", which means
bytecode for classes and saxcode for our XML files.

Then, the class has to "interpret" this saxcode to spit out content,
and the instance of it is stored in memory after being loaded from
disk (and discarded using normal cache flushing algorithms, which are
already in place).

> Compromise would be to save serialized form in a background thread, but also
> store it as a vector of DOM objects in the XSPPage for faster access.

No need for background threads: it could be a two step approach

1) compile the xsp into Java bytecode and save to disk
2) compile the xsp into SAX bytecode and save to disk

Then the XSP engine

3) receives an instance from the Java classloader (which has internal
caching)
4) receives an instance from the SAX docloader (which has internal
caching)
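
Until that is defined properly, here is one guess at what step 4 could
look like, as a minimal sketch in modern Java. All of it is assumption:
the SaxDocLoader class, the idea that saxcode is one byte[] per file, and
the least-recently-used eviction are placeholders for whatever the real
design turns out to be.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical "SAX docloader": hands out compiled saxcode, reading it
// from disk only when it is not already cached in memory.
final class SaxDocLoader {
    private final Map<Path, byte[]> cache;
    private int diskReads = 0;

    SaxDocLoader(final int limit) {
        // An access-ordered LinkedHashMap gives simple LRU eviction.
        this.cache = new LinkedHashMap<Path, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Path, byte[]> eldest) {
                return size() > limit;
            }
        };
    }

    // Returns the saxcode for a file, loading it from disk on a cache miss.
    synchronized byte[] load(Path file) throws IOException {
        byte[] code = cache.get(file);
        if (code == null) {
            code = Files.readAllBytes(file);
            diskReads++;
            cache.put(file, code);
        }
        return code;
    }

    // Number of times load() had to go to disk; useful for measuring.
    synchronized int diskReads() {
        return diskReads;
    }
}
```

A flushed entry costs one disk read to bring back, not a reparse of the
original XML, which is the whole point of saving the compiled form.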

I'll define in a later email (but on the dev list this time) what I
mean by

 - SAX bytecode
 - SAX docloader

For now, just enjoy the elegance of the symmetry.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<stefano@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------
 Missed us in Orlando? Make it up with ApacheCON Europe in London!
------------------------- http://ApacheCon.Com ---------------------


