cocoon-users mailing list archives

From Peter Verhage <pe...@ibuildings.nl>
Subject XSP, parsing big XML files
Date Mon, 11 Dec 2000 08:30:08 GMT
I've made an eggdrop script (an eggdrop is an IRC bot) that logs
everything that is said on a channel to a log file in XML. I want to
parse this file with Cocoon to generate some nice statistics, such as how
many words were said that day. The log files get rotated every
day; one day's log is approximately 130 KB in size.

A log file looks something like this:

<?xml version="1.0" encoding="US-ASCII"?>
<xmlog type="irc" version="1.0" date="20001210"
channel="#you-really-want-to-know">
  <join time="2255" ident="" nick="harold"/>
  <signoff time="2256" ident="" nick="harold">Bye bye bye...</signoff>
  <signoff time="2305" ident=""
nick="someone">baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaai</signoff>
  <join time="2310" ident="" nick="sleepZzZz"/>
  <join time="2320" ident="" nick="twins"/>
  <signoff time="2321" ident=""
nick="twins">http://gmorkIRC.marbie.net</signoff>
  <nick time="2324" ident="" nick="Oldnick" newnick="Newnick"/>
  <msg time="2342" ident="" nick="Aname">this would be a message</msg>
  <msg time="2342" ident="" nick="Anothername">This too...</msg>
</xmlog>

As you can understand, I've anonymized this log file a bit ;). Normally
there is something between the ident quotes, and I've changed the nicks,
messages, channel name etc. But I hope this little example gives you the
general idea.

To generate stats I was planning to use XSP and the XInclude processor
or the util taglib. But now the trouble starts. Because of the size of the
file, Apache JServ times out after a while. I could make it time out later,
but I don't like waiting 5 minutes for one day's stats to generate.
That's why I'm asking how I can optimize this process.
This is what I'm doing right now.

First I used the XInclude processor to include one log file in another
XML file, which would have a reference to the proper XML logicsheet.
That worked fine, but I have more than one day of logs, and I don't
want to change the file name in that file every time. So I switched to
the util taglib, including the file with one of the util tags
and a request parameter. Fortunately this worked, but it also gives me
some trouble.

Then I created a very simple logicsheet:
<?xml version="1.0"?>

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xsp="http://www.apache.org/1999/XSP/Core"
  xmlns:xmlog="http://www.no-nonsense.org/2001/XMLog"
>

<xsl:template match="xsp:page">
  <xsp:page>
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
    </xsl:copy>
    <xsl:apply-templates/>
  </xsp:page>
</xsl:template>

<xsl:template match="xmlog:data">
  <data>
    <xsp:logic>
      int msg = 0;
    </xsp:logic>

    <xsl:for-each select="xmlog/msg">
      <xsp:logic>
        msg++;
      </xsp:logic>
    </xsl:for-each>

    <xsp:expr>msg</xsp:expr>
  </data>
</xsl:template>


<xsl:template match="@*|*|text()|processing-instruction()">
  <xsl:copy>
    <xsl:apply-templates select="@*|*|text()|processing-instruction()"/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>

Notice the match on xmlog:data: that element wraps the included xmlog
file inside the other file, the one that references the logicsheet.

So this looks very simple, and it also works for small pages, but a
page with only 20 log lines already takes 3 seconds to generate, and
for pages with more lines JServ times out. So this does not seem to be
as efficient as I thought it would be.

Another problem appears when I use the util taglib to include the other
file. If I do this, the above example does not work. It seems my
logicsheet gets applied first, and then the util taglib. Of course this
is logical, because the <xmlog:data> tag surrounds the include tags.
This means there is nothing for my logicsheet to process yet: there are
no xmlog tags yet which I can count. When I use XInclude I don't
have this problem (of course), because then the other XML file gets
included in the page first, and then my logicsheet gets
applied. But when I use XInclude I can't dynamically include log files.
:/

As you can see, this is not optimal. I'd rather not increase the timeout
setting of JServ, because I just don't want it to take this long (I
think it could be faster than this). Just to let you know, this is not
meant for directly generating a stats page in real time; it's used to
generate another XML file which I can use later with some nice
stylesheet.

I hope someone knows a way for me to improve the things I mentioned
above. Maybe Cocoon is not a good choice for this, and I have to write
a program of my own (using the Xerces and Xalan Java libraries
directly). But I hope this can be done with Cocoon in an
efficient manner.
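
For comparison, a standalone program of the kind mentioned above could
be sketched roughly as follows. This is only a sketch under a few
assumptions: the class name LogStats and its methods are made up for
illustration, it uses the plain SAX interface through JAXP rather than
any Cocoon API, and it only counts <msg> elements and the
whitespace-separated words inside them.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Cocoon): streams an xmlog file with SAX
// and counts <msg> elements and the words inside them.
class LogStats extends DefaultHandler {
    int messages = 0;
    int words = 0;
    private boolean inMsg = false;
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default JAXP factory is not namespace-aware, so qName is the plain tag name.
        if ("msg".equals(qName)) {
            inMsg = true;
            messages++;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so buffer it until the end tag.
        if (inMsg) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("msg".equals(qName)) {
            String body = text.toString().trim();
            if (!body.isEmpty()) {
                words += body.split("\\s+").length;
            }
            inMsg = false;
        }
    }

    // Parses the stream in one pass without building a tree in memory.
    static LogStats count(InputStream in) {
        try {
            LogStats stats = new LogStats();
            SAXParserFactory.newInstance().newSAXParser().parse(in, stats);
            return stats;
        } catch (Exception e) {
            throw new RuntimeException("could not parse log", e);
        }
    }

    static LogStats countString(String xml) {
        return count(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) {
        String sample =
            "<?xml version=\"1.0\" encoding=\"US-ASCII\"?>"
            + "<xmlog type=\"irc\" version=\"1.0\" date=\"20001210\">"
            + "<join time=\"2255\" ident=\"\" nick=\"harold\"/>"
            + "<msg time=\"2342\" ident=\"\" nick=\"Aname\">this would be a message</msg>"
            + "<msg time=\"2342\" ident=\"\" nick=\"Anothername\">This too...</msg>"
            + "</xmlog>";
        LogStats stats = countString(sample);
        System.out.println(stats.messages + " messages, " + stats.words + " words");
    }
}
```

Because SAX delivers the file as a stream of events, nothing here
depends on the size of the log beyond the length of a single message.
Whether this would actually be faster than a tuned Cocoon setup is
something I have not measured.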

Just for the record: I'm using Apache 1.3.14, Cocoon 1.8.1-dev, and the
latest Apache JServ (the last one ever made) on a Pentium Celeron 433
with 128 MB RAM running FreeBSD 4.1.1. There is no X running and almost
no other processes, so JServ has almost all of my CPU time and
almost all of my RAM to use.

With best regards,

Peter

P.S.
Maybe Cocoon 2 would be faster at processing the above?

-- 
Peter Verhage       <peter@ibuildings.nl>
ibuildings.nl BV - information technology
http://www.ibuildings.nl -  0118 41 50 54
