apr-dev mailing list archives

From Justin Erenkrantz <jerenkra...@ebuilt.com>
Subject Web archival of mailing lists
Date Sun, 22 Apr 2001 03:09:09 GMT
Hi all,

I believe that Roy has mentioned to some of you that I've been
working on a module that will process mbox archives and display
them in a nice format on the web with some other cool features.

Well, I think that we are at a stage where we would like some feedback
from the Apache community.  It has progressed far enough that I think it is 
stable and feature-complete.  Everyone I have shown it to so far has 
given positive feedback.  Now, for the real critics...

You may see mod_mbox in action at:


I currently have the entire new-httpd and apr-dev archives on there.
Note that this month's archive of both these lists is from a few days 

I also have ht://Dig running which should allow searching of the 
archives.  Please feel free to hammer the box.  I'm not exactly
sure how efficient ht://Dig is, but it seems to work reasonably
well (the search databases are too large for my taste though).

The current snapshot of the mod_mbox code is on the website.  mod_mbox
is an Apache-2.0 module.  The indexing programs use only APR.  Note 
that I do not currently have access to Win32 platforms - it may not 
compile on there, but I doubt that there is anything too platform 
specific - it is all based on APR.  I have tested this on Linux, 
FreeBSD, and Solaris.

You take your mbox file and generate the index (see the provided 
generate_index.c file).  This creates all of the DBMs necessary for 
mod_mbox.  Simply add "AddHandler mbox-file .mbox" to your httpd.conf
(or use another mechanism that achieves the same goal of setting the
handler to either mbox-file or mbox-handler) and you are up with mod_mbox.  
Due to the current build system, it is not particularly 
straightforward to build an external module with dependent objects.  
I have tried to include enough "hints" in the tarball to provide
guidelines as to building mod_mbox from the source.  I don't intend
for what is on apachelabs.org to be a "release," but rather a 

mod_mbox has the advantage over MHonArc in that it will only index
the mbox file when you explicitly tell it to (use the generate_index
program) rather than when a new message is delivered.  Here at eBuilt,
we've had to alter our internal mailing-list archival strategy to
compensate for the fact that MHonArc cannot handle large lists
well.  Ideally, mod_mbox scales better.  Running generate_index on a
750MB mbox file takes about two or three minutes (Sun U5/360).  The only
storage explicitly required by mod_mbox is the DBMs.  And, with
such a high-traffic list, you can run the index a few times a day 
rather than when each new message is delivered.  
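
To make the explicit indexing pass concrete, here is a minimal sketch
of the kind of scan such a pass performs.  It is an illustration only,
not code from generate_index.c: the function name mbox_scan_offsets is
hypothetical, the mbox is assumed to be held in memory, and any line
beginning with "From " is treated as a message separator (a real
parser would be stricter).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of mod_mbox: record the byte offset of
 * every message in an in-memory mbox buffer.  A message is assumed to
 * start at any line beginning with "From ".  Returns the number of
 * messages found; at most max_offsets offsets are stored. */
static size_t mbox_scan_offsets(const char *buf, size_t len,
                                size_t *offsets, size_t max_offsets)
{
    size_t count = 0;
    size_t pos = 0;

    assert(buf != NULL && offsets != NULL);

    while (pos < len) {
        const char *nl;

        /* pos always sits at the start of a line here. */
        if (len - pos >= 5 && memcmp(buf + pos, "From ", 5) == 0) {
            if (count < max_offsets)
                offsets[count] = pos;
            count++;
        }

        /* Advance to the start of the next line, if there is one. */
        nl = memchr(buf + pos, '\n', len - pos);
        if (nl == NULL)
            break;
        pos = (size_t)(nl - buf) + 1;
    }
    return count;
}
```

Because the scan is a single pass over the file, it can be rerun a few
times a day from a scheduled job instead of on every delivery.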

I do believe that Roy intends to check mod_mbox into the httpd-2.0
and apr-util trees so that it becomes part of the standard Apache
distribution.  Since I don't have commit access, please don't discuss 
the merits of mod_mbox's inclusion with me (I'm biased anyway).  =-)  
I do think a lot of sites would find this incredibly useful - in my 
opinion, apache.org is number one on this list.

Note that we intend to convert parts of the display logic to filters, 
but that really shouldn't affect the majority of the mbox code and what 
it displays (just how).  I think this is a good time to gather feedback on 
what we have so far.

Now, to provide an overview of the mod_mbox module (functionally and

There are two real components to mod_mbox.  The first is mod_mbox.c
which is the actual Apache module.  Currently, there is not much to
this file - it is basically a wrapper around the other files.  This
file handles the displaying of the actual message.  mod_mbox is
intended to be a handler ("mbox-file" and "mbox-handler") and
produces a "virtual namespace" that the user can browse.

There are two main URIs of interest for each mbox:


The default index is sorted by date, and the threading index is
sorted by date as well.  (I'll explain how the threading works 
later.)  The indexes provide links based on the message-id into the 
mbox file of the format:


All of the other files constitute the core of the mbox functionality
(parsing, threading, sorting, etc.).  My intention is that these could
be placed within apr-util.  mod_mbox uses DBMs to "cache" all of the 
relevant information about the mbox (date, subject, from, references, 
offset within the original file, etc.).  This makes the display of the 
index and retrieval of a message fairly efficient while retaining the 
original archive.
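
As an illustration of the caching idea, the sketch below keeps one
small record per message and uses the stored offset to pull the message
out of the untouched mbox.  Everything here is an assumption made for
the example: the record layout, the names mbox_index_entry,
mbox_index_find and mbox_fetch, and the linear search standing in for a
DBM fetch keyed by message-id.  It is not mod_mbox's actual DBM format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical index record.  The fields echo the ones listed above,
 * but the layout is an assumption, not mod_mbox's stored format. */
typedef struct {
    const char *msgid;    /* Message-ID header value, used as the key */
    const char *subject;
    const char *from;
    size_t offset;        /* byte offset of the message in the mbox */
    size_t length;        /* byte length of the message */
} mbox_index_entry;

/* Linear search standing in for a DBM fetch keyed by message-id.
 * Returns NULL when the id is not indexed. */
static const mbox_index_entry *mbox_index_find(const mbox_index_entry *entries,
                                               size_t n, const char *msgid)
{
    size_t i;

    assert(entries != NULL && msgid != NULL);
    for (i = 0; i < n; i++) {
        if (strcmp(entries[i].msgid, msgid) == 0)
            return &entries[i];
    }
    return NULL;
}

/* Copy the indexed message out of the mbox buffer and NUL-terminate it.
 * Returns the number of bytes copied, or 0 if the entry does not fit
 * inside the mbox or inside dst. */
static size_t mbox_fetch(const char *mbox, size_t mbox_len,
                         const mbox_index_entry *e,
                         char *dst, size_t dst_len)
{
    assert(mbox != NULL && e != NULL && dst != NULL);
    if (e->offset > mbox_len || e->length > mbox_len - e->offset)
        return 0;
    if (e->length >= dst_len)
        return 0;
    memcpy(dst, mbox + e->offset, e->length);
    dst[e->length] = '\0';
    return e->length;
}
```

The point of the design is that the index only stores pointers into the
archive, so the original mbox file never has to be rewritten or split.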

Note that I have only tried it with the SDBM included in apr-util - I 
imagine that it'd work with Sleepycat DB and GDBM (apr-dbm has hooks 
for these, but part of this project was to test out the 
httpd/apr/apr-util code).  

The other key functionality is the threading algorithm.  I based
my threading implementation on Jamie Zawinski's mail threading
algorithms (he wrote the original versions of Netscape Mail - see 
http://www.jwz.org/doc/threading.html).  His key point was not to 
store the threading tree in the database, but generate the tree on 
the fly.  It has proved to be very efficient and highly accurate.  
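
The idea of building the tree on the fly can be sketched as follows.
This is a deliberately simplified illustration, not the mod_mbox code
and not the full algorithm from the page above: it only links each
message to the most recent entry in its References list that is
actually present, and the names thread_msg, thread_link and
thread_depth are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_REFS 8

/* Hypothetical per-message data, as it might be read back from the
 * index: the message-id plus the References header, oldest first. */
typedef struct {
    const char *msgid;
    const char *refs[MAX_REFS];
    size_t nrefs;
} thread_msg;

/* Return the position of msgid in msgs, or -1 if it is not present. */
static int find_msg(const thread_msg *msgs, size_t n, const char *msgid)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (strcmp(msgs[i].msgid, msgid) == 0)
            return (int)i;
    }
    return -1;
}

/* Fill parent[i] with the index of message i's parent, or -1 for a
 * thread root.  The parent is the most recent reference that exists in
 * the set, so a missing intermediate message does not break the thread. */
static void thread_link(const thread_msg *msgs, size_t n, int *parent)
{
    size_t i, r;

    assert(msgs != NULL && parent != NULL);
    for (i = 0; i < n; i++) {
        parent[i] = -1;
        for (r = msgs[i].nrefs; r > 0; r--) {
            int p = find_msg(msgs, n, msgs[i].refs[r - 1]);
            if (p >= 0 && (size_t)p != i) {
                parent[i] = p;
                break;
            }
        }
    }
}

/* Nesting depth of message i, usable as an indent level when printing
 * the thread index.  The bound on depth guards against reference cycles. */
static int thread_depth(const int *parent, size_t n, size_t i)
{
    int depth = 0;
    int cur = parent[i];

    while (cur >= 0 && (size_t)depth < n) {
        depth++;
        cur = parent[cur];
    }
    return depth;
}
```

Nothing about the tree is stored: the parent array is recomputed from
the cached headers each time the thread index is displayed.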

Note that I did not use any of his code - I only used his description 
of the algorithm.  This portion of the code is quite complex (although 
I wrote it in a span of 24 hours).  I have managed to test it with 
threads I know (with our internal mailing lists) and it seems reasonably 
accurate.  Subtle bugs may still exist, and any help tracking them 
down would be greatly appreciated.

For the rest of the implementation details, please see the source code.  
Open source is nice that way.

I look forward to hearing any comments or suggestions y'all might have.

Thanks in advance,
Justin Erenkrantz
