Date: Sat, 21 Apr 2001 20:09:09 -0700
From: Justin Erenkrantz
To: new-httpd@apache.org, dev@apr.apache.org
Subject: Web archival of mailing lists
Message-ID: <20010421200909.G25098@ebuilt.com>

Hi all,

I believe that Roy has mentioned to some of you that I've been working on a module that will process mbox archives and display them in a nice format on the web with some other cool features.

Well, I think that we are at a stage where we would like some feedback from the Apache community. It has progressed far enough that I think it is stable and feature-complete. Everyone I have shown it to so far has given positive feedback. Now, for the real critics...

You may see mod_mbox in action at:

    http://www.apachelabs.org/

I currently have the entire new-httpd and apr-dev archives on there. Note that this month's archive of both these lists is from a few days ago. I also have ht://Dig running, which should allow searching of the archives. Please feel free to hammer the box. I'm not exactly sure how efficient ht://Dig is, but it seems to work reasonably well (the search databases are a bit too large for my taste, though).

The current snapshot of the mod_mbox code is on the website. mod_mbox is an Apache-2.0 module. The indexing programs use only APR.
Note that I do not currently have access to Win32 platforms - it may not compile there, but I doubt that there is anything too platform-specific - it is all based on APR. I have tested this on Linux, FreeBSD, and Solaris.

You take your mbox file and generate the index (see the provided generate_index.c file). This creates all of the DBMs necessary for mod_mbox. Simply add "AddHandler mbox-file .mbox" to your httpd.conf (or use another mechanism that achieves the same goal of setting the handler to either mbox-file or mbox-handler) and you are up and running with mod_mbox.

Due to the current build system, it is not particularly straightforward to build an external module with dependent objects. I have tried to include enough "hints" in the tarball to provide guidelines for building mod_mbox from the source. I don't intend for what is on apachelabs.org to be a "release," but rather a "snapshot."

mod_mbox has the advantage over MHonArc in that it will only index the mbox file when you explicitly tell it to (using the generate_index program) rather than whenever a new message is delivered. Here at eBuilt, we've had to alter our internal mailing-list archival strategy to compensate for the fact that MHonArc cannot handle large lists well. Ideally, mod_mbox scales better. generate_index on a 750MB mbox file takes about two or three minutes (Sun U5/360). The only storage explicitly required for mod_mbox is the DBMs. And, with such a high-traffic list, you can run the indexer a few times a day rather than each time a new message is delivered.

I do believe that Roy intends to check mod_mbox into the httpd-2.0 and apr-util trees so that it becomes part of the standard Apache distribution. Since I don't have commit access, please don't discuss the merits of mod_mbox's inclusion with me (I'm biased anyway). =-) I do think a lot of sites would find this incredibly useful - in my opinion, apache.org is at the top of that list.

Note that we intend to convert parts of the display logic to filters, but that really shouldn't affect the majority of the mbox code or what it displays (just how). I think this is a good time to gather feedback on what we have so far.

Now, to provide an overview of the mod_mbox module (functionally and architecturally):

There are two real components to mod_mbox. The first is mod_mbox.c, which is the actual Apache module. Currently, there is not much to this file - it is basically a wrapper around the other files. This file handles the displaying of the actual message. mod_mbox is intended to be a handler ("mbox-file" and "mbox-handler") and produces a "virtual namespace" that the user can browse. There are two main URIs of interest for each mbox:

    http://foo.example.com/your.mbox/index.html
    http://foo.example.com/your.mbox/threads.html

The default index is sorted by date, and the threading index is sorted by date as well. (I'll explain how the threading works later.) The indexes provide links into the mbox file, based on the message-id, of the format:

    http://foo.example.com/your.mbox/message-id

The second component is all of the other files, which constitute the core of the mbox functionality (parsing, threading, sorting, etc.). My intention is that these could be placed within apr-util. mod_mbox uses DBMs to "cache" all of the relevant information about the mbox (date, subject, from, references, offset within the original file, etc.). This makes the display of the index and retrieval of a message fairly efficient while retaining the original archive. Note that I have only tried it with the SDBM included in apr-util - I imagine that it'd work with Sleepycat DB and GDBM (apr-dbm has hooks for these, but part of this project was to test out the httpd/apr/apr-util code).

The other key functionality is the threading algorithm.
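
To make the DBM cache idea above a little more concrete, here is a rough sketch in Python (not the module's C, and not taken from the mod_mbox source). It assumes a plain "From "-separated mbox, stores one JSON record per message keyed by Message-ID, and uses Python's dbm.dumb as a stand-in for SDBM. The names generate_index, lookup, fetch_message, and the record fields are made up for illustration.

```python
import dbm.dumb
import email
import json
import os
import tempfile

# Tiny two-message mbox used only for illustration.
SAMPLE_MBOX = (
    b"From alice@example.com Sat Apr 21 20:09:09 2001\n"
    b"Message-ID: <1@example.com>\n"
    b"From: alice@example.com\n"
    b"Subject: Web archival\n"
    b"Date: Sat, 21 Apr 2001 20:09:09 -0700\n"
    b"\n"
    b"First message.\n"
    b"\n"
    b"From bob@example.com Sun Apr 22 09:00:00 2001\n"
    b"Message-ID: <2@example.com>\n"
    b"From: bob@example.com\n"
    b"Subject: Re: Web archival\n"
    b"References: <1@example.com>\n"
    b"Date: Sun, 22 Apr 2001 09:00:00 -0700\n"
    b"\n"
    b"Reply.\n"
)


def generate_index(mbox_path, index_path):
    """Scan the mbox once and cache per-message metadata in a DBM."""
    with open(mbox_path, "rb") as mbox:
        # Byte offset of every "From " separator line, plus end of file.
        # Assumption: body lines beginning with "From " are already escaped.
        starts, offset = [], 0
        for line in mbox:
            if line.startswith(b"From "):
                starts.append(offset)
            offset += len(line)
        starts.append(offset)

        with dbm.dumb.open(index_path, "n") as index:
            for start, end in zip(starts, starts[1:]):
                mbox.seek(start)
                raw = mbox.read(end - start)
                # Drop the separator line before parsing the headers.
                _, _, rfc822 = raw.partition(b"\n")
                msg = email.message_from_bytes(rfc822)
                record = {
                    "offset": start,
                    "length": end - start,
                    "date": msg.get("Date", ""),
                    "from": msg.get("From", ""),
                    "subject": msg.get("Subject", ""),
                    "references": msg.get("References", "").split(),
                }
                key = msg.get("Message-ID", "").strip()
                index[key.encode()] = json.dumps(record).encode()


def lookup(index_path, message_id):
    """Return the cached record for one message without touching the mbox."""
    with dbm.dumb.open(index_path, "r") as index:
        return json.loads(index[message_id.encode()])


def fetch_message(mbox_path, index_path, message_id):
    """Seek straight to the message using the cached offset and length."""
    record = lookup(index_path, message_id)
    with open(mbox_path, "rb") as mbox:
        mbox.seek(record["offset"])
        return mbox.read(record["length"])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        mbox_file = os.path.join(tmp, "your.mbox")
        with open(mbox_file, "wb") as fh:
            fh.write(SAMPLE_MBOX)
        generate_index(mbox_file, os.path.join(tmp, "your.index"))
        print(lookup(os.path.join(tmp, "your.index"), "<2@example.com>"))
```

The point of the layout is that an index page only needs the small records, and a single message view needs one DBM lookup plus one seek into the untouched archive.
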
I based my threading implementation on Jamie Zawinski's mail threading algorithm (he wrote the original versions of Netscape Mail - see http://www.jwz.org/doc/threading.html). His key point was not to store the threading tree in the database, but to generate the tree on the fly. It has proved to be very efficient and highly accurate. Note that I did not use any of his code - I only used his description of the algorithm. This portion of the code is quite complex (although I wrote it in a span of 24 hours). I have tested it against threads I know (from our internal mailing lists) and it seems reasonably accurate. Subtle bugs may still exist. If you find one, any help tracking it down would be greatly appreciated.

For the rest of the implementation details, please see the source code. Open source is nice that way.

I look forward to hearing any comments or suggestions y'all might have.

Thanks in advance,
Justin Erenkrantz
jerenkrantz@ebuilt.com
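
For anyone who wants a feel for the "generate the tree on the fly" idea without reading the C, here is a much-simplified sketch in Python. It is not the mod_mbox code and covers only the References-linking step; as far as I understand the jwz description, it also prunes empty containers and groups by subject, which this leaves out. The record layout (message_id, date as an integer timestamp, references as a list) and the names Container, build_threads, and flatten are assumptions made for the example.

```python
class Container:
    """One node in the thread tree; may stand in for a message never seen."""

    def __init__(self, message_id):
        self.message_id = message_id
        self.seen = False   # True once the real message has been processed
        self.date = 0       # assumed: integer timestamp from the index
        self.parent = None
        self.children = []


def _reaches(node, target):
    """True if target is node itself or one of node's ancestors."""
    while node is not None:
        if node is target:
            return True
        node = node.parent
    return False


def _sort_date(node):
    """Date used for ordering; empty containers borrow from their replies."""
    if node.seen or not node.children:
        return node.date
    return min(_sort_date(child) for child in node.children)


def build_threads(records):
    """Build the thread forest in memory from flat index records."""
    table = {}

    def container(message_id):
        if message_id not in table:
            table[message_id] = Container(message_id)
        return table[message_id]

    for rec in records:
        node = container(rec["message_id"])
        node.seen, node.date = True, rec["date"]

        # Chain the References header: each id is the parent of the next.
        prev = None
        for ref in rec.get("references", []):
            cur = container(ref)
            if prev is not None and cur.parent is None and not _reaches(prev, cur):
                cur.parent = prev
                prev.children.append(cur)
            prev = cur

        # The last reference is this message's parent; skip links that loop.
        if prev is not None and not _reaches(prev, node):
            if node.parent is not None:
                node.parent.children.remove(node)
            node.parent = prev
            prev.children.append(node)

    roots = [c for c in table.values() if c.parent is None]
    return sorted(roots, key=_sort_date)


def flatten(roots):
    """Depth-first listing of (depth, message_id), siblings ordered by date."""
    out = []

    def walk(node, depth):
        out.append((depth, node.message_id))
        for child in sorted(node.children, key=_sort_date):
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return out


# Records may arrive in any order; "<x>" is referenced but never present.
RECORDS = [
    {"message_id": "<c>", "date": 3, "references": ["<a>", "<b>"]},
    {"message_id": "<a>", "date": 1, "references": []},
    {"message_id": "<b>", "date": 2, "references": ["<a>"]},
    {"message_id": "<d>", "date": 4, "references": []},
    {"message_id": "<e>", "date": 5, "references": ["<x>"]},
]

if __name__ == "__main__":
    for depth, message_id in flatten(build_threads(RECORDS)):
        print("  " * depth + message_id)
```

Nothing about the tree is written back to storage here: the only inputs are the per-message records, which is what makes rebuilding the view on each request workable.
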