Mailing-List: contact users-help@activemq.apache.org; run by ezmlm
Precedence: bulk
Reply-To: users@activemq.apache.org
Message-ID: <ec6e67fd0805082355w1e514d84y236cdb994ebb9abd@mail.gmail.com>
Date: Fri, 9 May 2008 07:55:58 +0100
From: "James Strachan" <james.strachan@gmail.com>
To: users@activemq.apache.org
Subject: Re: SMTP Server (Apache James) spooling hints
In-Reply-To: <19113841.125261210236974250.JavaMail.root@elysia.void.it>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <19113841.125261210236974250.JavaMail.root@elysia.void.it>

2008/5/8 Stefano Bagnara <apache@bago.org>:
> Hi all,
>
>  I'm an Apache JAMES committer and I'm "almost" new to ActiveMQ.

Welcome :)

>  I'm starting analysis on how to replace our default spool with ActiveMQ and
> I hope you can give me some hints :-)
>  It would be better to use ActiveMQ via JMS (more flexibility) but if there
> is any better solution to our problems by using specific ActiveMQ APIs then
> why not!!

I'd be tempted to use the JMS API as (i) you can then switch JMS
providers if you ever need to and (ii) lots of the internal APIs to
things like data stores & transaction logs do change over time.
Though maybe Camel is even easier (more on this later...)


>  Our scenario is an SMTP Server so we have something like this:
>
>  1) SMTP Server receives messages and puts them in the spool. The spool has
> to be persistent because once the message has been posted via SMTP we cannot
> lose it. Most of the time the message will be consumed very fast, so in the
> past I looked at using Kaha directly for this, but maybe the 5.0 AMQ Message
> Store already handles this in a performant way?

Yeah - I'd use the default persistence engine in ActiveMQ 5.x, the AMQ
Store which is very fast...
http://activemq.apache.org/amq-message-store.html

basically just use the out-of-the-box config :)


>  2) Our current spooling has this architecture:
>  we have a single "spool" that contains messages with a "state". We read a
> random message from the spool, look at its state and then start the
> processing depending on the state itself. At the end of the processing we can
> alter the state and leave the message in the spool, or we can remove it from
> the spool. In the processing we could even push more messages into the spool
> (e.g. to split the message into 2 different paths). ATM there is no
> transaction management.
>  The processing from a state to another (or to delete) is a sequence of
> micro-processings (named matchers/mailets in james), so the actual status
> depends also on what matchers/mailets have been processed so far, but we
> currently keep this in memory and never store this. So if something goes
> wrong (given that we don't have transactions) we simply start from the
> beginning of that "state processor" (I'd like to improve this issue, too,
> with the new ActiveMQ based spool).

Using transactions is a good idea; then you can atomically process a
number of messages and they are either processed or not in an ACID
way. To improve performance you might wanna use batches; say
processing 1000 messages in a single transaction; which means that
most of the operations are asynchronous & fast, other than the
transaction commit which does a sync-to-disk.
http://activemq.apache.org/should-i-use-transactions.html
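
As a rough illustration of the batching idea (plain Java only, no JMS or
ActiveMQ classes; BatchCommitSketch, processInBatches and handle are
made-up names for this example), the sketch below counts how many
commits a run needs when a commit happens once per batch instead of
once per message. With a transacted JMS session the place where the
counter is bumped is where Session.commit() would go.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchCommitSketch {

    // Processes every message and returns how many commits were needed
    // when committing once per full batch (plus once for any remainder).
    static int processInBatches(List<String> messages, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int commits = 0;
        int pending = 0;
        for (String message : messages) {
            handle(message);
            pending++;
            if (pending == batchSize) {
                commits++;      // assumption: a real spool would commit the transaction here
                pending = 0;
            }
        }
        if (pending > 0) {
            commits++;          // commit the final, partially filled batch
        }
        return commits;
    }

    // Placeholder for the real per-message work (hypothetical).
    static void handle(String message) {
    }

    public static void main(String[] args) {
        List<String> messages = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            messages.add("mail-" + i);
        }
        System.out.println("commits with batches of 1000: " + processInBatches(messages, 1000));
        System.out.println("commits with batches of 1:    " + processInBatches(messages, 1));
    }
}
```

The trade-off is that a failure rolls back the whole batch, so the batch
size is a balance between fewer disk syncs and more rework on failure.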


>  Sometimes the message is simply moved from one state to another a few
> times and then it is removed from the spool for one of 2 reasons:
>  a) it has been moved to the "outgoing spool" (the spool for the messages to
> be sent to other smtp servers)
>  b) it has been posted to a user inbox.
>  Other times the message is altered in its content.
>  So you see in James we currently have a single "message store" and we can
> "lock on a message" (so no other thread will take it) "retrieve it", "update
> and unlock it" (alter its state or state+content) or "remove it". How would
> you manage this with ActiveMQ?

With ActiveMQ you'd use a queue per state/mailet, remove the message
from the queue, do something with it then put it on some other queue(s)
(either changed or the same message). The simple JMS/MOM model of
sending to a queue or consuming from a queue turns out to be very fast;
allowing a highly asynchronous, SEDA-style model to go really fast since
there's no locking or leasing required - and messages can flow very
asynchronously to boost throughput.
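
To show the shape of that, here is a minimal in-memory sketch (plain
Java, with java.util.concurrent queues standing in for broker queues;
QueuePerStateSketch and the "root"/"transport" state names are only
illustrative). Each state has its own queue, and a step consumes from
one queue, runs a mailet-like function and produces to the next queue,
so nothing is ever locked or updated in place.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.UnaryOperator;

final class QueuePerStateSketch {

    // One queue per state name; a real spool would use broker queues instead.
    private final Map<String, BlockingQueue<String>> queues = new ConcurrentHashMap<>();

    BlockingQueue<String> queue(String state) {
        return queues.computeIfAbsent(state, s -> new LinkedBlockingQueue<>());
    }

    // Takes one message from the `from` queue, applies the mailet and puts the
    // result on the `to` queue. Returns false when no message was waiting.
    boolean step(String from, String to, UnaryOperator<String> mailet) {
        String mail = queue(from).poll();
        if (mail == null) {
            return false;
        }
        queue(to).add(mailet.apply(mail));
        return true;
    }

    public static void main(String[] args) {
        QueuePerStateSketch spool = new QueuePerStateSketch();
        spool.queue("root").add("Subject: hi");
        spool.step("root", "transport", mail -> "X-Checked: yes\n" + mail);
        System.out.println(spool.queue("transport").peek());
    }
}
```

With a real broker the poll and the add would sit inside one
transaction, so a crash between them neither loses nor duplicates the
message.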

If you do find you wanna do a grab - edit - put back type thing a lot
you could look at using JavaSpaces (or Entity Beans :). But I think for
JAMES messaging could work well, as it sounds to me (as a newbie
JAMES person) like what you're doing processing mail is kinda a pipes
and filters type model...
http://activemq.apache.org/camel/pipes-and-filters.html

which maps very well to messaging and queues.

For more background see :
http://activemq.apache.org/camel/enterprise-integration-patterns.html

btw you could maybe use Camel to describe how mail is routed from
JAMES to different mailets & queues? Then you wouldn't have to worry
about learning the JMS API (and we could switch to different spool
implementations later on if need be). It'd also then make it easier to
decide when to use queues. e.g. you might have 5 mailets; you could
put each one of them on a queue; or rather than 5 writes to a queue
you could invoke all 5 mailets in one go (in the same transaction) -
or something in between.
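
The "all 5 in one go" end of that spectrum can be sketched in plain
Java as simple function chaining (MailetChainSketch and chain are
made-up names, and the String-to-String functions are only stand-ins
for real mailets). One consumer would then read a message once, run the
whole chain and write it once, instead of paying a queue write per
mailet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

final class MailetChainSketch {

    // Folds several mailet-like steps into a single step that runs them in order.
    static UnaryOperator<String> chain(List<UnaryOperator<String>> mailets) {
        return mail -> {
            String current = mail;
            for (UnaryOperator<String> mailet : mailets) {
                current = mailet.apply(current);
            }
            return current;
        };
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> mailets = new ArrayList<>();
        mailets.add(mail -> mail + "|spam-checked");
        mailets.add(mail -> mail + "|virus-checked");
        mailets.add(mail -> mail + "|footer-added");

        // One read, three steps, one write: the in-between options are just
        // shorter chains with a queue between them.
        System.out.println(chain(mailets).apply("mail-1"));
    }
}
```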


>  3) Outgoing spool:
>  The outgoing spool in JAMES is a spool like the main spool, with the
> difference that a message delivery could fail and there is a retry schedule.
> So we try to send a message, on failure we try again 10 minutes later, then
> 30 minutes later, then 2 hours later (it is configurable) and so on. ATM we
> store the "next-attempt-date" and then each "deliverer" simply takes the
> message with the earliest next-attempt-date and if it is due for delivery it
> starts its work, otherwise it will simply wait the needed time (one
> deliverer is notified when a *new* message enters this spool / They all "wait"
> on the spool and the spool is notified once at each store).
>  The most common case is:
>  a) the message we received at #1 entered the spool #2 and is processed very
> fast and it ends in the outgoing spool #3 where it is delivered on the first
> attempt. In this case it would be cool if the message was in memory and
> simply written once for safety because the processing should be fast and it
> would be slow to read it again from the disk.
>  b) we fail our first attempt, then it does not make sense to keep it in
> memory because we know we won't need it in the next X minutes/hours.
>  Any suggestions on how to do this with ActiveMQ?

It sounds like you could use the delayer pattern...
http://activemq.apache.org/camel/delayer.html


Then have separate queues for '30 mins later', '1 hour later', '2 hours later'.

If delivery fails you send it to the next queue where messages are
attempted to be delivered in order; but just X mins from the time they
are added to the queue.

Something kinda like this in pseudo camel code...

from("activemq:output.dispatch.attempt.1").bean(MyDispatchThingy.class);
from("activemq:output.dispatch.attempt.2").delay(thirtyMins).bean(MyDispatchThingy.class);
from("activemq:output.dispatch.attempt.3").delay(oneHour).bean(MyDispatchThingy.class);
from("activemq:output.dispatch.attempt.4").delay(twoHours).bean(MyDispatchThingy.class);

Then we'd just need to use the try/catch mechanism or a custom ErrorHandler
http://activemq.apache.org/camel/error-handler.html

so that if MyDispatchThingy fails to dispatch the message we dispatch
it to the next queue in the list (or delete it if we're on attempt 4
etc).
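
The routing decision such an error handler has to make is small enough
to sketch in plain Java (no Camel or JMS classes here; RetryLadderSketch
and its method names are hypothetical, and the delays simply mirror the
pseudo code above): given the attempt that just failed, pick the next
queue, or give up after the last one.

```java
import java.time.Duration;
import java.util.Optional;

final class RetryLadderSketch {

    // Delay applied before each attempt; the first attempt is immediate.
    private static final Duration[] DELAYS = {
        Duration.ZERO, Duration.ofMinutes(30), Duration.ofHours(1), Duration.ofHours(2)
    };

    static int maxAttempts() {
        return DELAYS.length;
    }

    static String queueFor(int attempt) {
        checkAttempt(attempt);
        return "output.dispatch.attempt." + attempt;
    }

    static Duration delayFor(int attempt) {
        checkAttempt(attempt);
        return DELAYS[attempt - 1];
    }

    // Queue to send a message to after it failed on `attempt`;
    // empty means there are no attempts left and the message is dropped.
    static Optional<String> nextQueueAfterFailure(int attempt) {
        checkAttempt(attempt);
        return attempt < DELAYS.length
                ? Optional.of(queueFor(attempt + 1))
                : Optional.empty();
    }

    private static void checkAttempt(int attempt) {
        if (attempt < 1 || attempt > DELAYS.length) {
            throw new IllegalArgumentException("attempt must be 1.." + DELAYS.length);
        }
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= maxAttempts(); attempt++) {
            System.out.println(queueFor(attempt)
                    + " delay=" + delayFor(attempt)
                    + " onFailure=" + nextQueueAfterFailure(attempt).orElse("(give up)"));
        }
    }
}
```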


>  As a last point we have to take care of 2 different use-cases:
>  I) most traffic is done by fast-moving small messages but


The nice thing about the above is that you can then control
concurrency on each one of the attempt queues. So you could have, say,
1000 threads doing attempt1, and 10 threads doing attempt2 and just
one thread doing attempt 3 or 4 etc.


>  II) many messages are 1-10MB in size, and a few messages could be even 100MB
> or even more: how should we handle these messages in ActiveMQ given that we
> can't keep them in memory but simply want to stream them in and out of
> the server?

JMS/MOM is designed for relatively modest messages as JMS clients and
brokers try and keep messages around in RAM for maximum caching,
performance and throughput.

So you might wanna implement some kinda mechanism where messages over
a certain size, say over 10MB, use BlobMessages - that is to say out of
band payloads...
http://activemq.apache.org/blob-messages.html

so you use JMS/ActiveMQ for the high performance reliable load
balancing across a cluster of boxes; but keep the message payloads on
some file system/JCR etc. Or maybe you try a middle ground where you
keep the message headers in the JMS message but leave the body as a
separate out of band entity; so you could use smart JMS routing using
message headers.
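
A minimal sketch of that size cut-off, in plain Java (this is not the
ActiveMQ BlobMessage API; OutOfBandPayloadSketch, Envelope and wrap are
made-up names, and a local directory stands in for the file system/JCR
store): small bodies travel inline, bodies over the threshold are
written out once and only a reference travels with the message.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class OutOfBandPayloadSketch {

    static final int THRESHOLD_BYTES = 10 * 1024 * 1024; // "say over 10MB"

    // What would travel through the queues: either the body itself or a reference to it.
    static final class Envelope {
        final byte[] inlineBody; // null when the body is stored out of band
        final Path bodyRef;      // null when the body is carried inline

        private Envelope(byte[] inlineBody, Path bodyRef) {
            this.inlineBody = inlineBody;
            this.bodyRef = bodyRef;
        }

        // Loads the body from wherever it lives.
        byte[] body() {
            try {
                return inlineBody != null ? inlineBody : Files.readAllBytes(bodyRef);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // Keeps bodies up to the threshold inline; writes larger ones to storeDir.
    static Envelope wrap(byte[] body, int thresholdBytes, Path storeDir) {
        if (body.length <= thresholdBytes) {
            return new Envelope(body, null);
        }
        try {
            Path file = Files.createTempFile(storeDir, "mail-", ".eml");
            Files.write(file, body);
            return new Envelope(null, file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path store = Paths.get(System.getProperty("java.io.tmpdir"));
        Envelope small = wrap(new byte[1024], THRESHOLD_BYTES, store);
        Envelope large = wrap(new byte[THRESHOLD_BYTES + 1], THRESHOLD_BYTES, store);
        System.out.println("small inline: " + (small.inlineBody != null));
        System.out.println("large stored at: " + large.bodyRef);
        large.bodyRef.toFile().delete(); // clean up the demo file
    }
}
```

Each processing step can then route on the small envelope and only
touch the stored body when it actually needs the content.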


>  I understand this is a lot of questions, but I would really appreciate any
> hint, even partial. I'm collecting ideas :-)

:)

>  Stefano
>
>  PS: we are also evaluating using JCR for inboxes if you were wondering, but
> this is another story, for another list ;-)

You could store the mail in JCR and use messaging for the process
flow. e.g. the JMS messages could just contain a reference (URL?) to
the message payload.

How often is the payload of the message mutated as it goes through
mailets? If it remains kinda static and it's more the headers, states
& mailets that change mostly, it could be worth putting the payload in
some file system / REST resource / JCR and just referring to the
payload for large messages (say over 1-10MB)?

If a message has to go through, say, 5 different steps that you might
wanna load balance and cluster using different queues; it'd be painful
to read/write a 100MB email body for each of the 5 steps if the payload
never changes through the 5 steps.

-- 
James
-------
http://macstrac.blogspot.com/

Open Source Integration
http://open.iona.com