commons-user mailing list archives

From Tim Dudgeon
Subject Re: [PIPELINE] Questions about pipeline
Date Tue, 28 Oct 2008 12:10:36 GMT
Ken Tanaka wrote:
> Hi Tim,
> Tim Dudgeon wrote:
>> Hi Ken,
>> Thanks for the rapid response.
>> First, let me explain some background here.
>> I am looking for Java based pipelining solutions to incorporate into 
>> an existing application. The use of pipelining is well established in 
>> the  sector, with applications like Pipeline Pilot and Knime, and so 
>> many of the common needs have been well established over several years 
>> by these applications.
> Have you also looked at Pentaho?
I took a look, but it doesn't seem to be what I'm after.

>> Key issues that my initial investigations of Jakarta Pipeline seem to 
>> identify are:
>> 1. Branching is very common. This typically takes 2 forms:
>> 1.1. Splitting data. A stage could (for instance) have 2 output ports, 
>> "pass" and "fail". Data is processed by the stage and sent to 
>> whichever port is appropriate. Different stages would be attached to 
>> each port, resulting in the pipeline being branched by this pass/fail 
>> decision.
>> 1.2. Attaching multiple stages to a particular output port.
>> The stage just sends its output onwards. It has no interest in what 
>> happens once the data is sent, and is not concerned whether zero, one 
>> or 100 stages receive the output. This is the stage1,2,3,4 scenario I 
>> outlined previously.
>> 2. Merging is also common (though less common than branching).
>> By analogy with branching, I would see this conceptually as a stage 
>> having multiple input ports (A and B in the merging example).
> At present, the structure for storing stages is a linked list, and 
> branches are implemented as additional pipelines accessed by a name 
> through a HashMap. To generally handle branching and merging, a directed 
> acyclic graph (DAG) would better serve, but that would require the 
> pipeline code to be rewritten at this level. Arguments could also be 
> made for allowing cycles, as in directed graphs, but that would be 
> harder to debug, and with a GUI might be a step toward a visual 
> programming language--so I don't think this should be pursued yet unless 
> there are volunteers...

I agree that a DAG would be better, but cycles could be needed too, in
which case a general directed graph (DG) would be better still.
And yes, ideally I want a visual designer too.
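
To make the DAG versus DG distinction concrete, here is a minimal sketch of stage wiring held as a directed graph rather than a linked list. The StageGraph class and its method names are hypothetical and are not part of Commons Pipeline; stages are represented only by name. The hasCycle() check is what a DAG-only pipeline could use to reject a loop at assembly time, while a DG-based pipeline would simply skip it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: stage wiring as a directed graph of stage names.
// Nothing here is Commons Pipeline API.
class StageGraph {
    private final Map<String, List<String>> downstream = new HashMap<>();

    // Adds a directed edge and registers both stages as nodes.
    void connect(String from, String to) {
        downstream.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        downstream.computeIfAbsent(to, k -> new ArrayList<>());
    }

    // Stages that receive output directly from the given stage.
    List<String> successors(String stage) {
        return downstream.getOrDefault(stage, new ArrayList<>());
    }

    // Depth-first search: reaching a stage that is still on the
    // current path means the graph contains a cycle.
    boolean hasCycle() {
        Set<String> done = new HashSet<>();
        Set<String> onPath = new HashSet<>();
        for (String stage : downstream.keySet()) {
            if (visit(stage, done, onPath)) {
                return true;
            }
        }
        return false;
    }

    private boolean visit(String stage, Set<String> done, Set<String> onPath) {
        if (onPath.contains(stage)) {
            return true;
        }
        if (done.contains(stage)) {
            return false;
        }
        onPath.add(stage);
        for (String next : downstream.get(stage)) {
            if (visit(next, done, onPath)) {
                return true;
            }
        }
        onPath.remove(stage);
        done.add(stage);
        return false;
    }
}
```

With this shape, a branch is a stage with two or more successors and a merge is a stage reachable from two or more predecessors, so both cases fall out of the same structure.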

>> Taken together I can see a generalisation here using named ports 
>> (input and output), which is similar, but not identical, to your 
>> current concept of branches.
>> So you have:
>> BaseStage.emit(String branch, Object obj);
>> whereas I would conceptually see this as:
>> emit(String port, Object obj);
>> and you have:
>> Stage.process(Object obj);
>> whereas I would conceptually see this as:
>> Stage.process(String port, Object obj);
>> And when a pipeline is being assembled a downstream stage is attached 
>> to a particular port of a stage, not the stage itself. It then just 
>> receives data sent to that particular port, but not the other ports.
> I could see that this would work, but would need either modifying a 
> number of stages already written, or maybe creating a compatibility 
> stage driver that takes older style stages so that the input object 
> comes from a configured port name, usually "input", and sends the 
> output to configured output ports named "output" and whatever the 
> previous branch name(s) were, if any. Stages that used to look for 
> events for input should be rewritten to read multiple inputs ( 
> Stage.process(String port, Object obj) as you suggested). Events would 
> then be reserved for truly out-of-band signals between stages rather 
> than carrying data for processing.

Agreed, I think that would be good. I think existing stages could be 
made compatible by having a default input and output port, and using 
those if no specific port was specified.
A default in/out port would probably be necessary to allow simple 
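
As a rough illustration of the default-port idea, the sketch below wraps an old-style, single-input stage so that it can be driven through a port-based call. All of the types here (LegacyStage, PortSink, LegacyStageAdapter) are hypothetical stand-ins and not Commons Pipeline classes, and the legacy stage is simplified to return its result rather than emit it. The port names "input" and "output" follow Ken's suggestion above.

```java
// Hypothetical old-style stage: one unnamed input and one result.
interface LegacyStage {
    Object process(Object obj);
}

// Hypothetical receiver for port-addressed output.
interface PortSink {
    void emit(String port, Object obj);
}

// Wraps a LegacyStage so it can sit in a port-based pipeline unchanged.
class LegacyStageAdapter {
    static final String DEFAULT_INPUT = "input";
    static final String DEFAULT_OUTPUT = "output";

    private final LegacyStage legacy;
    private final PortSink sink;

    LegacyStageAdapter(LegacyStage legacy, PortSink sink) {
        this.legacy = legacy;
        this.sink = sink;
    }

    // A null port means no specific port was given, so the default applies.
    void process(String port, Object obj) {
        String effective = (port == null) ? DEFAULT_INPUT : port;
        if (!DEFAULT_INPUT.equals(effective)) {
            throw new IllegalArgumentException("Unknown input port: " + effective);
        }
        // The legacy result always leaves through the default output port.
        sink.emit(DEFAULT_OUTPUT, legacy.process(obj));
    }
}
```

Whether an unknown port should be rejected, as here, or silently ignored is a design choice this sketch does not try to settle.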

>> I'd love to hear how compatible the current system is with this way of 
>> seeing things. Are we just talking about a new type of Stage 
>> implementation, or a more fundamental incompatibility at the API level.
> I think you have some good ideas. This is changing the Stage 
> implementation, which affects on the order of 60 stages for us that 
> override the process method, unless the compatibility stage driver works 
> out. The top level pipeline would also be restructured. The amount of 
> work required puts this out of the near term for me to work on it, but 
> there may be other developers/contributors to take this on.

I need to investigate more fully here, and consider the other options.
But this is certainly of potential interest.

So is all that's necessary to prototype this to create a new Stage 
implementation, with new emit( ... ) and process( ... ) methods?
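
For what such a prototype might look like, here is a minimal sketch of a stage with named ports, using the emit(String port, Object obj) and process(String port, Object obj) shapes discussed above. PortedStage, ThresholdStage, CollectingStage and the attach method are all hypothetical names, and delivery is a direct synchronous call; queues, threading and stage drivers are deliberately left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical base class for a stage with named input and output ports.
abstract class PortedStage {
    // Output port name -> downstream stages and the input port each listens on.
    private final Map<String, List<Link>> links = new HashMap<>();

    private static final class Link {
        final PortedStage target;
        final String inputPort;

        Link(PortedStage target, String inputPort) {
            this.target = target;
            this.inputPort = inputPort;
        }
    }

    // Attaches a downstream stage to one output port of this stage.
    void attach(String outputPort, PortedStage target, String inputPort) {
        links.computeIfAbsent(outputPort, k -> new ArrayList<>())
             .add(new Link(target, inputPort));
    }

    // Sends obj to every stage attached to the port; zero listeners is fine.
    protected void emit(String port, Object obj) {
        for (Link link : links.getOrDefault(port, new ArrayList<>())) {
            link.target.process(link.inputPort, obj);
        }
    }

    abstract void process(String port, Object obj);
}

// Splits integers: values at or above the threshold go to "pass",
// all others go to "fail".
class ThresholdStage extends PortedStage {
    private final int threshold;

    ThresholdStage(int threshold) {
        this.threshold = threshold;
    }

    @Override
    void process(String port, Object obj) {
        int value = (Integer) obj;
        emit(value >= threshold ? "pass" : "fail", obj);
    }
}

// Records whatever arrives so the routing can be inspected.
class CollectingStage extends PortedStage {
    final List<Object> received = new ArrayList<>();

    @Override
    void process(String port, Object obj) {
        received.add(obj);
    }
}
```

Here the emitting stage has no knowledge of who is attached, which covers both the pass/fail split (1.1) and the multiple-listeners case (1.2). Merging would correspond to a stage whose process method switches on the incoming port name.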



> -Ken