commons-user mailing list archives

From Tim Dudgeon <>
Subject Re: [PIPELINE] Questions about pipeline
Date Sat, 25 Oct 2008 14:04:34 GMT
Hi Ken,

Thanks for the rapid response.
First, let me explain some background here.
I am looking for Java-based pipelining solutions to incorporate into an 
existing application. The use of pipelining is well established in the 
  sector, with applications like Pipeline Pilot and Knime, and so many 
of the common needs have been well established over several years by 
these applications.

Key issues that my initial investigations of Jakarta Pipeline seem to 
identify are:

1. Branching is very common. This typically takes 2 forms:
1.1. Splitting data. A stage could (for instance) have 2 output ports, 
"pass" and "fail". Data is processed by the stage and sent to whichever 
port is appropriate. Different stages would be attached to each port, 
resulting in the pipeline being branched by this pass/fail decision.
1.2. Attaching multiple stages to a particular output port.
The stage just sends its output onwards. It has no interest in what 
happens once the data is sent, and is not concerned whether zero, one or 
  100 stages receive the output. This is the stage1,2,3,4 scenario I 
outlined previously.

2. Merging is also common (though less common than branching).
By analogy with branching, I would see this conceptually as a stage 
having multiple input ports (A and B in the merging example).
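
To make the two-input idea concrete, here is a minimal sketch of such a merging stage. It is not Commons Pipeline code: the class name ProductStage, the port names "A" and "B", and the process(String, int) method are all assumptions made for illustration. It emits A * B for every combination by remembering what has arrived on each port so far, so memory use grows with the input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; none of these names come from Commons Pipeline.
final class MergeSketch {

    /** A stage with two named input ports, "A" and "B". */
    static final class ProductStage {
        private final List<Integer> seenA = new ArrayList<>();
        private final List<Integer> seenB = new ArrayList<>();
        /** Stands in for whatever the stage would emit downstream. */
        final List<Integer> output = new ArrayList<>();

        /** Pair each new value with everything already seen on the other port. */
        void process(String port, int value) {
            if ("A".equals(port)) {
                seenA.add(value);
                for (int b : seenB) {
                    output.add(value * b);
                }
            } else if ("B".equals(port)) {
                seenB.add(value);
                for (int a : seenA) {
                    output.add(a * value);
                }
            } else {
                throw new IllegalArgumentException("unknown port: " + port);
            }
        }
    }

    public static void main(String[] args) {
        ProductStage stageC = new ProductStage();
        // Values may arrive on the two ports in any interleaving.
        stageC.process("A", 10);
        stageC.process("B", 111);
        stageC.process("A", 20);
        stageC.process("B", 222);
        System.out.println(stageC.output); // prints [1110, 2220, 2220, 4440]
    }
}
```

Each combination is produced exactly once, whichever port delivers its value last.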

Taken together I can see a generalisation here using named ports (input 
and output), which is similar, but not identical, to your current concept 
of branches.

So you have:
BaseStage.emit(String branch, Object obj);
whereas I would conceptually see this as:
emit(String port, Object obj);
and you have:
Stage.process(Object obj);
whereas I would conceptually see this as:
Stage.process(String port, Object obj);

And when a pipeline is being assembled, a downstream stage is attached to 
a particular port of a stage, not the stage itself. It then just 
receives data sent to that particular port, but not the other ports.
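
As a rough sketch of what I mean by attaching stages to ports (again, this is not the existing Commons Pipeline API; PortStage, AbstractPortStage, connect, EvenFilterStage and CollectorStage are names invented here, and everything runs synchronously in one thread for simplicity):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; none of these types are part of Commons Pipeline.
final class PortWiringSketch {

    /** A stage that receives objects on named input ports. */
    interface PortStage {
        void process(String port, Object obj);
    }

    /** One input port of one downstream stage. */
    static final class Target {
        final PortStage stage;
        final String inputPort;

        Target(PortStage stage, String inputPort) {
            this.stage = stage;
            this.inputPort = inputPort;
        }
    }

    /** Base class that holds the wiring from output ports to downstream stages. */
    abstract static class AbstractPortStage implements PortStage {
        private final Map<String, List<Target>> wiring = new HashMap<>();

        /** Attach a downstream stage's input port to one of this stage's output ports. */
        void connect(String outputPort, PortStage downstream, String inputPort) {
            wiring.computeIfAbsent(outputPort, k -> new ArrayList<>())
                  .add(new Target(downstream, inputPort));
        }

        /** Send obj to every stage attached to the port; having no listeners is fine. */
        protected void emit(String outputPort, Object obj) {
            List<Target> targets = wiring.get(outputPort);
            if (targets == null) {
                return;
            }
            for (Target t : targets) {
                t.stage.process(t.inputPort, obj);
            }
        }
    }

    /** Splitting stage: even integers go to "pass", odd ones to "fail". */
    static final class EvenFilterStage extends AbstractPortStage {
        @Override
        public void process(String port, Object obj) {
            int n = (Integer) obj;
            emit(n % 2 == 0 ? "pass" : "fail", obj);
        }
    }

    /** Terminal stage that just records what it receives. */
    static final class CollectorStage extends AbstractPortStage {
        final List<Object> received = new ArrayList<>();

        @Override
        public void process(String port, Object obj) {
            received.add(obj);
        }
    }

    public static void main(String[] args) {
        EvenFilterStage filter = new EvenFilterStage();
        CollectorStage passed = new CollectorStage();
        CollectorStage alsoPassed = new CollectorStage();
        CollectorStage failed = new CollectorStage();

        // Wiring happens at assembly time; the filter knows nothing about its listeners.
        filter.connect("pass", passed, "in");
        filter.connect("pass", alsoPassed, "in");
        filter.connect("fail", failed, "in");

        for (int i = 1; i <= 4; i++) {
            filter.process("in", i);
        }
        System.out.println("pass: " + passed.received);      // prints pass: [2, 4]
        System.out.println("pass: " + alsoPassed.received);  // prints pass: [2, 4]
        System.out.println("fail: " + failed.received);      // prints fail: [1, 3]
    }
}
```

Adding one more listener is a single connect call, with no change to the upstream stage.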

I'd love to hear how compatible the current system is with this way of 
seeing things. Are we just talking about a new type of Stage 
implementation, or a more fundamental incompatibility at the API level?

Many thanks.


Ken Tanaka wrote:
> Tim Dudgeon wrote:
>> Ken Tanaka wrote:
>>> The Pipeline Basics tutorial has now been incorporated into the 
>>> project page. Thanks to some help and cleanup from Rahul Akolkar the 
>>> documentation submitted was installed quickly. See
>>> -Ken
>> That documentation is really useful. Thanks!
> Wow, someone is actually looking at this. I'll work on cleaning up the 
> documentation some. I hope people realize that some of the color-coded 
> examples got some inadvertent newlines added--but this isn't relevant to 
> your questions.
>> Could I follow up one of the earlier questions in this thread on 
>> branching and merging.
>> From those docs it looks to me like the way data is sent to a branch 
>> is a bit strange. There appears to be a FileReaderStage class that has 
>> a Java bean property called htmlPipelineKey:
>> <stage className="com.demo.pipeline.stages.FileReaderStage"
>> driverFactoryId="df1" htmlPipelineKey="sales2html"/>
>> and later in the pipeline a branch is defined that names the pipeline 
>> according to that name:
>> <pipeline key="sales2html">
>> This seems pretty inflexible to me. Any branches have to be hardcoded 
>> into the stage definition. I was expecting a situation where multiple 
>> stages could be the recipients of the output of any stage, and these 
>> can be "wired up" dynamically. e.g. something like this:
>>          |--stage2
>>          |
>> stage1---+--stage3
>>          |
>>          |--stage4
>> so that all you needed to do was to define a stage5 as one more 
>> downstream stage for stage 1 and it would transparently receive the data.
>> Is this possible, or does the branching have to be hard-coded into the 
>> stage definition?
> I wouldn't call the way branches are specified "hard coding", since the 
> xml file here is a configuration file. For our current use, branches are 
> pretty rare, so the pipeline framework deals best with simple cases that 
> are fairly linear. Also, if stage1 is a branching stage, then that stage 
> was written with branching in mind, and the "htmlPipelineKey" is a 
> hard-coded property name in the stage source code, so it can direct 
> output when it passes data out to the framework. To simplify matters, 
> all your branching stages could follow a convention of using "branchKey" 
> (or some other generic name), then you wouldn't have to remember what 
> variable holds the branch name for which stage.
> A stage could be written to take an arbitrary number of branch names, 
> and thus send output down multiple branches, although it can get 
> complicated configuring rules on what goes where if the same thing isn't 
> going to all the branches. So rather than making stage1 a branching 
> stage, it could be followed by "stageMulti", which would send copies of 
> its input to a number of outputs:
>                  |-----stage2
>                  |
> stage1----stageMulti----stage3
>                  |
>                  |-----stage4
> stageMulti could then be used to add branching to any stage it follows.
> I can imagine making configuration files a little simpler with regards 
> to setting up branching, but the more intelligent configuration file 
> reader to handle that hasn't been written.
>> Similarly for merging. To follow up the previous question, let's say I 
>> had stageA that output some A's and stageB that output some B's (let's 
>> assume both A's and B's are simple numbers). Now I wanted to have a 
>> stageC that takes all A's and all B's and generates some output with 
>> them (let's assume the output is A * B so that every combination of A * 
>> B is output). So this would look like this:
>> stageA--+
>>         |
>>         |----stageC
>>         |
>> stageB--+
>> Is it possible to do this, so that stageA and stageB are both writing 
>> to stageC, but that stageC can distinguish the 2 different streams of 
>> data?
> First off, the current design expects all pipelines to start with one 
> stage, to accept feed values out of the config file (or place command 
> line arguments into the first stage queue if the main pipeline 
> application has been written to do that). So maybe you have a stageInit 
> which takes a single number like "3"
> feed "3" --> stageInit----stageA
>                |
>                ----------stageB
> stageInit can then pass "3" on to stageA and stageB, possibly causing 
> stageA to create 3 2-digit numbers and stageB to create 3 3-digit numbers.
> For merging, stageC will accept normal input from a stage as well as 
> watch for events carrying additional data. stageC may well have to 
> accumulate input and then produce output as events are received. Stages 
> normally accept one input, which is either a feed or the output of the 
> stage immediately preceding them. Input from elsewhere or from more than 
> one source is currently handled as events raised by the source and 
> received by a "notify" method in the receiving stage.
> feed "3" --> stageInit----stageA-------------stageC --> 10*111, 10*222, 
> 10*333, 20*111, 20*222, 20*333, 30*111....
>                |      3          10, 20, 30    :
>                ----------stageB................:
>                       3          111, 222, 333
> ---- normal data flow
> .... event passed data
> Like branching, for our uses merging is rare. Also beware of running out 
> of memory if you are doing any accumulation of data to merge input from 
> more than one stage.
> -Ken
