incubator-s4-user mailing list archives

From Andrea Reale <andrea.re...@unibo.it>
Subject Re: Modeling a pipeline
Date Tue, 17 Jul 2012 09:53:05 GMT
Hi Matthieu,

thanks for your detailed and valuable comments.

Just a thought about your first point that I would like to share: my PEs
are basically stateless, so in theory I can partition any stream with
arbitrary functions and the result would still be correct.
So, let's assume that I use an arbitrary hash function that partitions
the events of my streams into, say, k partitions. This way I will have
k PEs for each of my original processing steps. Let's say that the
number of processing steps is s.
In theory, this way I could use up to k*s processors in parallel to
execute my model, and that's cool!

My question is, do you think that this partitioning would still be
beneficial if I have a number of nodes a lot smaller than k*s?
Maybe one could think about an extension that (for stateless PEs)
automatically partitions their input streams according to the number of
available processing nodes...
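Just to make the idea concrete, here is a rough, self-contained sketch of the kind of hashing key finder I have in mind. Note that the KeyFinder interface and the ImageEvent type below are minimal stand-ins declared locally so the snippet compiles on its own; they only mirror the shape of the real S4 classes and are not the actual API:

```java
import java.util.Collections;
import java.util.List;

// Stand-in for S4's KeyFinder<T> (assumed shape: List<String> get(T event)),
// redeclared here so the example is self-contained.
interface KeyFinder<T> {
    List<String> get(T event);
}

// Hypothetical event type: one image sample with a unique id.
class ImageEvent {
    final long sampleId;
    ImageEvent(long sampleId) { this.sampleId = sampleId; }
}

// Spreads the events of one logical stream over k partitions, so that up to
// k PE instances per stage can process the (stateless) transformation in
// parallel, instead of routing everything to one constant key.
class HashKeyFinder implements KeyFinder<ImageEvent> {
    private final int k;
    HashKeyFinder(int k) { this.k = k; }

    @Override
    public List<String> get(ImageEvent event) {
        // floorMod keeps the partition index non-negative for any hash value.
        int partition = Math.floorMod(Long.hashCode(event.sampleId), k);
        return Collections.singletonList("partition-" + partition);
    }
}

public class Demo {
    public static void main(String[] args) {
        HashKeyFinder finder = new HashKeyFinder(4);
        for (long id = 0; id < 8; id++) {
            System.out.println(id + " -> " + finder.get(new ImageEvent(id)).get(0));
        }
    }
}
```

With k = 4, successive sample ids are spread over partition-0 through partition-3, so each of the s stages can run on up to k PE instances in parallel.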


Have a nice day!
Andrea

On Tue, 2012-07-17 at 10:52 +0200, Matthieu Morel wrote:
> Hi Andrea,
>
> that's an interesting use case, let me add a few things to Shailendra's
> comments:
>
> - S4 typically works best when you can split the processing into many
> partitions and therefore many PE instances. Those instances are spread
> across the multiple nodes, which allows for efficient parallel
> processing, and scaling is straightforward: just add more S4 nodes. If
> you can partition the input to your processing elements, that's ideal,
> you'll get more processing power as you add more nodes.
>
> - as you observed, horizontal scaling is currently static: you need to
> restart the cluster after you add a partition. We are integrating
> checkpointing in piper so that you can minimize state loss in that case.
>
> - by default the partitioning is performed through a hashing function,
> so it's hard to predict which node each message will be sent to. In your
> design with unique keys, all messages for a stream always go to the same
> node, but if you want to control which one, you'll need to modify the
> partitioning function accordingly. We haven't added hooks for that yet,
> though, so you'll have to delve into the source code. But we have ongoing
> efforts related to PE isolation that will do exactly that: exclusively
> assign PEs to nodes (which translates into: messages to that PE only go
> to that node).
>
> Hope this helps!
>
> Matthieu
>
>
>
> On 7/17/12 1:13 AM, Andrea Reale wrote:
> > Shailendra,
> > thanks again for your comments.
> >
> > Yes, I understand what you are saying, but I guess that in such a case a
> > system restart (including the re-definition of the cluster, now with five
> > nodes, and the re-deployment of the application) would allow us to leverage
> > the fifth node as well. Of course one would rather not have to restart the
> > system, but at least you don't have to change or rebuild any code.
> >
> > Regards,
> > Andrea
> >
> > Shailendra Mishra <shailendrah@gmail.com> wrote:
> >
> >
> > Not really, let's assume that you have 30 PEs, each executing one
> > transform. Now you decide to add a 5th physical host, which implies
> > that you will have to start a cluster on this host. If you do that,
> > then even though you deploy a PE, it won't be recognized by the
> > upstream PE, for the simple reason that the input stream to the PE was
> > not registered with ZooKeeper when the upstream app (whatever it
> > may be, an adaptor, a partitioner, etc.) was deployed. We need to do more
> > work for that to happen.
> > - Shailendra
> >
> > On Mon, Jul 16, 2012 at 3:03 PM, Andrea Reale <andrea.reale@unibo.it> wrote:
> >> Hi Shailendra,
> >>
> >> thank you very much for your help. I am exploring S4 for the first time
> >> these days and I am not always sure that I am using the right approach.
> >>
> >> About your observation: you're totally right, I could aggregate multiple
> >> transformations in a single PE and that would certainly reduce the
> >> amount of communication. Indeed, I think that ideally the best approach
> >> is to aggregate transformations so that I have just one PE per CPU.
> >> On the other hand, maybe reducing the number of PEs reduces the
> >> system's ability to "automagically" scale when I add nodes to
> >> the cluster. For example, if I model the pipeline with 4 PEs and I have
> >> four machines (with 1 processor each) I am OK; but, if I add one node,
> >> then I should modify my application in order to use the computing power
> >> of the fifth node. Am I wrong?
> >>
> >> Secondly, I have "made up" this scenario just to be able to experiment
> >> with the system and measure its ability to "scale up", hence I need a
> >> scenario that is simple and flexible.
> >> Any suggestion in this direction will be much appreciated!
> >>
> >> Thanks once again!
> >>
> >> Regards,
> >> Andrea
> >>
> >> On Mon, 2012-07-16 at 13:20 -0700, Shailendra Mishra wrote:
> >>> The approach looks OK, but since I don't know the nature of the
> >>> transforms, a question:
> >>> Wouldn't you like to group multiple transforms (pick a number, say 5)
> >>> per PE, so as to decrease the communication complexity? Or are the
> >>> transforms so compute-intensive that you have to allocate a PE for each?
> >>> - Shailendra
> >>>
> >>> On Mon, Jul 16, 2012 at 12:02 PM, Andrea Reale <andrea.reale@unibo.it> wrote:
> >>>> Hello everyone,
> >>>>
> >>>> I am trying to run some tests on piper, running it in a cluster of four
> >>>> nodes in a particular image processing scenario. I am writing this
> >>>> message to gather opinions about the way I am currently modeling my
> >>>> problem, and whether there is some better way to do it.
> >>>>
> >>>>
> >>>> My scenario is the following:
> >>>> I have a series of OpenCV transformations that can be modeled as a
> >>>> pipeline of 30 stages (at each stage one transformation is applied).
> >>>> My modeling approach is to model each stage as a ProcessingElement, so
> >>>> that the graph is basically something like:
> >>>>
> >>>> .-------.     .-----.                .------.       .-----.
> >>>> | Adpt. |-c0->| St1 |-c1-> ... -c29->| St30 |-c30-> | Snk |
> >>>> '-------'     '-----'                '------'       '-----'
> >>>>
> >>>> Where:
> >>>> - Adpt. is an adapter that reads image samples and creates the
> >>>> corresponding S4 events
> >>>> - St1-30 are the OpenCVPEs
> >>>> - Snk is a PE that stores the results
> >>>> - c0-c30 are 30 different streams
> >>>>
> >>>> Since in this case I do not need events to be processed by keys, I
> >>>> assign to every stream an instance of a KeyFinder which returns a
> >>>> constant; in other words, KeyFinder#get() returns the same value
> >>>> regardless of the particular event (practically, it returns the name
> >>>> of the stream).
> >>>>
> >>>>
> >>>> While I have already run some experiments with this approach (and it appears
> >>>> to work), I would like to know whether you believe there is a better
> >>>> approach for modeling my problem in terms of S4 PEs / streams.
> >>>>
> >>>> Thanks a lot!
> >>>> Andrea
> >>>>
> >>>>
> >>>>
> >>>> RESEARCH IS THERE AND IT SHOWS:
> >>>> 5 per mille to the Università di Bologna - C.F.: 80007010376
> >>>> http://www.unibo.it/5permille
> >>>>
> >>>> This notice is inserted automatically by the system for the sole purpose of pursuing the institutional aims of the organization.
> >>
> >>
> >>
> >
> >
>
>



