nifi-users mailing list archives

From: Mark Payne <marka...@hotmail.com>
Subject: Re: Nifi Clustering - work distribution on workers
Date: Wed, 14 Oct 2015 21:06:49 GMT
Mans,

Nodes in a cluster work independently from one another and do not know about each other. That is accurate.

Each node in a cluster runs the same flow. Typically, if you want to pull from HDFS and partition that data across the cluster, you would run ListHDFS on the Primary Node only, and then use Site-to-Site [1] to distribute that listing to all nodes in the cluster. Each node then pulls only the data that it is responsible for and begins working on it. We do realize that it is not ideal to have to set it up this way, and we are working on making it much easier to have that listing automatically distributed across the cluster.
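
In case a sketch helps, one common way to lay that out (the "From ListHDFS" port name is
just an example, and I am assuming FetchHDFS for the actual pull):

    On the Primary Node only (Scheduling Strategy = "On primary node"):
        ListHDFS -> Remote Process Group (Site-to-Site, pointed back at this same cluster)

    On every node:
        Input Port "From ListHDFS" -> FetchHDFS -> ...rest of the flow...

The Remote Process Group spreads the listing FlowFiles across all of the nodes, so each
node fetches a different subset of the files.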

I'm not sure that I understand your #3 - how do we design the workflow so that the nodes work on one file at a time? For each Processor, you can configure how many threads (Concurrent Tasks) are to be used in the Scheduling tab of the Processor Configuration dialog. You can certainly configure that to run only a single Concurrent Task. Keep in mind that this is the number of Concurrent Tasks that will run on each node in the cluster, not the total number across the entire cluster. For example, in a 3-node cluster, a processor configured with 1 Concurrent Task may still be working on up to 3 files at once cluster-wide, one per node.
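
If it helps to see where that setting lives outside the UI, it is also written into each
node's flow.xml.gz. A rough fragment from memory (element names are approximate, so treat
this as illustrative only):

    <processor>
        <name>FetchHDFS</name>
        <!-- one task per node, not per cluster -->
        <maxConcurrentTasks>1</maxConcurrentTasks>
        <schedulingStrategy>TIMER_DRIVEN</schedulingStrategy>
        <schedulingPeriod>0 sec</schedulingPeriod>
        ...
    </processor>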

I am not sure that I understand your #4 either. Are you indicating that you want to configure each node in the cluster with a different value for a processor property?

Does this help?

Thanks
-Mark

[1] http://nifi.apache.org/docs/nifi-docs/html/user-guide.html#site-to-site


> On Oct 14, 2015, at 4:49 PM, M Singh <mans2singh@yahoo.com> wrote:
> 
> Hi:
> 
> A few questions about NiFi cluster:
> 
> 1. If we have multiple worker nodes in the cluster, do they partition the work if the source allows partitioning (e.g., HDFS), or do all the nodes work on the same data?
> 2. If the nodes partition the work, then how do they coordinate the work distribution, recovery, etc.? From the documentation it appears that the workers are not aware of each other.
> 3. If I need to process multiple files, how do we design the workflow so that the nodes work on one file at a time?
> 4. If I have multiple arguments and need to pass one parameter to each worker, how can I do that?
> 5. Is there any way to control how many workers are involved in processing the flow?
> 6. Does specifying the number of threads in the processor distribute work on multiple workers? Does it split the task across the threads, or is it the responsibility of the application?
> 
> I tried to find some answers from the documentation and users list but could not get a clear picture.
> 
> Thanks
> 
> Mans

