nifi-dev mailing list archives

From Koji Kawamura <ijokaruma...@gmail.com>
Subject Re: [EXT] Re: Primary Only Content Migration
Date Fri, 08 Jun 2018 01:48:10 GMT
There is an existing JIRA submitted by Pierre.
I think its goal is the same as what Joe mentioned above.
https://issues.apache.org/jira/browse/NIFI-4026

As for hashing and routing data with affinity/correlation, I think
'Consistent Hashing' is the most popular approach to minimize the
impact of node addition/deletion.
Applying Consistent Hashing to the S2S client may not be difficult. The
challenging part is how to support a cluster topology change in the
middle of transferring data that needs correlation.
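
To illustrate the consistent hashing idea above, here is a minimal sketch in Python. It is not NiFi or S2S code: the `HashRing` class, the `vnodes` parameter, and the node names Node2 and Node3 are all hypothetical, chosen only for the example. It compares a consistent-hash ring with a plain hash-modulo-node-count scheme when NodeN joins a three-node cluster.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a stable 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical helper, not a NiFi class)."""

    def __init__(self, nodes=(), vnodes=100):
        self._vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each node is placed on the ring at several pseudo-random positions.
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def node_for(self, correlation_id: str) -> str:
        # The owner is the first ring position clockwise from the key's hash.
        idx = bisect.bisect_right(self._points, _hash(correlation_id)) % len(self._points)
        return self._owner[self._points[idx]]


keys = [f"rel-{i}" for i in range(10_000)]

ring = HashRing(["Node1", "Node2", "Node3"])
before = {k: ring.node_for(k) for k in keys}
ring.add_node("NodeN")
after = {k: ring.node_for(k) for k in keys}

# Keys whose owner changed on the ring; each of them now belongs to NodeN.
moved = [k for k in keys if before[k] != after[k]]

# For comparison: one bucket per node, chosen by hash modulo the node count.
modulo_moved = sum(1 for k in keys if _hash(k) % 3 != _hash(k) % 4)

print(f"ring: {len(moved) / len(keys):.0%} of keys moved")
print(f"modulo: {modulo_moved / len(keys):.0%} of keys moved")
```

On the ring, only the keys taken over by the new node change owner, while the modulo scheme reassigns most keys. Either way, correlation groups that are in flight during the change can still be split across nodes.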

A simple but challenging scenario:
Let's say there is a group of 4 FlowFiles that share the correlation id 'rel-A'
1. Client sends rel-A, data-1of4 to Node1
2. Client sends rel-A, data-2of4 to Node1
3. NodeN is added and takes over part of the hash key space that was
assigned to Node1
4. Client sends rel-A, data-3of4 to NodeN
5. Client sends rel-A, data-4of4 to NodeN

Then a Merge processor running on Node1 and NodeN cannot complete on
either node, because neither node has the whole dataset to merge.
This situation can be handled manually if we document it well.
Alternatively, we could add a resend loop, so that:

6. Client on Node1 resends rel-A, data-1of4 to NodeN
7. Client on Node1 resends rel-A, data-2of4 to NodeN
8. Merge processor on NodeN merges the FlowFiles.
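
The scenario and the resend loop above can be sketched as a small simulation. This is only an illustration under assumed names: `route`, `mergeable`, and `resend` are hypothetical helpers, not NiFi APIs, and real FlowFile transfer is reduced to moving fragment numbers between sets.

```python
from collections import defaultdict

TOTAL = 4  # fragments per correlation group, as in the rel-A example


def route(owner_at_send):
    """Deliver fragments 1..TOTAL to whichever node owned the key when each was sent."""
    held = defaultdict(set)
    for index, node in enumerate(owner_at_send, start=1):
        held[node].add(index)
    return held


def mergeable(held):
    """Return the nodes that hold every fragment and could complete a merge."""
    return [node for node, frags in held.items() if len(frags) == TOTAL]


def resend(held, current_owner):
    """Resend loop: every other node forwards its fragments to the current owner."""
    for node in list(held):
        if node != current_owner:
            held[current_owner] |= held.pop(node)
    return held


# Ownership of 'rel-A' moves from Node1 to NodeN after two fragments were sent.
held = route(["Node1", "Node1", "NodeN", "NodeN"])
stuck = mergeable(held)       # no node has all four fragments yet
held = resend(held, "NodeN")  # Node1 forwards data-1of4 and data-2of4
done = mergeable(held)        # NodeN now holds the full set

print("mergeable before resend:", stuck)
print("mergeable after resend:", done)
```

The sketch assumes each node can learn the current owner of a correlation id, which is exactly the part that would need design work.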

I'm interested in working on this improvement, too.

Thanks,
Koji


On Fri, Jun 8, 2018 at 8:19 AM, Joe Witt <joe.witt@gmail.com> wrote:
> Peter
>
> I'm not sure there is a good way for a processor to drive such a thing
> with existing infrastructure.  The processor having ability to know
> about the structure of a cluster is not something we have wanted to
> expose for good reasons.  There would likely need to be a more
> fundamental point of support for this.
>
> I'm not sure what that design would look like just yet - but agreeing
> this is an important step to take soon.  If you want to start
> sketching out design ideas that would be awesome.
>
> Thanks
> On Thu, Jun 7, 2018 at 6:11 PM Peter Wicks (pwicks) <pwicks@micron.com> wrote:
>>
>> Joe,
>>
>> I agree it is a lot of work, which is why I was thinking of starting with a processor
>> that could do some of these operations before looking further. If the processor could move
>> FlowFiles between nodes in the cluster it would be a good step. Data comes in from a queue
>> on any node, but gets written out to a queue on only the desired node; or gets round-robin
>> output for a distribute scenario.
>>
>> I want to work on it, and was trying to figure out if it could be done using only
>> a processor, or if larger changes would be needed for sure.
>>
>> --Peter
>>
>> -----Original Message-----
>> From: Joe Witt [mailto:joe.witt@gmail.com]
>> Sent: Thursday, June 7, 2018 3:34 PM
>> To: dev@nifi.apache.org
>> Subject: Re: [EXT] Re: Primary Only Content Migration
>>
>> Peter,
>>
>> It isn't a pattern that is well supported now in a cluster context.
>>
>> What is needed are automatically load balanced connections with partitioning.  This
>> would mean a user could select a given relationship and indicate that data should be
>> automatically distributed, and they should be able to express, optionally, whether there is
>> a correlation attribute that is used for ensuring data which belongs together stays together
>> or is brought together.  We could use this to automatically have a connection result in data
>> being distributed across the cluster for load balancing purposes, and also ensure that data
>> is brought back to a single node whenever necessary, which is the case in certain scenarios
>> like fork/distribute/process/join/send and things like distributed receipt then join for
>> merging (like defragmenting data which has been split).  To join them together we need
>> affinity/correlation, and this could work based on some sort of hashing mechanism where
>> there are as many buckets as there are nodes in a cluster at a given time.  It needs a lot
>> of thought/design/testing/etc.
>>
>> I was just having a conversation about this yesterday.  It is definitely a thing
>> and will be a major effort.  Will make a JIRA for this soon.
>>
>> Thanks
>>
>> On Thu, Jun 7, 2018 at 5:21 PM, Peter Wicks (pwicks) <pwicks@micron.com> wrote:
>> > Bryan,
>> >
>> > We see this with large files that we have split up into smaller files and distributed
>> > across the cluster using site-to-site. We then want to merge them back together, so we send
>> > them to the primary node before continuing processing.
>> >
>> > --Peter
>> >
>> > -----Original Message-----
>> > From: Bryan Bende [mailto:bbende@gmail.com]
>> > Sent: Thursday, June 7, 2018 12:47 PM
>> > To: dev@nifi.apache.org
>> > Subject: [EXT] Re: Primary Only Content Migration
>> >
>> > Peter,
>> >
>> > There really shouldn't be any non-source processors scheduled for primary node
>> > only. We may even want to consider preventing that option when the processor has an incoming
>> > connection to avoid creating any confusion.
>> >
>> > As long as you set source processors to primary node only then everything should
>> > be ok... if primary node changes, the source processor starts executing on the new primary
>> > node, and any flow files it already produced on the old primary node will continue to be worked
>> > off by the downstream processors on the old node until they are all processed.
>> >
>> > -Bryan
>> >
>> >
>> >
>> > On Thu, Jun 7, 2018 at 1:55 PM, Peter Wicks (pwicks) <pwicks@micron.com> wrote:
>> >> I'm sure many of you have the same situation: a flow that runs on a cluster,
>> >> and at some point merges back down to a primary only processor; your files sit there in the
>> >> queue with nowhere to go... We've used the workaround of having a remote process group
>> >> that loops the data back to the primary node for a while, but would really like a clean/simple
>> >> solution. This approach requires that users be able to put an input port on the root flow,
>> >> and then route the file back down, which is a nuisance.
>> >>
>> >> I have been thinking of adding either a processor that moves data between
>> >> specific nodes in a cluster, or a queue (?) option that will let users migrate the content
>> >> of a flowfile back to the primary node. This would allow you to move data back to a primary
>> >> very easily without needing RPGs and input ports at the root level.
>> >>
>> >> All of my development work with NiFi has been focused on processors, so
>> >> I'm not really sure where I would start with this.  Thoughts?
>> >>
>> >> Thanks,
>> >>   Peter
