nifi-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joe Witt <joe.w...@gmail.com>
Subject Re: [EXT] Re: Primary Only Content Migration
Date Fri, 08 Jun 2018 16:32:37 GMT
Certainly there are simpler approaches to take in the mean time as
this case.  Specifically using ListenHTTP/PostHTTP with an intentional
configuration to do all merging of split items on a specific node can
do the trick.

For a more general purpose and consumable solution it is worth noting:
- relying on 'primary node' depends on the primary node being the same
throughout the transfer of all parts.  Not a safe assumption of course
and it isn't in-line with the intent of the primary node construct at
this stage.

- relying on a specific host to return to based on the original host
it came from would only work in cases where data is split on an origin
system, distributed, then merged back to a single node that is the
same host.  This doesn't help cases where data doesn't have a single
origin node and it suffers from the issue of availability of the
origin node during the merge phase.

A more general purpose solution will ensure that data can be
recombined reliable on a given node in such a way that all
'correlated' items make their way back together in time and that we
distribute such merging across the cluster (think site to site with
node affinity on some hashed correlation value).

So yep - agreed - there are 'now options' with limitations and there
is a good path which will be a lot of work but very useful and
reliable.

Thanks
Joe



On Fri, Jun 8, 2018 at 12:21 PM, Brandon DeVries <brd@jhu.edu> wrote:
> In this particular case, the easiest thing to do may be to keep track of
> the hostname of the original node of the file (before split and
> distribution).  Then, just send all of the pieces back to the originator
> node when done, and merge there.  It would be nice to have automatic queue
> balancing of some sort, but that's going to be a major effort, and not
> really necessary for Peter's issue.
>
> Brandon
>
> On Fri, Jun 8, 2018 at 10:25 AM Bryan Bende <bbende@gmail.com> wrote:
>
>> I'm also curious to know how Peter is currently handling this since he
>> mentioned already sending back to primary node, and we don't yet have
>> any of the functionality that Joe, Koji, and Pierre mentioned, so the
>> only way I can think of doing that would be to use PostHttp/InvokeHttp
>> and point directly at the URL of the primary node, although that would
>> be problematic when it changes.
>>
>> For Peter's scenario it sounds like he wants to run MergeContent in
>> defragment mode on the primary node after converging all the splits to
>> that node. In that case it may be challenging to handle the scenarios
>> like Koji mentioned... if we are half way through converging
>> everything to primary node and then primary node changes, do we
>> automatically move all of the data that was already sent to the old
>> primary over to the new primary so the defragment can continue
>> successfully? Maybe there is a special type of "merge" or "reduce"
>> component that can work across the cluster and handle this.
>>
>>
>> On Fri, Jun 8, 2018 at 4:01 AM, Pierre Villard
>> <pierre.villard.fr@gmail.com> wrote:
>> > Hi guys,
>> >
>> > Koji is right, I initially filed NIFI-4026 to cover this kind of use
>> case.
>> >
>> > There is a lot of challenges and ways to address this subject.
>> > Auto-balancing in queues will be a super nice way forward this goal.
>> >
>> > I think an easy first step would be to add a checkbox in RPG
>> configuration
>> > allowing the user to specifically send the data to the remote primary
>> node
>> > only. That's an easy information to add in S2S peer status data requested
>> > by clients. It would need proper documentation when we have nodes
>> > disconnecting/reconnecting, but I think that's the "easiest" improvement
>> to
>> > achieve your goal here.
>> >
>> > Since the changes are rather complex, I think we need to carefully think
>> > about the solution design here.
>> >
>> > Pierre
>> >
>> >
>> >
>> > 2018-06-08 3:48 GMT+02:00 Koji Kawamura <ijokarumawak@gmail.com>:
>> >
>> >> There is an existing JIRA submitted by Pierre.
>> >> I think its goal is the same with what Joe mentioned above.
>> >> https://issues.apache.org/jira/browse/NIFI-4026
>> >>
>> >> As for hashing and routing data with affinity/correlation, I think
>> >> 'Consistent Hashing' is the most popular approach to minimize the
>> >> impact of node addition/deletion.
>> >> Applying Consistent Hashing to S2S client may not be difficult. The
>> >> challenging part is how to support cluster topology change in the
>> >> middle of transferring data that needs correlation.
>> >>
>> >> A simple challenging scenario:
>> >> Let's say there is a group of 4 FlowFiles having correlation id as
>> 'rel-A'
>> >> 1. Client sends rel-A, data-1of4 to Node1
>> >> 2. Client sends rel-A, data-2of4 to Node1
>> >> 3. NodeN is added and it takes some part in hash key space that Node1
>> >> was assigned to
>> >> 4. Client sends rel-A, data-3of4 to NodeN
>> >> 5. Client sends rel-A, data-4of4 to NodeN
>> >>
>> >> Then, a Merge processor running on Node1 and NodeN can not complete
>> >> because it won't have the whole dataset to merge.
>> >> This situation can be handled manually if we document it well.
>> >> Or adding resending loop, so that:
>> >>
>> >> 5. Client on Node1 resends rel-A, data1of4 to NodeN
>> >> 6. Client on Node1 resends rel-A, data2of4 to NodeN
>> >> 7. Merge processor on NodeN merges the FlowFiles.
>> >>
>> >> I'm interested in working on this improvement, too.
>> >>
>> >> Thanks,
>> >> Koji
>> >>
>> >>
>> >> On Fri, Jun 8, 2018 at 8:19 AM, Joe Witt <joe.witt@gmail.com> wrote:
>> >> > Peter
>> >> >
>> >> > I'm not sure there is a good way for a processor to drive such a thing
>> >> > with existing infrastructure.  The processor having ability to know
>> >> > about the structure of a cluster is not something we have wanted to
>> >> > expose for good reasons.  There would likely need to be a more
>> >> > fundamental point of support for this.
>> >> >
>> >> > I'm not sure what that design would look like just yet - but agreeing
>> >> > this is an important step to take soon.  If you want to start
>> >> > sketching out design ideas that would be awesome.
>> >> >
>> >> > Thanks
>> >> > On Thu, Jun 7, 2018 at 6:11 PM Peter Wicks (pwicks) <
>> pwicks@micron.com>
>> >> wrote:
>> >> >>
>> >> >> Joe,
>> >> >>
>> >> >> I agree it is a lot of work, which is why I was thinking of starting
>> >> with a processor that could do some of these operations before looking
>> >> further. If the processor could move flowfile's between nodes in the
>> >> cluster it would be a good step. Data comes in form a queue on any node,
>> >> but gets written out to a queue on only the desired node; or gets round
>> >> robin outputted for a distribute scenario.
>> >> >>
>> >> >> I want to work on it, and was trying to figure out if it could
be
>> done
>> >> using only a processor, or if larger changes would be needed for sure.
>> >> >>
>> >> >> --Peter
>> >> >>
>> >> >> -----Original Message-----
>> >> >> From: Joe Witt [mailto:joe.witt@gmail.com]
>> >> >> Sent: Thursday, June 7, 2018 3:34 PM
>> >> >> To: dev@nifi.apache.org
>> >> >> Subject: Re: [EXT] Re: Primary Only Content Migration
>> >> >>
>> >> >> Peter,
>> >> >>
>> >> >> It isn't a pattern that is well supported now in a cluster context.
>> >> >>
>> >> >> What is needed are automatically load balanced connections with
>> >> partitioning.  This would mean a user could select a given relationship
>> and
>> >> indicate that data should automatically distributed and they should be
>> able
>> >> to express, optionally, if there is a correlation attribute that is used
>> >> for ensuring data which belongs together stays together or becomes
>> >> together.  We could use this to automatically have a connection result
>> in
>> >> data being distributed across the cluster for load balancing purposes
>> and
>> >> also ensure that data is brought back to a single node whenever
>> necessary
>> >> which is the case in certain scenarios like
>> fork/distribute/process/join/send
>> >> and things like distributed receipt then join for merging (like
>> >> defragmenting data which has been split).  To join them together we need
>> >> affinity/correlation and this could work based on some sort of hashing
>> >> mechanism where there are as many buckets as their are nodes in a
>> cluster
>> >> at a given time.  It needs a lot of thought/design/testing/etc..
>> >> >>
>> >> >> I was just having a conversation about this yesterday.  It is
>> >> definitely a thing and will be a major effort.  Will make a JIRA for
>> this
>> >> soon.
>> >> >>
>> >> >> Thanks
>> >> >>
>> >> >> On Thu, Jun 7, 2018 at 5:21 PM, Peter Wicks (pwicks) <
>> pwicks@micron.com>
>> >> wrote:
>> >> >> > Bryan,
>> >> >> >
>> >> >> > We see this with large files that we have split up into smaller
>> files
>> >> and distributed across the cluster using site-to-site. We then want to
>> >> merge them back together, so we send them to the primary node before
>> >> continuing processing.
>> >> >> >
>> >> >> > --Peter
>> >> >> >
>> >> >> > -----Original Message-----
>> >> >> > From: Bryan Bende [mailto:bbende@gmail.com]
>> >> >> > Sent: Thursday, June 7, 2018 12:47 PM
>> >> >> > To: dev@nifi.apache.org
>> >> >> > Subject: [EXT] Re: Primary Only Content Migration
>> >> >> >
>> >> >> > Peter,
>> >> >> >
>> >> >> > There really shouldn't be any non-source processors scheduled
for
>> >> primary node only. We may even want to consider preventing that option
>> when
>> >> the processor has an incoming connection to avoid creating any
>> confusion.
>> >> >> >
>> >> >> > As long as you set source processors to primary node only
then
>> >> everything should be ok... if primary node changes, the source processor
>> >> starts executing on the new primary node, and any flow files it already
>> >> produced on the old primary node will continue to be worked off by the
>> >> downstream processors on the old node until they are all processed.
>> >> >> >
>> >> >> > -Bryan
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On Thu, Jun 7, 2018 at 1:55 PM, Peter Wicks (pwicks) <
>> >> pwicks@micron.com> wrote:
>> >> >> >> I'm sure many of you have the same situation, a flow that
runs on
>> a
>> >> cluster, and at some point merges back down to a primary only processor;
>> >> your files sit there in the queue with nowhere to go... We've used the
>> work
>> >> around of having a remote processor group that loops the data back to
>> the
>> >> primary node for a while, but would really like a clean/simple solution.
>> >> This approach requires that users be able to put an input port on the
>> root
>> >> flow, and then route the file back down, which is a nuisance.
>> >> >> >>
>> >> >> >> I have been thinking of adding either a processor that
moves data
>> >> between specific nodes in a cluster, or a queue (?) option that will let
>> >> users migrate the content of a flowfile back to the master node. This
>> would
>> >> allow you to move data back to a primary very easily without needing
>> RPG's
>> >> and input ports at the root level.
>> >> >> >>
>> >> >> >> All of my development work with NiFi has been focused
on
>> processors,
>> >> so I'm not really sure where I would start with this.  Thoughts?
>> >> >> >>
>> >> >> >> Thanks,
>> >> >> >>   Peter
>> >>
>>

Mime
View raw message