nifi-users mailing list archives

From Giovanni Lanzani <giovannilanz...@godatadriven.com>
Subject RE: Keep attributes when merging
Date Tue, 29 Nov 2016 15:17:58 GMT
Hi Mark,

I was missing this bit!

Thanks a lot, correlation attribute name is indeed what I wanted!

Giovanni

> -----Original Message-----
> From: Mark Payne [mailto:markap14@hotmail.com]
> Sent: Tuesday, November 29, 2016 4:16 PM
> To: users@nifi.apache.org
> Subject: Re: Keep attributes when merging
> 
> Giovanni,
> 
> In the scenario that you laid out here, the merged FlowFile will not have a 'dt'
> attribute because there are conflicting values for the 'dt' attribute. As a result,
> the attribute is not carried through.
> 
> If it is important to you that this attribute be carried through, you can set the
> "Correlation Attribute Name" property to 'dt'. This will cause the processor to
> only bin together FlowFiles that have the same value for the 'dt' attribute. As a
> result, since there will be no conflicting values for the attribute, the merged
> FlowFile will also have this attribute.
> 
> Thanks
> -Mark
> 
> 
> 
> 
> > On Nov 29, 2016, at 9:34 AM, Giovanni Lanzani <giovannilanzani@godatadriven.com> wrote:
> >
> > Hi Joe,
> >
> > I still have trouble following you.
> >
> > Let's assume I have the MergeContent processor with the "Keep only
> > common Attributes" strategy. The flow files are coming in like so:
> >
> > ff_1 (attribute dt = 20161120)
> > ff_2 (attribute dt = 20161120)
> > ff_3 (attribute dt = 20161121)
> > ff_4 (attribute dt = 20161120)
> >
> > If my Minimum Number of Entries in MergeContent is set to 4, what dt
> > attribute will the flow file coming out of the MergeContent processor have?
> > 20161120 or 20161121?
> >
> > Or is NiFi capable of waiting to have enough flow files with each
> > unique value of dt before merging? If so, I think the docs could use
> > some help :)
> >
> > From what I could see, that dt attribute was gone after the merge, but
> > maybe I'm doing it wrong.
> >
> > Cheers,
> >
> > Giovanni
> >
> >
> >
> >> -----Original Message-----
> >> From: Joe Witt [mailto:joe.witt@gmail.com]
> >> Sent: Tuesday, November 29, 2016 3:25 PM
> >> To: users@nifi.apache.org
> >> Subject: Re: Keep attributes when merging
> >>
> >> Giovanni
> >>
> >> You can definitely do this.  The file pulling should be retaining the
> >> key path information as flow file attributes.
> >>
> >> The merge process has a property to control what happens with attributes.
> >> The default is to only copy over matching attributes and is likely
> >> what you'll want.  Take a look at "Attribute Strategy".  Now you want
> >> to retain some key values of course and that would be the parts of
> >> the timestamp you'd want to group on.  You could do this with an
> >> UpdateAttribute processor before MergeContent.  Use that to create an
> >> attribute such as "base-timestamp" or something which just pulls out the
> >> common part of the timestamp you want.
> >> In MergeContent then you can correlate on this value and since it
> >> will be the same it will also be there for you afterwards.  You can
> >> then use this when writing to HDFS.
> >>
> >> This is a pretty common use case so we can definitely help you get
> >> where you want to go with this.
> >>
> >> Thanks
> >> Joe
> >>
> >> On Tue, Nov 29, 2016 at 9:14 AM, Giovanni Lanzani
> >> <giovannilanzani@godatadriven.com> wrote:
> >>> Hi all,
> >>>
> >>> I have the following use case:
> >>>
> >>> I'm reading xml from a folder with subfolders using the following schema:
> >>>
> >>> /my_folder/20161120/many xml's inside
> >>> /my_folder/20161121/many xml's inside
> >>> /my_folder/201611.../many xml's inside
> >>>
> >>> The current pipeline involves: XML -> JSON -> Avro -> HDFS
> >>>
> >>> where the HDFS folder structure is
> >>>
> >>> /my_folder/column=20161120/many avro's inside
> >>> /my_folder/column=20161121/many avro's inside
> >>> /my_folder/column=201611.../many avro's inside
> >>>
> >>> (each column= subfolder is a Hive partition)
> >>>
> >>> In order to reduce the number of avro's in HDFS, I'd love to merge 'em all.
> >>>
> >>> However, as NiFi just reads files from the source folders without any
> >>> assumption about which folders they're taken from, even if I extract the
> >>> date from the folder name (or file), this gets lost when using
> >>> MergeContent. Using the Defragment strategy does not seem like an
> >>> option, as I don't know in advance how many files I'll see.
> >>>
> >>> That said: isn't there any way to accomplish what I want to do?
> >>>
> >>> Current strategy is to simply merge the files "manually" using
> >>> avro-tools and bash scripting.
> >>>
> >>> An alternative (although this is forcing what we want to do) is to
> >>> partition by import date. Then I'd only need to take care of the midnight
> >>> issue, for example by scheduling NiFi to fetch from the source every 10
> >>> minutes, but by doing a MergeContent every 5.
> >>>
> >>> If something isn't clear, please let me know.
> >>>
> >>> Thanks,
> >>>
> >>> Giovanni
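
Editor's note: the behaviour discussed in this thread can be illustrated with a small Python model. This is only a sketch of the semantics Mark describes (attributes with conflicting values are dropped; setting a correlation attribute bins FlowFiles by that attribute's value first), not NiFi's actual implementation. The function names `common_attributes` and `merge_bins` and the `source` attribute are hypothetical, and thresholds such as Minimum Number of Entries are ignored: every bin is simply merged.

```python
from collections import defaultdict


def common_attributes(flowfiles):
    """Hypothetical helper: keep only attributes that have the same value
    on every FlowFile in the bin (conflicting values are dropped)."""
    first, *rest = flowfiles
    return {
        key: value
        for key, value in first.items()
        if all(ff.get(key) == value for ff in rest)
    }


def merge_bins(flowfiles, correlation_attribute=None):
    """Hypothetical model of merging: group FlowFiles into bins by the value
    of the correlation attribute (one single bin if none is set), then
    compute the attributes of each merged FlowFile."""
    bins = defaultdict(list)
    for ff in flowfiles:
        key = ff.get(correlation_attribute) if correlation_attribute else None
        bins[key].append(ff)
    return [common_attributes(members) for members in bins.values()]


# The four FlowFiles from Giovanni's example; 'source' is an invented
# attribute that happens to be identical everywhere.
flowfiles = [
    {"dt": "20161120", "source": "my_folder"},
    {"dt": "20161120", "source": "my_folder"},
    {"dt": "20161121", "source": "my_folder"},
    {"dt": "20161120", "source": "my_folder"},
]

# No correlation attribute: one bin, conflicting 'dt' values, so 'dt' is lost.
merged_all = merge_bins(flowfiles)

# Correlating on 'dt': one bin per date, so each merged FlowFile keeps 'dt'.
merged_by_dt = merge_bins(flowfiles, correlation_attribute="dt")

print(merged_all)
print(merged_by_dt)
```

In this model the uncorrelated merge yields a single result without `dt`, while correlating on `dt` yields one result per date, each still carrying its `dt` value, which is what makes it usable for the HDFS partition path afterwards.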

