nifi-dev mailing list archives

From Joe Witt <joe.w...@gmail.com>
Subject Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors
Date Wed, 22 Feb 2017 17:47:07 GMT
Adam,

Some great points there.  I think what would be good to keep in mind
here is 'who' will tame these things.

For various patterns that are chosen and abstractions found and code written:
  - The developers do the taming.

For the extension registry and which processors become popular or
become unused and phase out:
 - The users/flow managers do the taming.

It is certainly the case we need to think through a robust plan which
allows both developers and users to provide the feedback and energy
necessary.  To date, we've not allowed the users to have much direct
influence here and we really don't have a strong sense of which
components are most commonly used.  One of the things I am most
excited by with the extension registry and related efforts is that it
will help us make more data driven decisions about where to focus our
energies.

Thanks
Joe

On Wed, Feb 22, 2017 at 12:43 PM, Adam Lamar <adamonduty@gmail.com> wrote:
> Hey all,
>
> I can understand Andre's perspective - when I was building the ListS3
> processor, I mostly just copied the bits that made sense from ListHDFS and
> ListFile. That worked, but it's a poor way to ensure consistency across
> List* processors.
>
> As a once-in-a-while contributor, I love the idea that community
> contributions are respected and we're not dropping them, because they solve
> real needs right now, and it isn't clear another approach would be better.
>
> And I disagree slightly with the notion that an artifact registry will
> solve the problem - I think it could make it worse, at least from a
> consistency point of view. Taming _is_ important, which is one reason
> registry communities have official/sanctioned modules. Quality and
> interoperability can vary vastly.
>
> By convention, it seems like NiFi already has a handful of well-understood
> patterns - List, Fetch, Get, Put, etc all mean something specific in
> processor terms. Is there a reason not to formalize those patterns in the
> code as well? That would help with processor consistency, and if done
> right, it may even be easier to write new processors, fix bugs, etc.
>
> For example, ListS3 initially shipped with some bad session commit()
> behavior, which was obvious once identified, but a generalized
> AbstractListProcessor (higher level than the one that already exists) could
> make it easier to avoid this class of bug.
>
> Admittedly this could be a lot of work.
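[Editor's note: Adam's generalized-abstraction idea could be sketched as a plain-Java template method. Everything below is illustrative only, a sketch under assumed names (`AbstractListing`, `listEntries`, the timestamp state), not the real NiFi `AbstractListProcessor` API: the base class owns the listing loop and the point where state is persisted, so a subclass cannot reorder the "commit" and reintroduce the ListS3 class of bug.]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not the actual NiFi API): a template-method base
// class fixes the order of operations (enumerate, emit, then persist
// state), so every concrete lister inherits correct commit behavior
// instead of re-implementing it.
abstract class AbstractListing {

    static final class Entry {
        final String name;
        final long timestamp;
        Entry(String name, long timestamp) {
            this.name = name;
            this.timestamp = timestamp;
        }
    }

    private long lastTimestamp = 0L;  // newest entry already emitted

    // Subclasses only say how to enumerate their backend.
    protected abstract List<Entry> listEntries();

    // The base class guarantees state advances only after emission.
    final List<String> run() {
        List<String> emitted = new ArrayList<>();
        long maxSeen = lastTimestamp;
        for (Entry e : listEntries()) {
            if (e.timestamp > lastTimestamp) {
                emitted.add(e.name);  // "transfer the flowfile"
                maxSeen = Math.max(maxSeen, e.timestamp);
            }
        }
        lastTimestamp = maxSeen;      // persist state last
        return emitted;
    }
}

// A concrete "ListX" then reduces to one backend-specific method.
class InMemoryListing extends AbstractListing {
    private final List<Entry> backend;
    InMemoryListing(List<Entry> backend) { this.backend = backend; }
    @Override protected List<Entry> listEntries() { return backend; }
}
```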
>
> Cheers,
> Adam
>
>
>
> On Wed, Feb 22, 2017 at 8:38 AM, Oleg Zhurakousky <
> ozhurakousky@hortonworks.com> wrote:
>
>> I’ll second Pierre
>>
>> Yes, with the current deployment model the number of processors and the
>> size of the NiFi distribution are a concern, simply because they grow with
>> each release. But that should not be the driver to start jamming more
>> functionality into existing processors which on the surface may look
>> related (even if they are).
>> Basically, a processor should never be complex from the perspective of a
>> non-technical end user, so “specialization” always takes precedence here,
>> since it limits “configuration” and thus makes such a processor simpler.
>> It also helps the developer with maintenance and management of the
>> processor. Also, having multiple related processors will promote healthy
>> competition, where my MyPutHDFS may for certain cases be better/faster
>> than YourPutHDFS, and why not have both?
>>
>> The “artifact registry” (flow, extension, template etc) is the only answer
>> here since it will remove the “proliferation” and the need for “taming”
>> anything from the picture. With “artifact registry” one or one million
>> processors, the NiFi size/state will always remain constant and small.
>>
>> Cheers
>> Oleg
>> > On Feb 22, 2017, at 6:05 AM, Pierre Villard <pierre.villard.fr@gmail.com>
>> > wrote:
>> >
>> > Hey guys,
>> >
>> > Thanks for the thread Andre.
>> >
>> > +1 to James' answer.
>> >
>> > I understand the interest of a single processor that would connect to
>> > all the back ends... and we could document/improve the PutHDFS to ease
>> > such use, but I really don't think it would benefit the user
>> > experience. That may be interesting in some cases for some users, but I
>> > don't think that would be a majority.
>> >
>> > I believe NiFi is great for one reason: you have a lot of specialized
>> > processors that are really easy to use and efficient for what they've
>> > been designed for.
>> >
>> > Let's ask ourselves the question the other way: with the NiFi Registry
>> > on its way, what is the problem with having multiple processors for
>> > each back end? I don't really see the issue here. OK, we have a lot of
>> > processors (but I believe this is a good point for NiFi, for user
>> > experience, for advertising, etc. - maybe we should improve the
>> > processor listing though, but again, this will be part of the NiFi
>> > Registry work), and it generates a heavy NiFi binary (but that will be
>> > solved with the registry), but that's all, no?
>> >
>> > Also agree on the positioning aspect: IMO NiFi should not be highly
>> > tied to the Hadoop ecosystem. There are a lot of users using NiFi with
>> > absolutely no relation to Hadoop. Not sure that would send the right
>> > "signal".
>> >
>> > Pierre
>> >
>> >
>> >
>> >
>> > 2017-02-22 6:50 GMT+01:00 Andre <andre-lists@fucs.org>:
>> >
>> >> Andrew,
>> >>
>> >>
>> >> On Wed, Feb 22, 2017 at 11:21 AM, Andrew Grande <aperepel@gmail.com>
>> >> wrote:
>> >>
>> >>> I am observing one assumption in this thread. For some reason we are
>> >>> implying all these will be Hadoop-compatible file systems. They don't
>> >>> always have an HDFS plugin, nor should they as a mandatory requirement.
>> >>>
>> >>
>> >> You are partially correct.
>> >>
>> >> There is a direct assumption in the availability of an HCFS (thanks
>> >> Matt!) implementation.
>> >>
>> >> This is the case with:
>> >>
>> >> * Windows Azure Blob Storage
>> >> * Google Cloud Storage Connector
>> >> * MapR FileSystem (currently done via NAR recompilation / mvn profile)
>> >> * Alluxio
>> >> * Isilon (via HDFS)
>> >> * others
>> >>
>> >> But I wouldn't say this will apply to every other storage system, and
>> >> in certain cases it may not even be necessary (e.g. Isilon scale-out
>> >> storage may be reached using its native HDFS-compatible interfaces).
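[Editor's note: the HCFS mechanism Andre refers to works by registering a FileSystem implementation against a URI scheme in core-site.xml, so the existing HDFS processors can address another store unchanged. A typical fragment for the Google Cloud Storage connector looks roughly like the sketch below; the property name and class are taken from the connector's documentation and should be verified against the version actually deployed.]

```xml
<!-- core-site.xml: map the gs:// scheme to the GCS HCFS implementation,
     so PutHDFS/ListHDFS can address gs://bucket/path directly -->
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
```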
>> >>
>> >>
>> >>> Untie completely from the Hadoop NAR. This allows for effective MiNiFi
>> >>> interaction without the weight of Hadoop libs, for example. Massive size
>> >>> savings where it matters.
>> >>>
>> >>>
>> >> Are you suggesting a use case where MiNiFi agents interact directly with
>> >> cloud storage, without relying on NiFi hubs to do that?
>> >>
>> >>
>> >>> For the deployment, it's easy enough for an admin to either rely on a
>> >>> standard tar or rpm if the NAR modules are already available in the
>> >>> distro (well, I won't talk registry till it arrives). Mounting a
>> >>> common directory on every node or distributing additional jars
>> >>> everywhere, plus configs, and then keeping it consistent across is
>> >>> something which can be avoided by simpler packaging.
>> >>>
>> >>
>> >> As long as the NAR or RPM supports your use case, which is not the case
>> >> for people running NiFi with MapR-FS, for example. For those, a
>> >> recompilation is required anyway. A flexible processor may remove the
>> >> need to recompile (I am currently playing with the classpath
>> >> implications for MapR users).
>> >>
>> >> Cheers
>> >>
>>
>>
