nifi-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Conrad Crampton <>
Subject Re: Spark or custom processor?
Date Thu, 02 Jun 2016 13:53:59 GMT
Bryan, thanks for this. I wasn’t’ aware of the nuances between UDP and TCP syslog streams.
The RPG approach certainly makes my head hurt every time I refer back to it so certainly would
simplify things – and also remove the single point of failure with processing on primary
node only (but introducing another one by way of UDP load balancer).

From: Bryan Bende <>
Reply-To: "" <>
Date: Thursday, 2 June 2016 at 14:47
To: "" <>
Subject: Re: Spark or custom processor?

In addition to what Mark said, regarding getting the logs into NiFi...

I've found that when syslog servers forward messages over TCP, they typically open a single
connection, so in this case sending to the primary node probably makes sense.

If you are forwarding over UDP, you might be able to stick a UDP load balancer in front of
of NiFi so that the logs are being routed to the ListenSyslog on each node, and then you wouldn't
need the RPG.

Just something to think about.

On Thu, Jun 2, 2016 at 9:42 AM, Mark Payne <<>>

Typically, the way that we like to think about using NiFi vs. something like Spark or Storm
is whether
the processing is Simple Event Processing or Complex Event Processing. Simple Event Processing
encapsulates those tasks where you are able to operate on a single piece of data by itself
(or in correlation
with an Enrichment Dataset). So tasks like enrichment, splitting, and transformation are squarely
the wheelhouse of NiFi.

When we talk about doing Complex Event Processing, we are generally talking about either processing
from multiple streams together (think JOIN operations) or analyzing data across time windows
(think calculating
norms, standard deviation, etc. over the last 30 minutes). The idea here is to derive a single
new "insight" from
windows of data or joined streams of data - not to transform or enrich individual pieces of
data. For this, we would
recommend something like Spark, Storm, Flink, etc.

In terms of scalability, NiFi certainly was not designed to scale outward in the way that
Spark was. With Spark you
may be scaling to thousands of nodes, but with NiFi you would get a pretty poor user experience
because each change
in the UI must be replicated to all of those nodes. That being said, NiFi does scale up very
well to take full advantage
of however much CPU and disks you have available. We typically see processing of several terabytes
of data per day
on a single node, so we have generally not needed to scale out to hundreds or thousands of

I hope this helps to clarify when/where to use each one. If there are things that are still
unclear or if you have more
questions, as always, don't hesitate to shoot another email!


On Jun 2, 2016, at 9:28 AM, Conrad Crampton <<>>

ListenSyslog (using the approach that is being discussed currently in another thread – ListenSyslog
running on primary node as RGP, all other nodes connecting to the port that the RPG exposes).
Various enrichment, routing on attributes etc. and finally into HDFS as Avro.
I want to branch off at an appropriate point in the flow and do some further realtime analysis
– got the output to port feeding to Spark process working fine (notwithstanding the issue
that you have been so kind to help with previously with the SSLContext), just thinking about
if this is most appropriate solution.

I have dabbled with a custom processor (for enriching url splitting/ enriching etc. – probably
could have done with ExecuteScript processor in hindsight) so am comfortable with going this
route if that is deemed more appropriate.


From: Bryan Bende <<>>
Reply-To: "<>" <<>>
Date: Thursday, 2 June 2016 at 13:12
To: "<>" <<>>
Subject: Re: Spark or custom processor?


I would think that you could do this all in NiFi.

How do the log files come into NiFi? TailFile, ListenUDP/ListenTCP, List+FetchFile?


On Thu, Jun 2, 2016 at 6:41 AM, Conrad Crampton <<>>
Any advice on ‘best’ architectural approach whereby some processing function has to be
applied to every flow file in a dataflow with some (possible) output based on flowfile content.
e.g. inspect log files for specific ip then send message to syslog

approach 1
Output port from NiFi -> Spark listens to that stream -> processes and outputs accordingly
Advantages – scale spark job on Yarn, decoupled (reusable) from NiFi
Disadvantages – adds complexity, decoupled from NiFi.

Approach 2
Custom processor -> PutSyslog
Advantages – reuse existing NiFi processors/ capability, obvious flow (design intent)
Disadvantages – scale??

Any comments/ advice/ experience of either approaches?


SecureData, combating cyber threats


The information contained in this message or any of its attachments may be privileged and
confidential and intended for the exclusive use of the intended recipient. If you are not
the intended recipient any disclosure, reproduction, distribution or other dissemination or
use of this communications is strictly prohibited. The views expressed in this email are those
of the individual and not necessarily of SecureData Europe Ltd. Any prices quoted are only
valid if followed up by a formal written quote.

SecureData Europe Limited. Registered in England & Wales 04365896. Registered Address:
SecureData House, Hermitage Court, Hermitage Lane, Maidstone, Kent, ME16 9NT

***This email originated outside SecureData***

Click here<!rwzg+4hlLPKdufvfzcRzpTaNxM9hG2QrA==>
to report this email as spam.

View raw message