From: Gagan Brahmi
Date: Sat, 1 Jul 2017 09:16:31 -0700
Message-ID:
Subject: Re: Kafka or Flume
To: Sidharth Kumar
Cc: daemeon
reiydelle, Mallanagouda Patil, Maggy, Sudeep Singh Thakur, JP gupta, "common-user@hadoop.apache.org"
archived-at: Sat, 01 Jul 2017 16:16:39 -0000

I'd say the data flow should be simpler, since you might need some basic
verification of the data. You may want to include NiFi in the mix, which
should do the job. It can look something like this:

For ingestion:

NiFi -> Kafka

For data verification:

Kafka -> NiFi -> HDFS/Hive/HBase

Regards,
Gagan Brahmi

On Sat, Jul 1, 2017 at 7:26 AM, Sidharth Kumar wrote:
> Thanks for your suggestions. I feel Kafka will be better, but we need
> something extra, like either Kafka with Flume or Kafka with Spark
> Streaming. Can you kindly suggest which will be better, and in which
> situation each combination will perform best?
>
> Thanks in advance for your help.
>
> Warm Regards
>
> Sidharth Kumar | Mob: +91 8197 555 599 / 7892 192 367
> | LinkedIn: www.linkedin.com/in/sidharthkumar2792
>
>
> On 30-Jun-2017 11:18 AM, "daemeon reiydelle" wrote:
>
>> For fairly simple transformations, Flume is great, and works fine
>> subscribing to some pretty high volumes of messages from Kafka
>> (I think we hit 50M/second at one point). If you need to do complex
>> transformations, e.g. database lookups for the Kafka to Hadoop ETL,
>> then you will start having complexity issues which will exceed the
>> capability of Flume.
>> There are git repos that have everything you need, which include the
>> Kafka adapter, HDFS writer, etc. A lot of this is built into Flume.
>> I assume this might be a bit off topic, so googling Flume & Kafka will
>> help you.
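[Editor's note: the per-record "database lookup" enrichment daemeon mentions is the kind of step that outgrows Flume. A minimal, hypothetical sketch of that step as a pure function — the lookup table, field names, and `enrich` helper are all invented for illustration, and the Kafka consumer/producer wiring is deliberately left out:]

```python
# Sketch of a per-event lookup enrichment (the "database lookup" case).
# A plain dict stands in for the real reference table; in production
# this would be a DB or cache query inside a Kafka consume loop.

CUSTOMER_REGION = {   # hypothetical reference data
    "c-100": "APAC",
    "c-200": "EMEA",
}

def enrich(event, lookup=CUSTOMER_REGION):
    """Return a copy of the event with a region attached; mark misses."""
    enriched = dict(event)
    enriched["region"] = lookup.get(event.get("customer_id"), "UNKNOWN")
    return enriched

if __name__ == "__main__":
    print(enrich({"customer_id": "c-100", "amount": 42}))
```

Keeping the lookup in a pure function like this makes the enrichment testable without a broker, whichever transport (Flume, NiFi, or a plain consumer) ends up around it.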
>>
>> On Thu, Jun 29, 2017 at 10:14 PM, Mallanagouda Patil <
>> mallanagouda.c.patil@gmail.com> wrote:
>>
>>> Kafka is capable of processing billions of events per second. You can
>>> scale it horizontally with Kafka broker servers.
>>>
>>> You can try out these steps:
>>>
>>> 1. Create a topic in Kafka to receive all your data. You have to use a
>>> Kafka producer to ingest data into Kafka.
>>> 2. If you are going to write your own HDFS client to put data into
>>> HDFS, you can read data from the topic in step 1, validate it, and
>>> store it into HDFS.
>>> 3. If you want to use an open-source tool (Gobblin or the Confluent
>>> Kafka HDFS connector) to put data into HDFS, write a tool that reads
>>> data from the topic, validates it, and stores it in another topic.
>>>
>>> We are using a combination of these steps to process over 10 million
>>> events/second.
>>>
>>> I hope it helps.
>>>
>>> Thanks
>>> Mallan
>>>
>>> On Jun 30, 2017 10:31 AM, "Sidharth Kumar" wrote:
>>>
>>>> Thanks! What about Kafka with Flume? I would also like to mention
>>>> that the everyday data intake is in the millions, and we can't afford
>>>> to lose even a single piece of data, which makes high availability a
>>>> must.
>>>>
>>>> Warm Regards
>>>>
>>>> Sidharth Kumar | Mob: +91 8197 555 599 / 7892 192 367
>>>> | LinkedIn: www.linkedin.com/in/sidharthkumar2792
>>>>
>>>> On 30-Jun-2017 10:04 AM, "JP gupta" wrote:
>>>>
>>>>> The ideal sequence should be:
>>>>>
>>>>> 1. Ingress using Kafka -> validation and processing using Spark
>>>>> -> write into any NoSQL DB or Hive.
>>>>>
>>>>> From my recent experience, writing directly to HDFS can be slow
>>>>> depending on the data format.
>>>>>
>>>>> Thanks
>>>>>
>>>>> JP
>>>>>
>>>>> *From:* Sudeep Singh Thakur [mailto:sudeepthakur90@gmail.com]
>>>>> *Sent:* 30 June 2017 09:26
>>>>> *To:* Sidharth Kumar
>>>>> *Cc:* Maggy; common-user@hadoop.apache.org
>>>>> *Subject:* Re: Kafka or Flume
>>>>>
>>>>> In your use case, Kafka would be better because you want some
>>>>> transformations and validations.
>>>>>
>>>>> Kind regards,
>>>>> Sudeep Singh Thakur
>>>>>
>>>>> On Jun 30, 2017 8:57 AM, "Sidharth Kumar" wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have a requirement where all transactional data is ingested into
>>>>> Hadoop in real time, and before the data is stored into Hadoop, it
>>>>> is processed to validate it. If the data fails the validation
>>>>> process, it will not be stored into Hadoop. The validation process
>>>>> also makes use of historical data which is stored in Hadoop. So, my
>>>>> question is: which ingestion tool will be best for this, Kafka or
>>>>> Flume?
>>>>>
>>>>> Any suggestions will be a great help for me.
>>>>>
>>>>> Warm Regards
>>>>>
>>>>> Sidharth Kumar | Mob: +91 8197 555 599 / 7892 192 367
>>>>> | LinkedIn: www.linkedin.com/in/sidharthkumar2792
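[Editor's note: the validate-and-route step that recurs through the thread — Mallanagouda's step 3 and Gagan's Kafka -> NiFi -> HDFS flow — can be sketched as a pure function, leaving the Kafka wiring aside. This is an illustrative sketch, not from the thread; the field names and helper names are invented, and plain lists stand in for the "raw" and "validated" topics:]

```python
# Sketch of the validate-and-route step: consume records, keep the
# ones that pass validation, reject the rest. In a real pipeline the
# two lists would be a destination topic (or HDFS sink) and a
# dead-letter topic.

REQUIRED_FIELDS = ("txn_id", "amount", "timestamp")

def is_valid(record):
    """A record passes if every required field is present and non-empty."""
    return all(record.get(f) not in (None, "") for f in REQUIRED_FIELDS)

def route(records):
    """Split a batch into (accepted, rejected)."""
    accepted, rejected = [], []
    for r in records:
        (accepted if is_valid(r) else rejected).append(r)
    return accepted, rejected

if __name__ == "__main__":
    batch = [
        {"txn_id": 1, "amount": 10.0, "timestamp": "2017-06-30T10:31:00"},
        {"txn_id": 2, "amount": None, "timestamp": "2017-06-30T10:32:00"},
    ]
    good, bad = route(batch)
    print(len(good), len(bad))  # -> 1 1
```

Because nothing is dropped silently (rejects go to their own output), this shape also fits Sidharth's "can't afford to lose a single record" requirement: failed records remain replayable from the dead-letter topic.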