From: Gagan Brahmi
Date: Sat, 1 Jul 2017 09:16:31 -0700
Message-ID:
Subject: Re: Kafka or Flume
To: Sidharth Kumar
Cc: daemeon
reiydelle, Mallanagouda Patil, Maggy, Sudeep Singh Thakur, JP gupta, "common-user@hadoop.apache.org"
archived-at: Sat, 01 Jul 2017 16:16:39 -0000

I'd say the data flow should be simpler, since you might need some basic
verification of the data. You may want to include NiFi in the mix, which
should do the job. It can look something like this:

For ingestion:

NiFi -> Kafka

For data verification:

Kafka -> NiFi -> HDFS/Hive/HBase

Regards,
Gagan Brahmi

On Sat, Jul 1, 2017 at 7:26 AM, Sidharth Kumar wrote:
> Thanks for your suggestions. I feel Kafka will be better, but we need
> something extra, like either Kafka with Flume or Kafka with Spark
> Streaming. Can you kindly suggest which will be better, and in which
> situation each combination will perform best?
>
> Thanks in advance for your help.
>
> Warm Regards
>
> Sidharth Kumar | Mob: +91 8197 555 599 / 7892 192 367
> | LinkedIn: www.linkedin.com/in/sidharthkumar2792
>
>
> On 30-Jun-2017 11:18 AM, "daemeon reiydelle" wrote:
>
>> For fairly simple transformations, Flume is great, and works fine
>> subscribing to some pretty high volumes of messages from Kafka
>> (I think we hit 50M/second at one point). If you need to do complex
>> transformations, e.g. database lookups for the Kafka to Hadoop ETL,
>> then you will start having complexity issues which will exceed the
>> capability of Flume.
>> There are git repos that have everything you need, which include the
>> Kafka adapter, HDFS writer, etc. A lot of this is built into Flume.
>> I assume this might be a bit off topic, so googling Flume & Kafka will
>> help you.
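[Editor's note: the per-record "database lookup" enrichment daemeon mentions is the kind of step that outgrows Flume. A minimal, hypothetical sketch of that step as a pure function — the lookup table, field names, and `enrich` helper are all invented for illustration, and the Kafka consumer/producer wiring is deliberately left out:]

```python
# Sketch of a per-event lookup enrichment (the "database lookup" case).
# A plain dict stands in for the real reference table; in production
# this would be a DB or cache query inside a Kafka consume loop.

CUSTOMER_REGION = {   # hypothetical reference data
    "c-100": "APAC",
    "c-200": "EMEA",
}

def enrich(event, lookup=CUSTOMER_REGION):
    """Return a copy of the event with a region attached; mark misses."""
    enriched = dict(event)
    enriched["region"] = lookup.get(event.get("customer_id"), "UNKNOWN")
    return enriched

if __name__ == "__main__":
    print(enrich({"customer_id": "c-100", "amount": 42}))
```

Keeping the lookup in a pure function like this makes the enrichment testable without a broker, whichever transport (Flume, NiFi, or a plain consumer) ends up around it.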
>>
>> On Thu, Jun 29, 2017 at 10:14 PM, Mallanagouda Patil <
>> mallanagouda.c.patil@gmail.com> wrote:
>>
>>> Kafka is capable of processing billions of events per second. You can
>>> scale it horizontally with Kafka broker servers.
>>>
>>> You can try out these steps:
>>>
>>> 1. Create a topic in Kafka to receive all your data. You have to use a
>>> Kafka producer to ingest data into Kafka.
>>> 2. If you are going to write your own HDFS client to put data into
>>> HDFS, you can read data from the topic in step 1, validate it, and
>>> store it into HDFS.
>>> 3. If you want to use an open-source tool (Gobblin or the Confluent
>>> Kafka HDFS connector) to put data into HDFS, write a tool that reads
>>> data from the topic, validates it, and stores it in another topic.
>>>
>>> We are using a combination of these steps to process over 10 million
>>> events/second.
>>>
>>> I hope it helps.
>>>
>>> Thanks
>>> Mallan
>>>
>>> On Jun 30, 2017 10:31 AM, "Sidharth Kumar" wrote:
>>>
>>>> Thanks! What about Kafka with Flume? I would also like to mention
>>>> that the everyday data intake is in the millions, and we can't afford
>>>> to lose even a single piece of data, which makes high availability a
>>>> must.
>>>>
>>>> Warm Regards
>>>>
>>>> Sidharth Kumar | Mob: +91 8197 555 599 / 7892 192 367
>>>> | LinkedIn: www.linkedin.com/in/sidharthkumar2792
>>>>
>>>> On 30-Jun-2017 10:04 AM, "JP gupta" wrote:
>>>>
>>>>> The ideal sequence should be:
>>>>>
>>>>> 1. Ingress using Kafka -> validation and processing using Spark
>>>>> -> write into any NoSQL DB or Hive.
>>>>>
>>>>> From my recent experience, writing directly to HDFS can be slow
>>>>> depending on the data format.
>>>>>
>>>>> Thanks
>>>>>
>>>>> JP
>>>>>
>>>>> *From:* Sudeep Singh Thakur [mailto:sudeepthakur90@gmail.com]
>>>>> *Sent:* 30 June 2017 09:26
>>>>> *To:* Sidharth Kumar
>>>>> *Cc:* Maggy; common-user@hadoop.apache.org
>>>>> *Subject:* Re: Kafka or Flume
>>>>>
>>>>> In your use case, Kafka would be better because you want some
>>>>> transformations and validations.
>>>>>
>>>>> Kind regards,
>>>>> Sudeep Singh Thakur
>>>>>
>>>>> On Jun 30, 2017 8:57 AM, "Sidharth Kumar" wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have a requirement where all transactional data is ingested into
>>>>> Hadoop in real time, and before the data is stored into Hadoop, it
>>>>> is processed to validate it. If the data fails the validation
>>>>> process, it will not be stored into Hadoop. The validation process
>>>>> also makes use of historical data which is stored in Hadoop. So, my
>>>>> question is: which ingestion tool will be best for this, Kafka or
>>>>> Flume?
>>>>>
>>>>> Any suggestions will be a great help for me.
>>>>>
>>>>> Warm Regards
>>>>>
>>>>> Sidharth Kumar | Mob: +91 8197 555 599 / 7892 192 367
>>>>> | LinkedIn: www.linkedin.com/in/sidharthkumar2792
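[Editor's note: the validate-and-route step that recurs through the thread — Mallanagouda's step 3 and Gagan's Kafka -> NiFi -> HDFS flow — can be sketched as a pure function, leaving the Kafka wiring aside. This is an illustrative sketch, not from the thread; the field names and helper names are invented, and plain lists stand in for the "raw" and "validated" topics:]

```python
# Sketch of the validate-and-route step: consume records, keep the
# ones that pass validation, reject the rest. In a real pipeline the
# two lists would be a destination topic (or HDFS sink) and a
# dead-letter topic.

REQUIRED_FIELDS = ("txn_id", "amount", "timestamp")

def is_valid(record):
    """A record passes if every required field is present and non-empty."""
    return all(record.get(f) not in (None, "") for f in REQUIRED_FIELDS)

def route(records):
    """Split a batch into (accepted, rejected)."""
    accepted, rejected = [], []
    for r in records:
        (accepted if is_valid(r) else rejected).append(r)
    return accepted, rejected

if __name__ == "__main__":
    batch = [
        {"txn_id": 1, "amount": 10.0, "timestamp": "2017-06-30T10:31:00"},
        {"txn_id": 2, "amount": None, "timestamp": "2017-06-30T10:32:00"},
    ]
    good, bad = route(batch)
    print(len(good), len(bad))  # -> 1 1
```

Because nothing is dropped silently (rejects go to their own output), this shape also fits Sidharth's "can't afford to lose a single record" requirement: failed records remain replayable from the dead-letter topic.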