Date: Tue, 17 Jun 2014 10:44:44 -0700
Subject: Re: Help is processing huge data through Kafka-storm cluster
From: "hsy541@gmail.com"
To: "users@kafka.apache.org"

Hi Shaikh,

I have heard that Storm has some throughput bottlenecks and does not
really scale up with Kafka. I recommend trying the DataTorrent platform
(https://www.datatorrent.com/). The platform itself is not open source,
but it has an open-source library (https://github.com/DataTorrent/Malhar)
that contains Kafka ingestion functions. The library is pretty cool: it
can scale up dynamically with Kafka partitions and is fully HA. In your
case you might be able to use the platform for free (it is free if your
application doesn't require a large amount of memory).

With the DataTorrent platform and the open-source library I can scale my
application up to 300k/s (10 nodes, 3 replicas, 1 KB messages, 0.8.0
client). I have heard that the performance of the Kafka client has been
improved in the 0.8.1 release :)

Best,
Siyuan

On Sat, Jun 14, 2014 at 8:14 AM, Shaikh Ahmed wrote:

> Hi,
>
> Every day we download 28 million messages, and monthly this goes up to
> 800+ million.
>
> We want to process this amount of data through our Kafka and Storm cluster
> and would like to store it in an HBase cluster.
>
> We are aiming to process one month of data in one day. Is that possible?
>
> We set up our cluster thinking that we could process millions of messages
> per second, as mentioned on the web.
> Unfortunately, we have ended up processing only 1200-1700 messages per
> second. If we continue at this speed, it will take a minimum of 10 days to
> process 30 days of data, which is not an acceptable solution in our case.
>
> I suspect that we have to change some configuration to achieve this goal,
> and I am looking for help from experts in achieving this task.
>
> *Kafka Cluster:*
> Kafka is running on two dedicated machines with 48 GB of RAM and 2 TB of
> storage. We have a total of 11 Kafka nodes spread across these two
> servers.
>
> *Kafka Configuration:*
> producer.type=async
> compression.codec=none
> request.required.acks=-1
> serializer.class=kafka.serializer.StringEncoder
> queue.buffering.max.ms=100000
> batch.num.messages=10000
> queue.buffering.max.messages=100000
> default.replication.factor=3
> controlled.shutdown.enable=true
> auto.leader.rebalance.enable=true
> num.network.threads=2
> num.io.threads=8
> num.partitions=4
> log.retention.hours=12
> log.segment.bytes=536870912
> log.retention.check.interval.ms=60000
> log.cleaner.enable=false
>
> *Storm Cluster:*
> Storm is running with 5 supervisors and 1 Nimbus on IBM servers with 48 GB
> of RAM and 8 TB of storage. These servers are shared with the HBase cluster.
>
> *Kafka spout configuration*
> kafkaConfig.bufferSizeBytes = 1024*1024*8;
> kafkaConfig.fetchSizeBytes = 1024*1024*4;
> kafkaConfig.forceFromStart = true;
>
> *Topology: StormTopology*
> Spout - Partition: 4
> First Bolt - parallelism hint: 6 and Num tasks: 5
> Second Bolt - parallelism hint: 5
> Third Bolt - parallelism hint: 3
> Fourth Bolt - parallelism hint: 3 and Num tasks: 4
> Fifth Bolt - parallelism hint: 3
> Sixth Bolt - parallelism hint: 3
>
> *Supervisor configuration:*
>
> storm.local.dir: "/app/storm"
> storm.zookeeper.port: 2181
> storm.cluster.mode: "distributed"
> storm.local.mode.zmq: false
> supervisor.slots.ports:
>     - 6700
>     - 6701
>     - 6702
>     - 6703
> supervisor.worker.start.timeout.secs: 180
> supervisor.worker.timeout.secs: 30
> supervisor.monitor.frequency.secs: 3
> supervisor.heartbeat.frequency.secs: 5
> supervisor.enable: true
>
> storm.messaging.netty.server_worker_threads: 2
> storm.messaging.netty.client_worker_threads: 2
> storm.messaging.netty.buffer_size: 52428800 #50MB buffer
> storm.messaging.netty.max_retries: 25
> storm.messaging.netty.max_wait_ms: 1000
> storm.messaging.netty.min_wait_ms: 100
>
>
> supervisor.childopts: "-Xmx1024m -Djava.net.preferIPv4Stack=true"
> worker.childopts: "-Xmx2048m -Djava.net.preferIPv4Stack=true"
>
>
> Please let me know if more information is needed.
>
> Thanks in advance.
>
> Regards,
> Riyaz
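
For reference, the one-month-in-one-day target from the quoted message can
be turned into a required rate with some quick arithmetic. The sketch below
is a back-of-the-envelope calculation in Python and is not part of the
original thread. It takes the 800 million messages, the one-day window, the
observed 1200-1700 messages per second, and the 4 spout partitions from the
quoted message at face value, and it assumes load is spread evenly across
partitions. The function and constant names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_rate(total_messages, window_seconds=SECONDS_PER_DAY):
    """Messages per second needed to process total_messages within the window."""
    return total_messages / window_seconds


def speedup_needed(target_rate, observed_rate):
    """Factor by which the observed rate must grow to reach the target rate."""
    return target_rate / observed_rate


# Figures taken from the quoted message; "800+ million" is treated as a lower bound.
MONTHLY_MESSAGES = 800_000_000
OBSERVED_RATES = (1200, 1700)  # messages per second
SPOUT_PARTITIONS = 4

target = required_rate(MONTHLY_MESSAGES)
per_partition = target / SPOUT_PARTITIONS  # assumption: evenly spread load

print(f"required rate: {target:,.0f} msg/s")
print(f"per partition: {per_partition:,.0f} msg/s")
for rate in OBSERVED_RATES:
    factor = speedup_needed(target, rate)
    print(f"at {rate} msg/s: need {factor:.1f}x the current throughput")
```

Under these assumptions the cluster would need to sustain roughly 9,300
messages per second overall, about 5.4 to 7.7 times the observed rate, or
about 2,300 messages per second for each of the 4 partitions.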