From: Casey Green
To: "user@hadoop.apache.org"
Subject: A more scalable Kafka to Hadoop InputFormat
Date: Thu, 30 Oct 2014 14:32:25 +0000

Hi Folks,

I’m open sourcing a scalable Kafka InputFormat. As far as I know, it is unique among the Kafka InputFormats out there in that input splits are mapped to Kafka log files rather than to entire Kafka partitions. Mapping Kafka log files to input splits scales your Map/Reduce job with the amount of data left to consume in a queue, whereas mapping input splits to entire partitions always gives you a constant number of input splits.
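
To make the difference concrete, here is a rough Python sketch of the two split strategies. It is not the actual KafkaInputFormat code: the Split type, the function names, and the offset dictionaries are all hypothetical, and it assumes the starting offset of every log file in a partition is already known.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Split:
    """One unit of work for a single map task (hypothetical type)."""
    topic: str
    partition: int
    start_offset: int  # inclusive
    end_offset: int    # exclusive


def splits_per_partition(topic: str,
                         consumed: Dict[int, int],
                         latest: Dict[int, int]) -> List[Split]:
    """One split per partition, no matter how large the backlog is."""
    return [Split(topic, p, consumed[p], latest[p]) for p in sorted(latest)]


def splits_per_log_file(topic: str,
                        consumed: Dict[int, int],
                        latest: Dict[int, int],
                        segment_starts: Dict[int, List[int]]) -> List[Split]:
    """One split per log file that still holds unconsumed messages.

    segment_starts[p] is assumed to be the sorted list of first offsets
    of the log files in partition p.
    """
    splits = []
    for p in sorted(latest):
        starts = segment_starts[p]
        # A log file ends where the next one begins; the newest one
        # ends at the partition's latest offset.
        ends = starts[1:] + [latest[p]]
        for seg_start, seg_end in zip(starts, ends):
            # Skip the part of the file that was already consumed.
            start = max(seg_start, consumed[p])
            if start < seg_end:
                splits.append(Split(topic, p, start, seg_end))
    return splits
```

With two partitions, the first function always returns two splits. The second returns one split per log file with data left to read, so the number of map tasks grows as the backlog grows.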

I wrote up a blog post about it here, and the source code for my KafkaInputFormat is on GitHub. Your questions, comments, and feedback are welcome and much appreciated!

Thanks,
Casey Green
