From: Eric Wasserman
To: Mark
CC: "user@avro.apache.org"
Subject: Re: Schema Registry?
Date: Thu, 22 Aug 2013 01:41:47 +0000

Yes we have a Kafka event consumer that creates the files in HDFS. There are other non-Hadoop consumers as well.

On Aug 21, 2013, at 2:23 PM, "Mark" wrote:

> Some final questions.
>
> Since there is no need for the schema in each Kafka event do you just output the message without the container file (file header, metadata, sync_markers)? If so, how do you get this working with the Kafka hadoop consumers? Doing it this way, does it require you to write your own consumer to write to hadoop?
>
> Thanks
>
> On Aug 20, 2013, at 11:01 AM, Eric Wasserman wrote:
>
>> You may want to check out this Avro feature request: https://issues.apache.org/jira/browse/AVRO-1124
>> which has a lot of nice motivation and usage patterns.
>> Unfortunately, it's not yet a resolved request.
>>
>> There are really two broad use cases.
>>
>> 1) The data are "small" compared to the schema (perhaps because it's a collection or stream of records encoded by different schemas)
>> 2) The data are "big" compared to the schema (very big records or lots of records that share a schema)
>>
>> Case (1) is often a candidate for a schema registry. Case (2) not as much.
>>
>> Examples from my own usage:
>>
>> For Kafka we include an MD5 digest of the writer's schema with each Message. It is serialized as a concatenation of the fixed-length MD5 and the binary Avro-encoded data. To decode we read off the MD5, look up the schema and use it to decode the remainder of the Message.
>> [You could also segregate data written with different schemas into different Kafka topics. By knowing which topic a message is under you then arrange a way to look up the writer's schema. That lets you avoid even the cost of including the MD5 in the Messages.]
>>
>> In either case consumer code needs to look up the full schema from a "registry" in order to actually decode the Avro-encoded data. The registry serves the full schema that corresponds to the specified MD5 digest.
>>
>> We use a similar technique for storing MD5-tagged Avro data in "columns" of Cassandra and so on.
>>
>> Case (2) is pretty well handled by just embedding the full schema itself.
>>
>> For example, for Hadoop you can just use Avro data files which include the actual schema in a header. All the records in the file then adhere to that same schema. In this case using a registry to get the writer's schema is not necessary.
>>
>> Note: As described in the feature request linked above, some people use a schema registry as a way of coordinating schema evolution rather than just as a way of making schema access "economical".
>>
>>
>>
>> On Aug 20, 2013, at 9:19 AM, Mark wrote:
>>
>>> Can someone break down how message serialization would work with Avro and a schema registry? We are planning to use Avro with Kafka and I've read that instead of adding a schema to every single event it would be wise to add some sort of fingerprint with each message to identify which schema it should use. What I'm having trouble understanding is, how do we read the fingerprint without a schema? Don't we need the schema to deserialize? Same question goes for working with Hadoop: how does the input format know which schema to use?
>>>
>>> Thanks
>>
>>
>
>
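
For anyone reading this thread later, the MD5-prefixed framing described in the quoted message can be sketched roughly as below. This is only an illustration under stated assumptions, not the code used in production: SchemaRegistry, frame and unframe are hypothetical names, the registry is a plain in-memory dict, the digest is taken over the schema JSON text as given (no canonicalization), and the Avro-encoded payload is treated as opaque bytes so that the sketch needs nothing beyond the Python standard library.

```python
import hashlib

MD5_LEN = 16  # an MD5 digest is always 16 bytes, so the prefix has a fixed length


class SchemaRegistry:
    """Hypothetical in-memory registry mapping MD5 digest -> schema JSON text."""

    def __init__(self):
        self._schemas = {}

    def register(self, schema_json: str) -> bytes:
        # Assumption: the digest is computed over the schema text exactly as given.
        digest = hashlib.md5(schema_json.encode("utf-8")).digest()
        self._schemas[digest] = schema_json
        return digest

    def lookup(self, digest: bytes) -> str:
        # Raises KeyError if the writer's schema was never registered.
        return self._schemas[digest]


def frame(digest: bytes, avro_payload: bytes) -> bytes:
    """Build a message: fixed-length MD5 followed by the Avro-encoded bytes."""
    if len(digest) != MD5_LEN:
        raise ValueError("expected a 16-byte MD5 digest")
    return digest + avro_payload


def unframe(message: bytes, registry: SchemaRegistry):
    """Split a message into (writer schema JSON, Avro-encoded payload)."""
    if len(message) < MD5_LEN:
        raise ValueError("message is shorter than the MD5 prefix")
    digest, payload = message[:MD5_LEN], message[MD5_LEN:]
    # No schema is needed to read the prefix: it is just the first 16 bytes.
    return registry.lookup(digest), payload
```

The point relevant to the original question is that the fingerprint sits at a fixed offset with a fixed length, so it can be read before any schema is known. The payload returned by unframe would then be handed to an Avro decoder together with the looked-up writer's schema.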