From: Daniel Schierbeck
Date: Thu, 09 Jul 2015 08:36:20 +0000
Subject: Using Avro for encoding messages
To: user@avro.apache.org

I'm working on a system that will store Avro-encoded messages in Kafka. The system will have both producers and consumers in different languages, including Ruby (not JRuby) and Java.

At the moment I'm encoding each message as a data file, which means that the full schema is included in each encoded message. This is obviously suboptimal, but it doesn't seem like there's a standardized format for single-message Avro encodings.

I've reviewed Confluent's schema-registry offering, but that seems to be overkill for my needs, and would require me to run and maintain yet another piece of infrastructure. Ideally, I wouldn't have to use anything besides Kafka.

Is this something that other people have experience with?

I've come up with a scheme that would seem to work well independently of what kind of infrastructure you're using: whenever a writer process is asked to encode a message m with schema s for the first time, it broadcasts (s', s) to a schema registry, where s' is the fingerprint of s. The schema registry in this case can be pluggable, and can be any mechanism that allows different processes to access the schemas. The writer then encodes the message as (s', m), i.e. only includes the schema fingerprint. A reader, when first encountering a message with a schema fingerprint s', looks up s from the schema registry and uses s to decode the message.
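
A minimal sketch of this flow, in Python for brevity. Everything named here is hypothetical: `InMemoryRegistry`, `Writer` and `Reader` are illustrative classes, the fingerprint is a SHA-256 over sorted-key JSON (a stand-in for a proper canonical form of the schema), and the message body m is treated as opaque bytes rather than produced by a real Avro encoder.

```python
import hashlib
import json


def fingerprint(schema: dict) -> bytes:
    """Return s', a SHA-256 digest of the schema's JSON text.

    Assumption: sorted-key compact JSON stands in for a proper
    canonical form of the schema.
    """
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).digest()


class InMemoryRegistry:
    """Hypothetical pluggable registry: any store mapping s' -> s would do."""

    def __init__(self):
        self._schemas = {}

    def publish(self, fp: bytes, schema: dict) -> None:
        self._schemas[fp] = schema

    def lookup(self, fp: bytes) -> dict:
        return self._schemas[fp]


class Writer:
    def __init__(self, registry):
        self._registry = registry
        self._published = set()

    def encode(self, schema: dict, payload: bytes):
        fp = fingerprint(schema)
        if fp not in self._published:
            # First use of this schema: broadcast (s', s).
            self._registry.publish(fp, schema)
            self._published.add(fp)
        return fp, payload  # the (s', m) pair


class Reader:
    def __init__(self, registry):
        self._registry = registry
        self._cache = {}

    def decode(self, message):
        fp, payload = message
        if fp not in self._cache:
            # First time this fingerprint is seen: fetch s from the registry.
            self._cache[fp] = self._registry.lookup(fp)
        return self._cache[fp], payload


registry = InMemoryRegistry()
schema = {"type": "record", "name": "Greeting",
          "fields": [{"name": "text", "type": "string"}]}
# The payload is opaque here; it stands in for the encoded message body.
message = Writer(registry).encode(schema, b"\x0ahello")
decoded_schema, decoded_payload = Reader(registry).decode(message)
```

The reader only touches the registry once per fingerprint, so the lookup cost is paid per schema rather than per message.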

Here, the concept of a schema registry has been abstracted away and is not tied to the concept of "schema ids" and versions. Furthermore, there are some desirable traits:

1. Schemas are identified by their fingerprints, so there's no need for an external system to issue schema ids.
2. Writing (s', s) pairs is idempotent, so there's no need to coordinate that task. If you've got a system with many writers, you can let all of them broadcast their schemas when they boot or when they need to encode data using the schemas.
3. It would work using a range of different backends for the schema registry. Simple key-value stores would obviously work, but for my case I'd probably want to use Kafka itself. If the schemas are written to a topic with key-based compaction, where s' is the message key and s is the message value, then Kafka would automatically clean up duplicates over time. This would save me from having to add more pieces to my infrastructure.
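
Traits 2 and 3 can be illustrated without a broker. The sketch below is only a simulation under stated assumptions: `schema_topic` is a plain Python list standing in for a compacted Kafka topic, `compact` is a hypothetical helper that mimics keeping the latest value per key, and the fingerprint is again a SHA-256 over sorted-key JSON rather than a proper canonical form. No Kafka client is involved.

```python
import hashlib
import json


def fingerprint(schema: dict) -> str:
    # Assumption: sorted-key compact JSON stands in for a canonical form.
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compact(log):
    """Keep only the latest value per key, as key-based compaction would."""
    latest = {}
    for key, value in log:
        latest[key] = value
    return latest


schema = {"type": "record", "name": "Greeting",
          "fields": [{"name": "text", "type": "string"}]}

# Three writers boot and each broadcasts the same (s', s) pair
# without coordinating with the others.
schema_topic = []
for _writer in range(3):
    schema_topic.append((fingerprint(schema), schema))

# After compaction a single copy of the schema remains per fingerprint.
compacted = compact(schema_topic)
```

Because the key is derived from the value, every duplicate write carries identical content, so it does not matter which copy survives compaction.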

Has this problem been solved already? If not, would it make sense to define a common "message format" that defines the structure of (s', m) pairs?
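
One possible shape for such a format, purely as an illustration: a fixed-width fingerprint prefix followed by the message body. The layout, the choice of a 32-byte SHA-256 fingerprint, and the names `pack_message` and `unpack_message` are all assumptions made for this sketch, not an existing standard.

```python
import hashlib

# Assumed layout: [32-byte SHA-256 fingerprint s'][message body m]
FINGERPRINT_SIZE = hashlib.sha256().digest_size  # 32 bytes


def pack_message(fp: bytes, payload: bytes) -> bytes:
    """Frame an (s', m) pair as fingerprint bytes followed by the body."""
    if len(fp) != FINGERPRINT_SIZE:
        raise ValueError("fingerprint must be %d bytes" % FINGERPRINT_SIZE)
    return fp + payload


def unpack_message(data: bytes):
    """Split a framed message back into (s', m)."""
    if len(data) < FINGERPRINT_SIZE:
        raise ValueError("message shorter than the fingerprint prefix")
    return data[:FINGERPRINT_SIZE], data[FINGERPRINT_SIZE:]


schema_json = b'{"type":"string"}'
fp = hashlib.sha256(schema_json).digest()
# The body is opaque here; it stands in for the encoded message.
framed = pack_message(fp, b"\x0ahello")
unpacked_fp, unpacked_body = unpack_message(framed)
```

A fixed-width prefix keeps parsing trivial in any language, at the cost of 32 bytes of overhead per message; a shorter fingerprint would trade collision resistance for size.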

Cheers,
Daniel Schierbeck