Mailing-List: contact user-help@avro.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@avro.apache.org
MIME-Version: 1.0
From: Daniel Schierbeck <daniel.schierbeck@gmail.com>
Date: Thu, 09 Jul 2015 08:36:20 +0000
Message-ID: 
 <CAAeOB6fnzsYDut5JEO=zphoNzHeSjbvj_hNruJESPD0dREeG_A@mail.gmail.com>
Subject: Using Avro for encoding messages
To: user@avro.apache.org
Content-Type: multipart/alternative; boundary=047d7ba97904f25a6c051a6d2661

--047d7ba97904f25a6c051a6d2661
Content-Type: text/plain; charset=UTF-8

I'm working on a system that will store Avro-encoded messages in Kafka. The
system will have both producers and consumers in different languages,
including Ruby (not JRuby) and Java.

At the moment I'm encoding each message as an Avro data file (object
container file), which means that the full schema is included in each
encoded message. This is obviously suboptimal, but it doesn't seem like
there's a standardized format for single-message Avro encodings.

I've reviewed Confluent's schema-registry offering, but that seems to be
overkill for my needs, and would require me to run and maintain yet another
piece of infrastructure. Ideally, I wouldn't have to use anything besides
Kafka.

Is this something that other people have experience with?

I've come up with a scheme that would seem to work well independently of
what kind of infrastructure you're using: whenever a writer process is
asked to encode a message m with schema s for the first time, it broadcasts
(s', s) to a schema registry, where s' is the fingerprint of s. The schema
registry in this case can be pluggable, and can be any mechanism that
allows different processes to access the schemas. The writer then encodes
the message as (s', m), i.e. only includes the schema fingerprint. A
reader, when first encountering a message with a schema fingerprint s',
looks up s from the schema registry and uses s to decode the message.
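
To make the scheme concrete, here is a minimal Python sketch under some
loud assumptions. The names InMemoryRegistry, Writer, Reader and
fingerprint are hypothetical, the registry is just a dict, and the
fingerprint is SHA-256 over key-sorted JSON, which only stands in for a
real Avro schema fingerprint. The message body m is treated as opaque,
already Avro-encoded bytes.

```python
import hashlib
import json

FINGERPRINT_SIZE = 32  # SHA-256 digest length in bytes


def fingerprint(schema_json: str) -> bytes:
    """Return s', a fingerprint of schema s.

    Assumption: the JSON is normalized by re-serializing with sorted keys.
    This is a stand-in for Avro's Parsing Canonical Form, not the same thing.
    """
    normalized = json.dumps(
        json.loads(schema_json), sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(normalized.encode("utf-8")).digest()


class InMemoryRegistry:
    """Hypothetical pluggable registry: any mapping from s' to s would do."""

    def __init__(self):
        self._schemas = {}

    def publish(self, fp: bytes, schema_json: str) -> None:
        # Writing the same (s', s) pair twice leaves the registry unchanged.
        self._schemas[fp] = schema_json

    def lookup(self, fp: bytes) -> str:
        return self._schemas[fp]


class Writer:
    def __init__(self, registry):
        self._registry = registry
        self._published = set()

    def encode(self, schema_json: str, body: bytes) -> bytes:
        """Frame an already Avro-encoded body m as (s', m)."""
        fp = fingerprint(schema_json)
        if fp not in self._published:
            # First use of this schema in this process: broadcast (s', s).
            self._registry.publish(fp, schema_json)
            self._published.add(fp)
        return fp + body


class Reader:
    def __init__(self, registry):
        self._registry = registry
        self._cache = {}

    def decode(self, message: bytes):
        """Split (s', m) and resolve s, hitting the registry once per s'."""
        fp, body = message[:FINGERPRINT_SIZE], message[FINGERPRINT_SIZE:]
        if fp not in self._cache:
            self._cache[fp] = self._registry.lookup(fp)
        return self._cache[fp], body
```

A reader would then hand the returned schema and body to whatever Avro
decoder it uses. The fixed-width prefix is one possible framing; a real
format would probably also want a marker or version byte.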

Here, the concept of a schema registry has been abstracted away and is not
tied to the concept of "schema ids" and versions. Furthermore, there are
some desirable traits:

1. Schemas are identified by their fingerprints, so there's no need for an
external system to issue schema ids.
2. Writing (s', s) pairs is idempotent, so there's no need to coordinate
that task. If you've got a system with many writers, you can let all of
them broadcast their schemas when they boot or when they need to encode
data using the schemas.
3. It would work using a range of different backends for the schema
registry. Simple key-value stores would obviously work, but for my case I'd
probably want to use Kafka itself. If the schemas are written to a topic
with key-based compaction, where s' is the message key and s is the message
value, then Kafka would automatically clean up duplicates over time. This
would save me from having to add more pieces to my infrastructure.
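
Points 2 and 3 can be illustrated without a broker. The sketch below
simulates a keyed log in plain Python; it is not Kafka client code, and
publish, compact and materialize are hypothetical helpers. For brevity it
hashes the raw schema text with SHA-256 as the key, which is an
assumption rather than a proper Avro fingerprint.

```python
import hashlib


def publish(log, schema_json):
    """Append an (s', s) record; safe to call from any writer, any number of times."""
    key = hashlib.sha256(schema_json.encode("utf-8")).hexdigest()
    log.append((key, schema_json))


def compact(log):
    """Keep only the latest record per key, in log order.

    This mimics the end state of key-based compaction on a topic.
    """
    latest = {}
    for offset, (key, _value) in enumerate(log):
        latest[key] = offset
    return [rec for offset, rec in enumerate(log) if latest[rec[0]] == offset]


def materialize(log):
    """Build the s' -> s lookup table a reader would hold in memory."""
    return dict(log)


# Three writers boot and all broadcast the same schema, plus one other schema.
log = []
user_schema = '{"type": "string"}'
count_schema = '{"type": "long"}'
for schema in (user_schema, count_schema, user_schema, user_schema):
    publish(log, schema)

compacted = compact(log)
```

Here the lookup table built from the compacted log is identical to the
one built from the full log, which is why uncoordinated duplicate writes
are harmless: they cost some space until compaction runs and nothing more.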

Has this problem been solved already? If not, would it make sense to define
a common "message format" that defines the structure of (s', m) pairs?

Cheers,
Daniel Schierbeck

--047d7ba97904f25a6c051a6d2661--