Date: Tue, 21 Oct 2014 04:27:38 +0000 (UTC)
From: "Lewis John McGibbney (JIRA)"
To: dev@avro.apache.org
Subject: [jira] [Commented] (AVRO-1124) RESTful service for holding schemas

[ https://issues.apache.org/jira/browse/AVRO-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177927#comment-14177927 ]

Lewis John McGibbney commented on AVRO-1124:
--------------------------------------------

Hi [~felixgv], thanks for the clarification. The patch is a beast, so I missed the key-value config file {code}lang/java/schema-repo/bundle/config/config.properties{code}

Thanks for the corrections here. I must say, however, that my suggestion still stands :) if anyone is interested in implementing a Gora backend for this, which would provide access to a number of underlying storage options, then please get in touch. There could potentially be much less investment in actual code writing if this were done in Gora instead of writing an HBase backend, then a ZooKeeper one, etc. Thank you very much for the context [~felixgv].

> RESTful service for holding schemas
> -----------------------------------
>
>                 Key: AVRO-1124
>                 URL: https://issues.apache.org/jira/browse/AVRO-1124
>             Project: Avro
>          Issue Type: New Feature
>            Reporter: Jay Kreps
>            Assignee: Jay Kreps
>         Attachments: AVRO-1124-can-read-with.patch, AVRO-1124-draft.patch, AVRO-1124-validators-preliminary.patch, AVRO-1124.2.patch, AVRO-1124.3.patch, AVRO-1124.4.patch, AVRO-1124.patch, AVRO-1124.patch
>
>
> Motivation: It is nice to be able to pass around data in serialized form but still know the exact schema that was used to serialize it. The overhead of storing the schema with each record is too high unless the individual records are very large. There are workarounds for some common cases: in the case of files a schema can be stored once with a file of many records, amortizing the per-record cost, and in the case of RPC the schema can be negotiated ahead of time and used for many requests. For other uses, though, it is nice to be able to pass a reference to a given schema using a small id and allow this to be looked up. Since only a small number of schemas are likely to be active for a given data source, these can easily be cached, so the number of remote lookups is very small (one per active schema version).
> Basically this would consist of two things:
> 1. A simple REST service that stores and retrieves schemas
> 2. Some helper java code for fetching and caching schemas for people using the registry
> We have used something like this at LinkedIn for a few years now, and it would be nice to standardize this facility to be able to build up common tooling around it. This proposal will be based on what we have, but we can change it as ideas come up.
> The facilities this provides are super simple: basically you can register a schema, which gives back a unique id for it, or you can query for a schema. There is almost no code, and nothing very complex. The contract is that before emitting/storing a record you must first publish its schema to the registry or know that it has already been published (by checking your cache of published schemas). When reading, you check your cache, and if you don't find the id/schema pair there you query the registry to look it up. I will explain some of the nuances in more detail below.
> An added benefit of such a repository is that it makes a few other things possible:
> 1. A graphical browser of the various data types that are currently used and all their previous forms.
> 2. Automatic enforcement of compatibility rules. Data is always compatible in the sense that the reader will always deserialize it (since they are using the same schema as the writer), but this does not mean it is compatible with the expectations of the reader. For example, if an int field is changed to a string, that will almost certainly break anyone relying on that field. This definition of compatibility can differ for different use cases and should likely be pluggable.
> Here is a description of one of our uses of this facility at LinkedIn. We use this to retain a schema with "log" data end-to-end from the producing app to various real-time consumers as well as the set of resulting Avro files in Hadoop. This schema metadata can then be used to auto-create hive tables (or add new fields to existing tables), or to infer pig fields, all without manual intervention. One important definition of compatibility that is nice to enforce is compatibility with historical data for a given "table". Log data is usually loaded in an append-only manner, so if someone changes an int field in a particular data set to be a string, tools like pig or hive that expect static columns will be unusable. Even with plain-vanilla map/reduce, processing data where columns and types change willy-nilly is painful. However, the person emitting this kind of data may not know all the details of compatible schema evolution. We use the schema repository to validate that any change made to a schema doesn't violate the compatibility model, and reject the update if it does. We do this check both at run time and as part of the ant task that generates specific record code (as an early warning).
> Some details to consider:
> Deployment
> This can just be programmed against the servlet API and deployed as a standard war. You have lots of instances and load balance traffic over them.
> Persistence
> The storage needs are not very heavy. The clients are expected to cache the id=>schema mapping, and the server can cache as well. Even after several years of heavy use we have <50k schemas, each of which is pretty small. I think this part can be made pluggable and we can provide a jdbc- and file-based implementation as these don't require outlandish dependencies. People can easily plug in their favorite key-value store thingy if they like by implementing the right plugin interface. Actual reads will virtually always be cached in memory so this is not too important.
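> For illustration only, the pluggable storage layer could be a small interface along these lines (the names here are placeholders, not anything from the attached patches):
> {code}
> // Hypothetical storage plugin interface: a jdbc-, file- or key-value-backed
> // implementation would only need to supply these operations.
> public interface SchemaStore {
>   /** Returns the id previously assigned to this schema in the group, or null. */
>   String lookupId(String group, String schemaText);
>
>   /** Returns the schema text registered under this id in the group, or null. */
>   String lookupSchema(String group, String id);
>
>   /** Stores the schema under a newly assigned id and returns that id. */
>   String store(String group, String schemaText);
>
>   /** Lists all known group names. */
>   java.util.List<String> groups();
> }
> {code}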
> Group
> In order to get the "latest" schema or handle compatibility enforcement on changes there has to be some way to group a set of schemas together and reason about the ordering of changes over these. I am going to call the grouping the "group". In our usage it is always the table or topic with which the schema is associated. For most of our usage the group name also happens to be the Record name, as all of our schemas are records and our default is to have these match. There are use cases, though, where a single schema is used for multiple topics, each of which is modeled independently. The proposal is not to enforce a particular convention but just to expose the group designator in the API. It would be possible to make the concept of group optional, but I can't come up with an example where that would be useful.
> Compatibility
> There are really different requirements for different use cases on what is considered an allowable change. Likewise it is useful to be able to extend this to have other kinds of checks (for example, in retrospect, I really wish we had required doc fields to be present so we could require documentation of fields as well as naming conventions). There can be some kind of general pluggable interface for this like
> SchemaChangeValidator.isValidChange(currentLatest, proposedNew)
> A reasonable implementation can be provided that does checks based on the rules in http://avro.apache.org/docs/current/spec.html#Schema+Resolution. By default no checks need to be done. Ideally you should be able to have more than one policy (say one treatment for database schemas, one for logging event schemas, and one which does no checks at all). I can't imagine a need for more than a handful of these, which would be statically configured (db_policy=com.mycompany.DBSchemaChangePolicy, noop=org.apache.avro.NoOpPolicy, ...). Each group can configure the policy it wants to be used going forward, with the default being none.
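> As a sketch (nothing more), that pluggable interface and its trivial default might look like this; a resolution-based policy would implement the same interface using the Schema Resolution rules linked above:
> {code}
> import org.apache.avro.Schema;
>
> // Hypothetical pluggable compatibility check, looked up by the policy key
> // configured for the group (e.g. db_policy=com.mycompany.DBSchemaChangePolicy).
> public interface SchemaChangeValidator {
>   boolean isValidChange(Schema currentLatest, Schema proposedNew);
> }
>
> // Default policy: no checks, every change is allowed.
> class NoOpPolicy implements SchemaChangeValidator {
>   public boolean isValidChange(Schema currentLatest, Schema proposedNew) {
>     return true;
>   }
> }
> {code}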
> Security and Authentication
> There isn't any of this. The assumption is that this service is not publicly available and those accessing it are honest (though perhaps accident prone). These are just schemas, after all.
> Ids
> There are a couple of questions about how we make the ids that represent the schemas:
> 1. Are they sequential (1,2,3..) or hash based? If hash based, what is a sufficient collision probability?
> 2. Are they global or per-group? That is, if I know the id do I also need to know the group to look up the schema?
> 3. What kind of change triggers a new id? E.g. if I update a doc field does that give a new id? If not then that doc field will not be stored.
> For the id generation there are various options:
> - A sequential integer
> - AVRO-1006 creates a schema-specific 64-bit hash.
> - Our current implementation at LinkedIn uses the MD5 of the schema as the id.
> Our current implementation at LinkedIn uses the MD5 of the schema text after removing whitespace. The additional attributes like doc fields (and a few we made up) are actually important to us and we want them maintained (we add metadata fields of our own). This does mean we have some updates that generate a new schema id but don't cause a very meaningful semantic change to the schema (say because someone tweaked their doc string), but this doesn't hurt anything and it is nice to have the exact schema text represented. An example of using these metadata fields is using the schema doc fields as the hive column doc fields.
> The id is actually just a unique identifier, and the id generation algorithm can be made pluggable if there is a real trade-off. In retrospect I don't think using the md5 is good because it is 16 bytes, which for a small message is bulkier than needed. Since the id is retained with each message, size is a concern.
> The AVRO-1006 fingerprint is super cool, but I have a couple concerns (possibly just due to misunderstanding):
> 1. Seems to produce a 64-bit id. For a large number of schemas, 64 bits makes collisions unlikely but not unthinkable. Whether or not this matters depends on whether schemas are versioned per group or globally. If they are per group it may be okay, since most groups should only have a few hundred schema versions at most. If they are global I think it will be a problem. Probabilities for collision are given here, under the assumption of perfect uniformity of the hash (it may be worse, but can't be better): http://en.wikipedia.org/wiki/Birthday_attack. If we did have a collision we would be dead in the water, since our data would be unreadable. If this becomes a standard mechanism for storing schemas people will run into this problem.
> 2. Even 64 bits is a bit bulky. Since this id needs to be stored with every row, size is a concern, though a minor one.
> 3. The notion of equivalence seems to throw away many things in the schema (doc, attributes, etc). This is unfortunate. One nice thing about avro is you can add your own made-up attributes to the schema since it is just JSON. This acts as a kind of poor-man's metadata repository. It would be nice to have these maintained rather than discarded.
> It is possible that I am misunderstanding the fingerprint scheme, though, so please correct me.
> My personal preference would be to use a sequential id per group. The main reason I like this is because the id doubles as the version number, i.e. my_schema/4 is the 4th version of the my_schema record/group. Persisted data then only needs to store the varint encoding of the version number, which is generally going to be 1 byte for a few hundred schema updates. The string my_schema/4 acts as a global id for this. This does allow per-group sharding for id generation, but sharding seems unlikely to be needed here. A 50GB database would store 52 million schemas. 52 million schemas "should be enough for anyone". :-)
> Probably the easiest thing would be to just make the id generation scheme pluggable. That would kind of satisfy everyone and, as a side benefit, give us at LinkedIn a gradual migration path off our md5-based ids. In this case ids would basically be opaque url-safe strings from the point of view of the repository, and users could munge this id and encode it as they like.
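> To make the md5 option concrete, here is a rough sketch (not our actual code; it normalizes incidental whitespace by re-serializing the parsed schema rather than stripping characters from the raw text):
> {code}
> import java.math.BigInteger;
> import java.nio.charset.StandardCharsets;
> import java.security.MessageDigest;
> import org.apache.avro.Schema;
>
> public final class Md5SchemaId {
>   /** Hashes the schema text after normalizing it through the Avro parser. */
>   public static String idFor(String schemaText) throws Exception {
>     String normalized = new Schema.Parser().parse(schemaText).toString();
>     byte[] digest = MessageDigest.getInstance("MD5")
>         .digest(normalized.getBytes(StandardCharsets.UTF_8));
>     return String.format("%032x", new BigInteger(1, digest)); // 16 bytes -> 32 hex chars
>   }
> }
> {code}
> The sequential-per-group alternative is even simpler: the store just hands out the next integer for the group, and a string like my_schema/4 becomes the global name of that version.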
> APIs
> Here are the proposed APIs. This tacitly assumes ids are per-group, but the change is pretty minor if not:
> Get a schema by id
> GET /schemas/<group_name>/<id>
> If the schema exists the response code will be 200 and the response body will be the schema text.
> If it doesn't exist the response will be 404.
> GET /schemas
> Produces a list of group names, one per line.
> GET /schemas/group
> Produces a list of versions for the given group, one per line.
> GET /schemas/group/latest
> If the group exists the response code will be 200 and the response body will be the schema text of the last registered schema.
> If the group doesn't exist the response code will be 404.
> Register a schema
> POST /schemas/groups/<group_name>
> Parameters:
> schema=<schema text>
> compatibility_model=XYZ
> force_override=(true|false)
> There are a few cases:
> If the group exists and the change is incompatible with the current latest, the server response code will be 403 (forbidden) UNLESS the force_override flag is set, in which case no check will be made.
> If the server doesn't have an implementation corresponding to the given compatibility model key it will give a response code of 400.
> If the group does not exist it will be created with the given schema (and compatibility model).
> If the group exists and this schema has already been registered, the server returns response code 200 and the id already assigned to that schema.
> If the group exists, but this schema hasn't been registered, and the compatibility checks pass, then the response code will be 200 and it will store the schema and return the id of the schema.
> The force_override flag allows registering an incompatible schema. We have found that sometimes you know "for sure" that your change is okay and just want to damn the torpedoes and charge ahead. This would be intended for manual rather than programmatic usage.
> Intended Usage
> Let's assume we are implementing a put and get API as a database would have using this registry; there is no substantial difference for a messaging-style api. Here are the details of how this works:
> Say you have two methods
> void put(table, key, record)
> Record get(table, key)
> Put is expected to do the following under the covers:
> 1. Check the record's schema against a local cache of schema=>id to get the schema id
> 2. If it is not found then register it with the schema registry, get back a schema id, and add this pair to the cache
> 3. Store the serialized record bytes and schema id
> Get is expected to do the following:
> 1. Retrieve the serialized record bytes and schema id from the store
> 2. Check a local cache to see if this schema is known for this schema id
> 3. If not, fetch the schema by id from the schema registry
> 4. Deserialize the record using the schema and return it
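> For the helper/caching side, a rough sketch of what that client could look like against the endpoints above (class name, paths, request encoding, and error handling are placeholder-level choices here, not part of the attached patches):
> {code}
> import java.io.InputStream;
> import java.net.HttpURLConnection;
> import java.net.URL;
> import java.net.URLEncoder;
> import java.nio.charset.StandardCharsets;
> import java.util.Scanner;
> import java.util.concurrent.ConcurrentHashMap;
>
> public class SchemaRegistryClient {
>   private final String baseUrl; // e.g. "http://schema-repo:8080/schemas"
>   private final ConcurrentHashMap<String, String> idBySchema = new ConcurrentHashMap<String, String>();
>   private final ConcurrentHashMap<String, String> schemaById = new ConcurrentHashMap<String, String>();
>
>   public SchemaRegistryClient(String baseUrl) { this.baseUrl = baseUrl; }
>
>   /** put() path: check the schema=>id cache, registering the schema on a miss. */
>   public String register(String group, String schemaText) throws Exception {
>     String cached = idBySchema.get(group + "|" + schemaText);
>     if (cached != null) return cached;
>     HttpURLConnection c = (HttpURLConnection) new URL(baseUrl + "/groups/" + group).openConnection();
>     c.setRequestMethod("POST");
>     c.setDoOutput(true);
>     c.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
>     c.getOutputStream().write(("schema=" + URLEncoder.encode(schemaText, "UTF-8")).getBytes(StandardCharsets.UTF_8));
>     String id = read(c.getInputStream()).trim();
>     idBySchema.put(group + "|" + schemaText, id);
>     schemaById.put(group + "/" + id, schemaText);
>     return id;
>   }
>
>   /** get() path: check the id=>schema cache, fetching from the registry on a miss. */
>   public String fetch(String group, String id) throws Exception {
>     String cached = schemaById.get(group + "/" + id);
>     if (cached != null) return cached;
>     String schema = read(new URL(baseUrl + "/" + group + "/" + id).openStream());
>     schemaById.put(group + "/" + id, schema);
>     return schema;
>   }
>
>   private static String read(InputStream in) {
>     Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A");
>     try {
>       return s.hasNext() ? s.next() : "";
>     } finally {
>       s.close();
>     }
>   }
> }
> {code}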
> Code Layout
> Where to put this code? Contrib package? Elsewhere? Someone should tell me...

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)