Date: Tue, 21 Oct 2014 04:27:38 +0000 (UTC)
From: "Lewis John McGibbney (JIRA)"
To: dev@avro.apache.org
Subject: [jira] [Commented] (AVRO-1124) RESTful service for holding schemas

[ https://issues.apache.org/jira/browse/AVRO-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177927#comment-14177927 ]

Lewis John McGibbney commented on AVRO-1124:
--------------------------------------------

Hi [~felixgv], thanks for the clarification. The patch is a beast, so I missed the key-value config file {code}lang/java/schema-repo/bundle/config/config.properties{code}

Thanks for the corrections here. I must say, however, that my suggestion still stands :) if anyone is interested in implementing a Gora backend for this, which would provide access to a number of underlying storage options, then please get in touch. There could potentially be much less investment in actual code writing if this were done in Gora instead of writing an HBase backend, then a ZooKeeper one, etc. Thank you very much for the context [~felixgv].

> RESTful service for holding schemas
> -----------------------------------
>
>                 Key: AVRO-1124
>                 URL: https://issues.apache.org/jira/browse/AVRO-1124
>             Project: Avro
>          Issue Type: New Feature
>            Reporter: Jay Kreps
>            Assignee: Jay Kreps
>         Attachments: AVRO-1124-can-read-with.patch, AVRO-1124-draft.patch, AVRO-1124-validators-preliminary.patch, AVRO-1124.2.patch, AVRO-1124.3.patch, AVRO-1124.4.patch, AVRO-1124.patch, AVRO-1124.patch
>
>
> Motivation: It is nice to be able to pass around data in serialized form but still know the exact schema that was used to serialize it. The overhead of storing the schema with each record is too high unless the individual records are very large. There are workarounds for some common cases: in the case of files a schema can be stored once with a file of many records, amortizing the per-record cost, and in the case of RPC the schema can be negotiated ahead of time and used for many requests. For other uses, though, it is nice to be able to pass a reference to a given schema using a small id and allow this to be looked up. Since only a small number of schemas are likely to be active for a given data source, these can easily be cached, so the number of remote lookups is very small (one per active schema version).
> Basically this would consist of two things:
> 1. A simple REST service that stores and retrieves schemas
> 2. Some helper java code for fetching and caching schemas for people using the registry
> We have used something like this at LinkedIn for a few years now, and it would be nice to standardize this facility to be able to build up common tooling around it. This proposal will be based on what we have, but we can change it as ideas come up.
> The facilities this provides are super simple: basically you can register a schema, which gives back a unique id for it, or you can query for a schema. There is almost no code, and nothing very complex. The contract is that before emitting/storing a record you must first publish its schema to the registry or know that it has already been published (by checking your cache of published schemas). When reading, you check your cache, and if you don't find the id/schema pair there you query the registry to look it up. I will explain some of the nuances in more detail below.
> An added benefit of such a repository is that it makes a few other things possible:
> 1. A graphical browser of the various data types that are currently used and all their previous forms.
> 2. Automatic enforcement of compatibility rules. Data is always compatible in the sense that the reader will always deserialize it (since they are using the same schema as the writer), but this does not mean it is compatible with the expectations of the reader. For example, if an int field is changed to a string, that will almost certainly break anyone relying on that field. This definition of compatibility can differ for different use cases and should likely be pluggable.
> Here is a description of one of our uses of this facility at LinkedIn. We use this to retain a schema with "log" data end-to-end from the producing app to various real-time consumers as well as the set of resulting Avro files in Hadoop. This schema metadata can then be used to auto-create hive tables (or add new fields to existing tables), or to infer pig fields, all without manual intervention. One important definition of compatibility that is nice to enforce is compatibility with historical data for a given "table". Log data is usually loaded in an append-only manner, so if someone changes an int field in a particular data set to be a string, tools like pig or hive that expect static columns will be unusable. Even with plain-vanilla map/reduce, processing data where columns and types change willy-nilly is painful. However, the person emitting this kind of data may not know all the details of compatible schema evolution. We use the schema repository to validate that any change made to a schema doesn't violate the compatibility model, and reject the update if it does. We do this check both at run time and as part of the ant task that generates specific record code (as an early warning).
> Some details to consider:
> Deployment
> This can just be programmed against the servlet API and deployed as a standard war. You have lots of instances and load balance traffic over them.
> Persistence
> The storage needs are not very heavy. The clients are expected to cache the id=>schema mapping, and the server can cache as well. Even after several years of heavy use we have <50k schemas, each of which is pretty small. I think this part can be made pluggable and we can provide a jdbc- and file-based implementation as these don't require outlandish dependencies. People can easily plug in their favorite key-value store thingy if they like by implementing the right plugin interface. Actual reads will virtually always be cached in memory so this is not too important.
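> For illustration only, the pluggable storage layer could be a small interface along these lines (the names here are placeholders, not anything from the attached patches):
> {code}
> // Hypothetical storage plugin interface: a jdbc-, file- or key-value-backed
> // implementation would only need to supply these operations.
> public interface SchemaStore {
>   /** Returns the id previously assigned to this schema in the group, or null. */
>   String lookupId(String group, String schemaText);
>
>   /** Returns the schema text registered under this id in the group, or null. */
>   String lookupSchema(String group, String id);
>
>   /** Stores the schema under a newly assigned id and returns that id. */
>   String store(String group, String schemaText);
>
>   /** Lists all known group names. */
>   java.util.List<String> groups();
> }
> {code}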
> Group
> In order to get the "latest" schema or handle compatibility enforcement on changes there has to be some way to group a set of schemas together and reason about the ordering of changes over these. I am going to call the grouping the "group". In our usage it is always the table or topic with which the schema is associated. For most of our usage the group name also happens to be the Record name, as all of our schemas are records and our default is to have these match. There are use cases, though, where a single schema is used for multiple topics, each of which is modeled independently. The proposal is not to enforce a particular convention but just to expose the group designator in the API. It would be possible to make the concept of group optional, but I can't come up with an example where that would be useful.
> Compatibility
> There are really different requirements for different use cases on what is considered an allowable change. Likewise it is useful to be able to extend this to have other kinds of checks (for example, in retrospect, I really wish we had required doc fields to be present so we could require documentation of fields as well as naming conventions). There can be some kind of general pluggable interface for this like
> SchemaChangeValidator.isValidChange(currentLatest, proposedNew)
> A reasonable implementation can be provided that does checks based on the rules in http://avro.apache.org/docs/current/spec.html#Schema+Resolution. By default no checks need to be done. Ideally you should be able to have more than one policy (say one treatment for database schemas, one for logging event schemas, and one which does no checks at all). I can't imagine a need for more than a handful of these, which would be statically configured (db_policy=com.mycompany.DBSchemaChangePolicy, noop=org.apache.avro.NoOpPolicy, ...). Each group can configure the policy it wants to be used going forward, with the default being none.
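> As a sketch (nothing more), that pluggable interface and its trivial default might look like this; a resolution-based policy would implement the same interface using the Schema Resolution rules linked above:
> {code}
> import org.apache.avro.Schema;
>
> // Hypothetical pluggable compatibility check, looked up by the policy key
> // configured for the group (e.g. db_policy=com.mycompany.DBSchemaChangePolicy).
> public interface SchemaChangeValidator {
>   boolean isValidChange(Schema currentLatest, Schema proposedNew);
> }
>
> // Default policy: no checks, every change is allowed.
> class NoOpPolicy implements SchemaChangeValidator {
>   public boolean isValidChange(Schema currentLatest, Schema proposedNew) {
>     return true;
>   }
> }
> {code}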
> Security and Authentication
> There isn't any of this. The assumption is that this service is not publicly available and those accessing it are honest (though perhaps accident prone). These are just schemas, after all.
> Ids
> There are a couple of questions about how we make the ids that represent the schemas:
> 1. Are they sequential (1,2,3..) or hash based? If hash based, what is a sufficient collision probability?
> 2. Are they global or per-group? That is, if I know the id do I also need to know the group to look up the schema?
> 3. What kind of change triggers a new id? E.g. if I update a doc field does that give a new id? If not then that doc field will not be stored.
> For the id generation there are various options:
> - A sequential integer
> - AVRO-1006 creates a schema-specific 64-bit hash.
> - Our current implementation at LinkedIn uses the MD5 of the schema as the id.
> Our current implementation at LinkedIn uses the MD5 of the schema text after removing whitespace. The additional attributes like doc fields (and a few we made up) are actually important to us and we want them maintained (we add metadata fields of our own). This does mean we have some updates that generate a new schema id but don't cause a very meaningful semantic change to the schema (say because someone tweaked their doc string), but this doesn't hurt anything and it is nice to have the exact schema text represented. An example of using these metadata fields is using the schema doc fields as the hive column doc fields.
> The id is actually just a unique identifier, and the id generation algorithm can be made pluggable if there is a real trade-off. In retrospect I don't think using the md5 is good because it is 16 bytes, which for a small message is bulkier than needed. Since the id is retained with each message, size is a concern.
> The AVRO-1006 fingerprint is super cool, but I have a couple concerns (possibly just due to misunderstanding):
> 1. Seems to produce a 64-bit id. For a large number of schemas, 64 bits makes collisions unlikely but not unthinkable. Whether or not this matters depends on whether schemas are versioned per group or globally. If they are per group it may be okay, since most groups should only have a few hundred schema versions at most. If they are global I think it will be a problem. Probabilities for collision are given here, under the assumption of perfect uniformity of the hash (it may be worse, but can't be better): http://en.wikipedia.org/wiki/Birthday_attack. If we did have a collision we would be dead in the water, since our data would be unreadable. If this becomes a standard mechanism for storing schemas people will run into this problem.
> 2. Even 64 bits is a bit bulky. Since this id needs to be stored with every row, size is a concern, though a minor one.
> 3. The notion of equivalence seems to throw away many things in the schema (doc, attributes, etc). This is unfortunate. One nice thing about avro is you can add your own made-up attributes to the schema since it is just JSON. This acts as a kind of poor-man's metadata repository. It would be nice to have these maintained rather than discarded.
> It is possible that I am misunderstanding the fingerprint scheme, though, so please correct me.
> My personal preference would be to use a sequential id per group. The main reason I like this is because the id doubles as the version number, i.e. my_schema/4 is the 4th version of the my_schema record/group. Persisted data then only needs to store the varint encoding of the version number, which is generally going to be 1 byte for a few hundred schema updates. The string my_schema/4 acts as a global id for this. This does allow per-group sharding for id generation, but sharding seems unlikely to be needed here. A 50GB database would store 52 million schemas. 52 million schemas "should be enough for anyone". :-)
> Probably the easiest thing would be to just make the id generation scheme pluggable. That would kind of satisfy everyone and, as a side benefit, give us at LinkedIn a gradual migration path off our md5-based ids. In this case ids would basically be opaque url-safe strings from the point of view of the repository, and users could munge this id and encode it as they like.
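> To make the md5 option concrete, here is a rough sketch (not our actual code; it normalizes incidental whitespace by re-serializing the parsed schema rather than stripping characters from the raw text):
> {code}
> import java.math.BigInteger;
> import java.nio.charset.StandardCharsets;
> import java.security.MessageDigest;
> import org.apache.avro.Schema;
>
> public final class Md5SchemaId {
>   /** Hashes the schema text after normalizing it through the Avro parser. */
>   public static String idFor(String schemaText) throws Exception {
>     String normalized = new Schema.Parser().parse(schemaText).toString();
>     byte[] digest = MessageDigest.getInstance("MD5")
>         .digest(normalized.getBytes(StandardCharsets.UTF_8));
>     return String.format("%032x", new BigInteger(1, digest)); // 16 bytes -> 32 hex chars
>   }
> }
> {code}
> The sequential-per-group alternative is even simpler: the store just hands out the next integer for the group, and a string like my_schema/4 becomes the global name of that version.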
> APIs
> Here are the proposed APIs. This tacitly assumes ids are per-group, but the change is pretty minor if not:
> Get a schema by id
> GET /schemas/<group_name>/<id>
> If the schema exists the response code will be 200 and the response body will be the schema text.
> If it doesn't exist the response will be 404.
> GET /schemas
> Produces a list of group names, one per line.
> GET /schemas/group
> Produces a list of versions for the given group, one per line.
> GET /schemas/group/latest
> If the group exists the response code will be 200 and the response body will be the schema text of the last registered schema.
> If the group doesn't exist the response code will be 404.
> Register a schema
> POST /schemas/groups/<group_name>
> Parameters:
> schema=<schema text>
> compatibility_model=XYZ
> force_override=(true|false)
> There are a few cases:
> If the group exists and the change is incompatible with the current latest, the server response code will be 403 (forbidden) UNLESS the force_override flag is set, in which case no check will be made.
> If the server doesn't have an implementation corresponding to the given compatibility model key it will give a response code of 400.
> If the group does not exist it will be created with the given schema (and compatibility model).
> If the group exists and this schema has already been registered, the server returns response code 200 and the id already assigned to that schema.
> If the group exists, but this schema hasn't been registered, and the compatibility checks pass, then the response code will be 200 and it will store the schema and return the id of the schema.
> The force_override flag allows registering an incompatible schema. We have found that sometimes you know "for sure" that your change is okay and just want to damn the torpedoes and charge ahead. This would be intended for manual rather than programmatic usage.
> Intended Usage
> Let's assume we are implementing a put and get API as a database would have using this registry; there is no substantial difference for a messaging-style api. Here are the details of how this works:
> Say you have two methods
> void put(table, key, record)
> Record get(table, key)
> Put is expected to do the following under the covers:
> 1. Check the record's schema against a local cache of schema=>id to get the schema id
> 2. If it is not found then register it with the schema registry, get back a schema id, and add this pair to the cache
> 3. Store the serialized record bytes and schema id
> Get is expected to do the following:
> 1. Retrieve the serialized record bytes and schema id from the store
> 2. Check a local cache to see if this schema is known for this schema id
> 3. If not, fetch the schema by id from the schema registry
> 4. Deserialize the record using the schema and return it
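> For the helper/caching side, a rough sketch of what that client could look like against the endpoints above (class name, paths, request encoding, and error handling are placeholder-level choices here, not part of the attached patches):
> {code}
> import java.io.InputStream;
> import java.net.HttpURLConnection;
> import java.net.URL;
> import java.net.URLEncoder;
> import java.nio.charset.StandardCharsets;
> import java.util.Scanner;
> import java.util.concurrent.ConcurrentHashMap;
>
> public class SchemaRegistryClient {
>   private final String baseUrl; // e.g. "http://schema-repo:8080/schemas"
>   private final ConcurrentHashMap<String, String> idBySchema = new ConcurrentHashMap<String, String>();
>   private final ConcurrentHashMap<String, String> schemaById = new ConcurrentHashMap<String, String>();
>
>   public SchemaRegistryClient(String baseUrl) { this.baseUrl = baseUrl; }
>
>   /** put() path: check the schema=>id cache, registering the schema on a miss. */
>   public String register(String group, String schemaText) throws Exception {
>     String cached = idBySchema.get(group + "|" + schemaText);
>     if (cached != null) return cached;
>     HttpURLConnection c = (HttpURLConnection) new URL(baseUrl + "/groups/" + group).openConnection();
>     c.setRequestMethod("POST");
>     c.setDoOutput(true);
>     c.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
>     c.getOutputStream().write(("schema=" + URLEncoder.encode(schemaText, "UTF-8")).getBytes(StandardCharsets.UTF_8));
>     String id = read(c.getInputStream()).trim();
>     idBySchema.put(group + "|" + schemaText, id);
>     schemaById.put(group + "/" + id, schemaText);
>     return id;
>   }
>
>   /** get() path: check the id=>schema cache, fetching from the registry on a miss. */
>   public String fetch(String group, String id) throws Exception {
>     String cached = schemaById.get(group + "/" + id);
>     if (cached != null) return cached;
>     String schema = read(new URL(baseUrl + "/" + group + "/" + id).openStream());
>     schemaById.put(group + "/" + id, schema);
>     return schema;
>   }
>
>   private static String read(InputStream in) {
>     Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A");
>     try {
>       return s.hasNext() ? s.next() : "";
>     } finally {
>       s.close();
>     }
>   }
> }
> {code}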
> Code Layout
> Where to put this code? Contrib package? Elsewhere? Someone should tell me...

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)