avro-dev mailing list archives

From "Thiruvalluvan M. G. (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-1006) Fingerprints for Avro Schemas
Date Wed, 08 Feb 2012 03:19:09 GMT

    [ https://issues.apache.org/jira/browse/AVRO-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13203219#comment-13203219 ]

Thiruvalluvan M. G. commented on AVRO-1006:

Doug's point about JSON being an unordered format is important and limits the use of the JSON
string as the fingerprint.
Perhaps we can complete the Avro schema for schemas (AVRO-251), which can define field order
and equivalence unambiguously and which all implementations should be able to support. The output
bytes from the Avro binary serialization of the schema can then be fed to a hash algorithm.
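
To illustrate the ordering problem, here is a minimal Python sketch. The two schema strings
and the helper name text_hash are made up for this example, and SHA-256 is only a stand-in
hash, not the one proposed in this issue. The two strings parse to the same JSON object, yet
hashing the raw text gives different results.

```python
import hashlib
import json

# Two hypothetical schema texts that differ only in key order and spacing.
schema_a = '{"type": "record", "name": "User", "fields": []}'
schema_b = '{"name":"User","fields":[],"type":"record"}'

def text_hash(text: str) -> str:
    """Hash the raw JSON text (stand-in hash: SHA-256)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# A JSON parser treats both texts as the same object...
same_object = json.loads(schema_a) == json.loads(schema_b)

# ...but a hash over the raw text sees two different inputs.
same_text_hash = text_hash(schema_a) == text_hash(schema_b)

print(same_object, same_text_hash)
```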

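While representing the canonical schema as Avro data reduces ambiguity (compared to the JSON
representation), it does not eliminate it. Non-empty arrays (and maps) can be represented in Avro in
more than one way, because the binary encoding writes them as a series of blocks and the same
items can be split into blocks differently.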
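
A rough Python sketch of the array case follows. It assumes the block-based array encoding
from the Avro specification (each block is a zig-zag varint item count followed by the items,
and a zero count ends the array). The function names are made up for this example, and it
handles arrays of longs only.

```python
def encode_long(n: int) -> bytes:
    """Zig-zag, variable-length encoding of a 64-bit long."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def decode_long(buf: bytes, pos: int):
    """Return (value, next position)."""
    shift = 0
    z = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos

def encode_long_array(blocks) -> bytes:
    """Encode an array of longs, split into the given blocks."""
    out = bytearray()
    for block in blocks:
        out += encode_long(len(block))
        for item in block:
            out += encode_long(item)
    out += encode_long(0)  # a zero count terminates the array
    return bytes(out)

def decode_long_array(buf: bytes):
    items = []
    pos = 0
    while True:
        count, pos = decode_long(buf, pos)
        if count == 0:
            return items
        if count < 0:
            # A negative count is followed by the block size in bytes.
            count = -count
            _, pos = decode_long(buf, pos)
        for _ in range(count):
            value, pos = decode_long(buf, pos)
            items.append(value)

# The same logical array, written as one block and as two blocks.
one_block = encode_long_array([[1, 2, 3]])
two_blocks = encode_long_array([[1], [2, 3]])
print(one_block.hex(), two_blocks.hex())
```

Both byte strings decode to the same array, so a hash taken over the serialized bytes would
differ unless the block layout is also fixed.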

Doug's observation implies that we cannot use a third-party JSON library to generate the canonical
representation. For fingerprinting to work, we need some canonical representation (which by
definition is not ambiguous). Either we restrict (by removing ambiguities) an existing standard
or invent a new one.

I think Raymie's canonicalization rules are simple and given that we'll have only US-ASCII
characters in the canonical representation, writing a JSON generator in any language will
not be hard. And it will be parsable (with no new code) and human-readable.
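
As a sketch of how small such a generator can be, here is a Python version. It does not
reproduce Raymie's actual rules (those are in the attachment). The key order in KEY_ORDER,
the choice to drop other keys, and the use of truncated SHA-256 as the 64-bit hash are all
assumptions made for this example, as are the names canonical and fingerprint64.

```python
import hashlib

# Assumed key order, for illustration only; not taken from the attached proposal.
KEY_ORDER = ["name", "type", "fields", "symbols", "items", "values", "size"]

def canonical(node) -> str:
    """Emit whitespace-free JSON text with object keys in KEY_ORDER."""
    if isinstance(node, str):
        # Only printable US-ASCII without characters that need escaping.
        if any(ord(c) < 0x20 or ord(c) > 0x7E or c in '"\\' for c in node):
            raise ValueError("sketch only handles plain US-ASCII strings")
        return '"' + node + '"'
    if isinstance(node, int) and not isinstance(node, bool):
        return str(node)
    if isinstance(node, list):
        return "[" + ",".join(canonical(item) for item in node) + "]"
    if isinstance(node, dict):
        # Keys outside KEY_ORDER are dropped (an assumption of this sketch).
        parts = [canonical(k) + ":" + canonical(node[k])
                 for k in KEY_ORDER if k in node]
        return "{" + ",".join(parts) + "}"
    raise TypeError(f"unsupported node: {node!r}")

def fingerprint64(schema) -> int:
    """64-bit value from the canonical text (stand-in hash: truncated SHA-256)."""
    digest = hashlib.sha256(canonical(schema).encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")

# The same hypothetical schema, built with two different key orders.
s1 = {"type": "record", "name": "User",
      "fields": [{"type": "long", "name": "id"}]}
s2 = {"fields": [{"name": "id", "type": "long"}],
      "name": "User", "type": "record"}

print(canonical(s1))
print(fingerprint64(s1) == fingerprint64(s2))
```

The output is still ordinary JSON, so any existing parser can read it back.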

> Fingerprints for Avro Schemas
> -----------------------------
>                 Key: AVRO-1006
>                 URL: https://issues.apache.org/jira/browse/AVRO-1006
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Raymie Stata
>            Assignee: Raymie Stata
>              Labels: features
>         Attachments: schema-fingerprinting.html, schema-fingerprinting.html, schema-fingerprinting.html
> Add a function that returns a standardized, 64-bit fingerprint for schemas.  Fingerprints
> are designed such that the chance of a collision is very, very low.
