avro-dev mailing list archives

From Ryan Blue <rb...@netflix.com.INVALID>
Subject Re: Standardizing char and varchar logical types
Date Thu, 19 Oct 2017 20:03:13 GMT
I don't think this is necessary.

For char and varchar, the underlying storage shouldn't actually do anything
differently. For example, what should Avro do if the user writes a long
string to a VARCHAR(16) field? I think the last thing Avro should do is
drop the extra bytes, so we're forced to do nothing and store the data as
requested. The same applies on read: Avro should pass back whatever string
was written, regardless of the logical type, and the engine should truncate.
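
To make the division of labor concrete, here is a minimal sketch of the engine-side step described above, assuming the engine sees the field's schema as a plain dict. The function name `engine_read` and the sample string are hypothetical and are not part of Avro, Hive, or Spark; the storage layer is assumed to have returned the full string unchanged.

```python
import json


def engine_read(value: str, field_schema: dict) -> str:
    """Hypothetical engine-side hook, applied after Avro returns the stored string."""
    if field_schema.get("logicalType") in ("char", "varchar"):
        max_length = field_schema.get("maxLength")
        if isinstance(max_length, int):
            # Truncate by characters (code points), not by encoded bytes.
            return value[:max_length]
    # No annotation, or an unrecognized one: pass the value through untouched.
    return value


# Same annotation shape as in the quoted message, with a limit of 16.
schema = json.loads('{"type":"string","logicalType":"varchar","maxLength":16}')

# What the file holds: the full string, exactly as the writer supplied it.
stored = "a string that is longer than sixteen characters"

# What the engine exposes to the user after applying the limit itself.
visible = engine_read(stored, schema)
```

Nothing in this sketch touches the file: the annotation is only metadata that the engine chooses to interpret.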

There's also no benefit to these types. UTF-8 may have multi-byte
characters, so we can't use a fixed-length buffer for storage. CHAR and
VARCHAR are, in my opinion, antiquated database types that don't have any
value at the storage layer. I think it makes sense for Hive or Spark to
allow users to get the same behavior, but that should be implemented at the
database level, not at the file level.
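
As a quick illustration of the multi-byte point, the sketch below compares character counts with UTF-8 byte counts for a few sample strings. The samples are my own and are written with escapes so the code points are unambiguous.

```python
# Sample strings: plain ASCII, one accented Latin letter, and three CJK characters.
samples = ["abcd", "na\u00efve", "\u65e5\u672c\u8a9e"]

# Map each string to (number of characters, number of UTF-8 bytes).
lengths = {s: (len(s), len(s.encode("utf-8"))) for s in samples}

for text, (chars, utf8_bytes) in lengths.items():
    print(f"{text!r}: {chars} characters, {utf8_bytes} bytes")
```

Because the byte count per character varies, a limit expressed in characters does not translate into a fixed buffer size in bytes.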

Do you know why Hive is storing these annotations in Avro? If I remember
correctly, it is to get around passing the table's types to the read path,
which isn't a good reason to add this to the Avro spec, when the expected
behavior is to do nothing differently (which is itself probably confusing
at first glance).


On Thu, Oct 19, 2017 at 3:19 AM, Zoltan Ivanfi <zi@cloudera.com> wrote:

> Hi,
> Apparently, when saving char or varchar columns to Avro, Hive and Spark add
> non-standard logical type annotations:
> {"type":"string","logicalType":"char","maxLength":42}
> {"type":"string","logicalType":"varchar","maxLength":42}
> Considering that probably these two SQL engines are the creators of the
> majority of all Avro files written so far, I was wondering whether we
> should make these annotations official by adding them to the specification.
> Any opinions?
> Thanks,
> Zoltan

Ryan Blue
Software Engineer
