hive-issues mailing list archives

From "Sergey Shelukhin (JIRA)" <>
Subject [jira] [Commented] (HIVE-17714) move custom SerDe schema considerations into metastore from QL
Date Mon, 13 Nov 2017 20:09:00 GMT


Sergey Shelukhin commented on HIVE-17714:

Hmm... I was writing the below, when I realized something we might be missing. So if this
is resolved, the below applies, otherwise none of the above or below suggestions work as far
as I can tell.
In order to store the derived schema in metastore, wouldn't we need the SerDe jar to be present
in the first place, so that we can ask it for the schema? Otherwise if we allow users to specify both
columns and external schema, we are outsourcing even the initial correctness, which seems
I think it's reasonable to expect that if a SerDe is used, it should be available to the user
(and metastore). I don't think having extra jars is a problem... the user will anyway have
to have all the jars to actually query the table with the SerDe, right?

==== The below (without jars).

My main concern is ensuring that the schema stored in metastore stays in sync with the actual
schema provided by the SerDe. These can get out of sync from both sides. Hive columns can be added
and altered even when a SerDe that is responsible for the schema is present (I filed a
jira somewhere to block modifications like this). These modifications will be visible
to the users (because of the metastore APIs); for most SerDes, however, they won't be reflected
in the schema that Hive will actually use, which is confusing.
Some serdes also support schema in external files that we have no control over, and other
such mechanisms could exist.
Verifying schema at use time solves the problem for Hive, but not for other users of the
metastore, which is kind of the point: Hive already ignores metastore columns for these tables,
going instead to the SerDe, so the mismatch is not a problem for it.
And adding such checks in metastore would mean needing access to jars, at which point we might
as well return the correct schema.
How about this... 
1) We can remove the logic that avoids storing schema in metastore entirely, and always store
the schema, like before.
2) Metastore will try to get SerDe class on reads, and if available, will return the schema
from SerDe, or do a compat check as suggested above.
3) We could add a compat flag (like the one added for MM tables, which fails getTable/etc. calls
for them unless the client explicitly claims to support MM tables or disables compat checks)
that will break everyone trying to access such tables when the jars are absent (so the client
is required to be aware of the potential discrepancy), unless they set a config flag to disable
checks (so they know they might hit some rare issues) or actually implement the equivalent
of get-from-deserializer.
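
To make steps 1-3 concrete, here is a rough sketch of what the read path could look like. This is not Hive code: SerDeSchemaResolver, SchemaCompatException, the lookup function, and the clientAcceptsStoredSchema parameter are all hypothetical names invented for illustration, and columns are modelled as plain strings. The lookup function stands in for "load the SerDe class and ask it for the schema", returning empty when the jar is not available to the metastore.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch of the proposed read path; none of these names exist in Hive. */
final class SerDeSchemaResolver {

    /** Raised when the SerDe is unavailable and the client has not opted in to stored columns. */
    static final class SchemaCompatException extends RuntimeException {
        SchemaCompatException(String message) {
            super(message);
        }
    }

    // Stand-in for loading the SerDe class and asking it for its columns.
    // Returns Optional.empty() when the SerDe jar is not on the classpath.
    private final Function<String, Optional<List<String>>> serdeSchemaLookup;

    SerDeSchemaResolver(Function<String, Optional<List<String>>> serdeSchemaLookup) {
        this.serdeSchemaLookup = serdeSchemaLookup;
    }

    /**
     * storedColumns is always populated (step 1). If the SerDe can be loaded,
     * its schema wins (step 2). Otherwise the call fails unless the client has
     * explicitly accepted possibly stale stored columns (step 3).
     */
    List<String> resolveColumns(String serdeClassName,
                                List<String> storedColumns,
                                boolean clientAcceptsStoredSchema) {
        Optional<List<String>> fromSerDe = serdeSchemaLookup.apply(serdeClassName);
        if (fromSerDe.isPresent()) {
            return fromSerDe.get();
        }
        if (!clientAcceptsStoredSchema) {
            throw new SchemaCompatException(
                "SerDe " + serdeClassName + " is not available; stored columns may be out of date");
        }
        return storedColumns;
    }
}
```

A client without the jars would then either pass clientAcceptsStoredSchema=true (the equivalent of disabling the compat check) or catch the exception and do its own get-from-deserializer. The sketch returns the SerDe schema outright; the alternative in step 2, a compat check against the stored columns, would slot in at the same place.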

> move custom SerDe schema considerations into metastore from QL
> --------------------------------------------------------------
>                 Key: HIVE-17714
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Alan Gates
> Columns in metastore for tables that use external schema don't have the type information
(since HIVE-11985) and may be entirely inconsistent (since forever, due to issues like HIVE-17713;
or for SerDes that allow a URL for the schema, due to a change in the underlying file).
> Currently, if you trace the usage of ConfVars.SERDESUSINGMETASTOREFORSCHEMA through to MetaStoreUtils.getFieldsFromDeserializer,
you'd see that the code in QL handles this in Hive. So, for the most part metastore just returns
whatever is stored for columns in the database.
> One exception appears to be get_fields_with_environment_context, which is interesting...
so getTable will return incorrect columns (potentially), but get_fields/get_schema will return
correct ones from SerDe as far as I can tell.
> As part of separating the metastore, we should make sure all the APIs return the correct
schema for the columns; it's not a good idea to have everyone reimplement getFieldsFromDeserializer.
> Note: this should also remove a flag introduced in HIVE-17731

This message was sent by Atlassian JIRA
