hive-issues mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-17714) move custom SerDe schema considerations into metastore from QL
Date Tue, 14 Nov 2017 20:10:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-17714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16252091#comment-16252091
] 

Sergey Shelukhin edited comment on HIVE-17714 at 11/14/17 8:09 PM:
-------------------------------------------------------------------

Let me summarize at a high level.
Metastore not creating the schema for such SerDes is the current state after HIVE-11985.
However, that means that most metastore APIs return bogus fields for such tables (only get_schema/get_fields
return correct fields, by calling the deserializer inside metastore).
So everyone who wants to use metastore for such tables needs to know about
these shenanigans (in particular the internal Hive SerDe list from HiveConf, and the fromDeserializer
stuff). Also, metastore depends on the SerDe interface at compile time and on
SerDe jars at runtime.
We can resolve this by either:
# removing the SerDe dependency, and either
## screwing everyone who wants to read Hive tables without intricate understanding of SerDe/Hive
internals. I know for sure that it will break Presto, but I suspect it will actually break
everyone trying to use metastore at this time :) And I'm not even sure how non-Java users
can support this.
## forcing table creation and updates to externally recreate the schema for the benefit
of the readers. This is not as bad as messing with readers, because those tables are mostly
created by Hive, but it is still bad (if external users do create the tables) and also doesn't solve
the external schema case.
# keeping the SerDe dependency, and either
## recreating the schema. This is the old metastore approach from before HIVE-11985 that nobody seems
to like.
## changing get_table/etc. APIs to return the correct schema from SerDe (with in-memory caching
for most cases, based on internal config?). This IMO is the right solution.
At compile time, the main dependency that metastore would need is the "Deserializer" interface,
not individual SerDes, so it's a reasonable addition to storage-api (or a new module).
At runtime, I think it's reasonable to expect the user to deploy jars with metastore if they
want to use the table, since they'd likely need the same jars anyway to read from the table
using the SerDe (although it does present some inconvenience to non-Java readers). Also,
if jars are not available we can output an error, and we can optionally add a compat flag
for users who are aware of Hive internals and can override the jar requirement.
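
To make the last option concrete, here is a minimal Java sketch of a get_table-style lookup that asks the SerDe for columns, caches the answer in memory, raises an error when the SerDe jar is not deployed, and honors an optional compat flag. Everything in it is an assumption for illustration: SchemaProvider is a hypothetical stand-in for the "Deserializer" dependency (not Hive's real interface), and SerDeSchemaResolver, its fields, and the example SerDe name are invented. Cache invalidation is deliberately left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the "Deserializer" dependency; NOT Hive's real interface.
interface SchemaProvider {
    List<String> columns(String tableName);
}

// Sketch only: resolves table columns through the SerDe, with an in-memory cache.
final class SerDeSchemaResolver {
    // SerDe class name -> provider; a missing entry models "jar not deployed with metastore".
    private final Map<String, SchemaProvider> loadedSerDes;
    // Hypothetical compat flag: callers that understand Hive internals accept stored columns.
    private final boolean allowStoredColumns;
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();

    SerDeSchemaResolver(Map<String, SchemaProvider> loadedSerDes, boolean allowStoredColumns) {
        this.loadedSerDes = loadedSerDes;
        this.allowStoredColumns = allowStoredColumns;
    }

    List<String> resolve(String tableName, String serDeClass, List<String> storedColumns) {
        List<String> cached = cache.get(tableName);
        if (cached != null) {
            return cached; // served from memory, the SerDe is not called again
        }
        SchemaProvider provider = loadedSerDes.get(serDeClass);
        if (provider == null) {
            if (allowStoredColumns) {
                return storedColumns; // compat path: whatever is stored in the database
            }
            throw new IllegalStateException("SerDe jar not deployed with metastore: " + serDeClass);
        }
        List<String> columns = provider.columns(tableName);
        cache.put(tableName, columns);
        return columns;
    }
}

class SerDeSchemaResolverDemo {
    public static void main(String[] args) {
        Map<String, SchemaProvider> serDes = new HashMap<>();
        serDes.put("example.ExternalSchemaSerDe", table -> Arrays.asList("id:int", "name:string"));
        SerDeSchemaResolver resolver = new SerDeSchemaResolver(serDes, false);
        System.out.println(
            resolver.resolve("t1", "example.ExternalSchemaSerDe", Arrays.asList("col:unknown")));
    }
}
```

With the flag off, a table whose SerDe is not loaded fails loudly instead of silently returning stored columns, which is the behavior argued for above.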





> move custom SerDe schema considerations into metastore from QL
> --------------------------------------------------------------
>
>                 Key: HIVE-17714
>                 URL: https://issues.apache.org/jira/browse/HIVE-17714
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Alan Gates
>
> Columns in metastore for tables that use external schema don't have the type information
> (since HIVE-11985) and may be entirely inconsistent (since forever, due to issues like HIVE-17713;
> or for SerDes that allow a URL for the schema, due to a change in the underlying file).
> Currently, if you trace the usage of ConfVars.SERDESUSINGMETASTOREFORSCHEMA and MetaStoreUtils.getFieldsFromDeserializer,
> you'd see that the code in QL handles this in Hive. So, for the most part metastore just returns
> whatever is stored for columns in the database.
> One exception appears to be get_fields_with_environment_context, which is interesting:
> getTable will (potentially) return incorrect columns, but get_fields/get_schema will return
> correct ones from SerDe as far as I can tell.
> As part of separating the metastore, we should make sure all the APIs return the correct
> schema for the columns; it's not a good idea to have everyone reimplement getFieldsFromDeserializer.
> Note: this should also remove a flag introduced in HIVE-17731
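
The branching that the description attributes to QL can be sketched roughly as follows. This is not Hive's code: StoredOrSerDeColumns, its fields, and the SerDe names are hypothetical, with a plain Set standing in for the list behind ConfVars.SERDESUSINGMETASTOREFORSCHEMA and a Function standing in for MetaStoreUtils.getFieldsFromDeserializer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Sketch only: decide whether the columns stored in the database can be trusted.
final class StoredOrSerDeColumns {
    // Stand-in for the configured list of SerDes whose schema lives in metastore.
    private final Set<String> serDesUsingMetastoreForSchema;
    // Stand-in for asking the deserializer: table name -> columns.
    private final Function<String, List<String>> fromDeserializer;

    StoredOrSerDeColumns(Set<String> serDesUsingMetastoreForSchema,
                         Function<String, List<String>> fromDeserializer) {
        this.serDesUsingMetastoreForSchema = serDesUsingMetastoreForSchema;
        this.fromDeserializer = fromDeserializer;
    }

    List<String> columnsFor(String tableName, String serDeClass, List<String> storedColumns) {
        if (serDesUsingMetastoreForSchema.contains(serDeClass)) {
            return storedColumns; // metastore is the source of truth for this SerDe
        }
        return fromDeserializer.apply(tableName); // external schema: ask the SerDe
    }
}

class StoredOrSerDeColumnsDemo {
    public static void main(String[] args) {
        Set<String> trusted = new HashSet<>(Arrays.asList("example.NativeSerDe"));
        StoredOrSerDeColumns chooser =
            new StoredOrSerDeColumns(trusted, table -> Arrays.asList("id:int"));
        List<String> stored = Arrays.asList("col:from_db");
        System.out.println(chooser.columnsFor("t", "example.NativeSerDe", stored));
        System.out.println(chooser.columnsFor("t", "example.ExternalSchemaSerDe", stored));
    }
}
```

Every client that reads such tables without going through get_fields/get_schema would need an equivalent of this check, which is the duplication the issue wants to avoid.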



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
