hive-issues mailing list archives

From "Vihang Karajgaonkar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-17714) move custom SerDe schema considerations into metastore from QL
Date Sun, 12 Nov 2017 19:37:01 GMT

    [ https://issues.apache.org/jira/browse/HIVE-17714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248970#comment-16248970 ]

Vihang Karajgaonkar commented on HIVE-17714:
--------------------------------------------

I looked into this a bit more and followed the history of the SerDe changes related to
this. Initially, I thought of moving the Serializer, Deserializer and AbstractSerDe classes
to storage-api. This turned out to be pretty straightforward, with no backward compatibility
implications, since the package name of the moved classes remains the same.

However, this may not solve the problem entirely, because the standalone Metastore JVM
will still need these jars in its classpath to instantiate the Deserializer and get the
schema from it at runtime. SerDe implementations are spread all over the code, and I am afraid
that bringing in one jar will bring in the rest of the world in terms of dependencies. This
is probably not an issue in the embedded mode of metastore, because metastore resides in the
HS2 process and has access to all the hive jars anyway, but in the case of a remote standalone
metastore it doesn't make sense to add all these jars to the classpath at runtime.

I was also a bit confused by this [line of code here|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java#L980]
in {{Table.java}}, where it says that any SerDe which is a subclass of AbstractSerDe should
store the fields information in metastore, while {{AbstractSerDe}} itself returns {{false}}
in {{shouldStoreFieldsInMetastore}}, which is contradictory.
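
To make the mismatch concrete, here is a minimal, self-contained sketch. It is modeled only on the description above, not copied from {{Table.java}}: the nested {{AbstractSerDe}} is a hypothetical stand-in for the Hive class, {{ExampleSerDe}} and {{treatedAsStoredInMetastore}} are made-up names, and the real {{shouldStoreFieldsInMetastore}} may take arguments that are left out here.

```java
public class SerDeContradictionSketch {

    // Hypothetical stand-in for Hive's AbstractSerDe (heavily simplified).
    static abstract class AbstractSerDe {
        // Default as described in the comment above: returns false.
        boolean shouldStoreFieldsInMetastore() {
            return false;
        }
    }

    // Hypothetical SerDe that does not override the default.
    static class ExampleSerDe extends AbstractSerDe {
    }

    // Assumption: models the rule as the comment describes it, i.e. every
    // subclass of AbstractSerDe is treated as storing its fields in metastore.
    static boolean treatedAsStoredInMetastore(Object serde) {
        return serde instanceof AbstractSerDe;
    }

    public static void main(String[] args) {
        ExampleSerDe serde = new ExampleSerDe();
        // The two answers disagree for the same SerDe instance.
        System.out.println("caller-side rule:   " + treatedAsStoredInMetastore(serde));
        System.out.println("SerDe's own answer: " + serde.shouldStoreFieldsInMetastore());
    }
}
```

In this sketch the caller-side rule and the SerDe's own default give opposite answers for the same object, which is the inconsistency described above.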

Based on what I have looked at so far, there is no easy way for this issue and HIVE-17580
to solve the problem consistently for all the use-cases without breaking backwards
compatibility. I propose we make the following changes:

1. Change {{AbstractSerDe:shouldStoreFieldsInMetastore}} to return {{true}}.
The code already behaves as if it returns true, based on what we see in Table.java above. We
then claim that all the SerDe implementations which extend AbstractSerDe will store their schema
in metastore unless the method is explicitly overridden to return false. This should cover all
the SerDes in the Hive source code, since HIVE-15167 moved them to subclass AbstractSerDe instead
of directly implementing the interfaces.
2. We move the Serializer, Deserializer and AbstractSerDe classes to storage-api.
This enables metastore to consume them without having to create a compile time dependency
on hive.
3. We claim that users who implement the Serializer/Deserializer interfaces directly and
still want metastore to store the schema for them should make sure that their jar can be
added to the classpath of the standalone metastore; metastore will then use the existing
mechanism to load the SerDe class and deserialize from it.
4. Add a check in {{HiveMetaStoreUtils.getFieldsFromDeserializer}} to throw an exception before
trying to use the deserializer to get the schema if the implementation of {{shouldStoreFieldsInMetastore}}
returns false. I don't think metastore can ever be 100% sure if a SerDe declares that fields
are not supposed to be stored in metastore.
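
A rough sketch of how items 1, 3 and 4 could fit together is below. It is an illustration under assumptions, not the proposed patch: the {{Deserializer}} interface is reduced to a single hypothetical {{getFieldNames}} method, all class names other than {{AbstractSerDe}} are invented, {{shouldStoreFieldsInMetastore}} is shown without arguments, and the choice of {{IllegalStateException}} is arbitrary. Item 4 is read literally here.

```java
import java.util.Arrays;
import java.util.List;

public class MetastoreSchemaSketch {

    // Hypothetical, heavily simplified stand-in for the Deserializer interface.
    interface Deserializer {
        List<String> getFieldNames();
    }

    // Stand-in for AbstractSerDe with the default flipped to true (item 1).
    static abstract class AbstractSerDe implements Deserializer {
        boolean shouldStoreFieldsInMetastore() {
            return true;
        }
    }

    // Hypothetical SerDe that keeps the new default.
    static class MetastoreBackedSerDe extends AbstractSerDe {
        public List<String> getFieldNames() {
            return Arrays.asList("id", "name");
        }
    }

    // Hypothetical SerDe that explicitly opts out by overriding the default.
    static class ExternalSchemaSerDe extends AbstractSerDe {
        @Override
        boolean shouldStoreFieldsInMetastore() {
            return false;
        }

        public List<String> getFieldNames() {
            return Arrays.asList("from_external_schema");
        }
    }

    // Hypothetical user class implementing the interface directly (item 3).
    static class DirectDeserializer implements Deserializer {
        public List<String> getFieldNames() {
            return Arrays.asList("custom");
        }
    }

    // Item 4, read literally: refuse to use the deserializer for the schema
    // when the SerDe says its fields are not stored in metastore.
    static List<String> getFieldsFromDeserializer(Deserializer deserializer) {
        if (deserializer instanceof AbstractSerDe
                && !((AbstractSerDe) deserializer).shouldStoreFieldsInMetastore()) {
            throw new IllegalStateException(
                "SerDe declares that its fields are not stored in metastore: "
                    + deserializer.getClass().getName());
        }
        return deserializer.getFieldNames();
    }

    public static void main(String[] args) {
        System.out.println(getFieldsFromDeserializer(new MetastoreBackedSerDe()));
        System.out.println(getFieldsFromDeserializer(new DirectDeserializer()));
        try {
            getFieldsFromDeserializer(new ExternalSchemaSerDe());
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this sketch, direct implementors of the interface pass straight through to the deserializer, which corresponds to the case in item 3 where the user's jar is on the metastore classpath.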

[~sershe] and [~alangates] What do you guys think about these suggestions?

> move custom SerDe schema considerations into metastore from QL
> --------------------------------------------------------------
>
>                 Key: HIVE-17714
>                 URL: https://issues.apache.org/jira/browse/HIVE-17714
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Alan Gates
>
> Columns in metastore for tables that use external schema don't have the type information
> (since HIVE-11985) and may be entirely inconsistent (since forever, due to issues like HIVE-17713;
> or for SerDes that allow an URL for the schema, due to a change in the underlying file).
> Currently, if you trace the usage of ConfVars.SERDESUSINGMETASTOREFORSCHEMA, and to MetaStoreUtils.getFieldsFromDeserializer,
> you'd see that the code in QL handles this in Hive. So, for the most part metastore just returns
> whatever is stored for columns in the database.
> One exception appears to be get_fields_with_environment_context, which is interesting...
> so getTable will return incorrect columns (potentially), but get_fields/get_schema will return
> correct ones from SerDe as far as I can tell.
> As part of separating the metastore, we should make sure all the APIs return the correct
> schema for the columns; it's not a good idea to have everyone reimplement getFieldsFromDeserializer.
> Note: this should also remove a flag introduced in HIVE-17731



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
