hive-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <>
Subject [jira] Commented: (HIVE-126) Don't fetch information on Partitions from HDFS instead of MetaStore
Date Mon, 08 Dec 2008 19:30:44 GMT


Joydeep Sen Sarma commented on HIVE-126:

yes - the code was put in there as a safeguard. the history here is that we migrated our current
hive warehouse from an older version of the software and were worried about not capturing
all the older partitions in the new metastore. we kind of knew that the code was a hack -
but it was purely a defensive measure.

couple of comments:
- we should move all metadata logic (including hacks if any :-)) - to the metastore server
side. otherwise we are creating a different view for Java vs. Thrift Clients.
- yes - +1 on an fsck-type command to replace this hack. i would actually like to run such
a command on our current tables before removing this hack.

the core issue is whether we can make this change without having an fsck-like utility in some
form (even a custom java program). Such a utility would also preserve some of the current code
for handling this case.


for a command line interface - one might want to check the entire database or just a table
or even just one partition. other metadata checks will also be added over time (for example
- do the file types on disk agree with metadata records, bucketing information etc). So, here's
a strawman proposal for a new command:

alter table <DB>[.TABLE [PARTITION-SPEC]] check [TYPE-LIST]

where TYPE by default is 'all' (check for all kinds of errors), but can be restricted to a
specific type. For example - in this case - we can have a type called 'partitions' (and then
over time we can add other types like 'fileformat' etc.). for v1 - we can just drop the type-list.

the check command can produce a list of things that need to be done to fix the format (like
adding any directories not in the metastore - but in hdfs - to the metastore). actually performing
such steps would require a user confirmation (y/n).
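
as a rough illustration of the check described above, here is a minimal Python sketch. it is not Hive code: the names (Fix, check_partitions, apply_fixes) are hypothetical, the metastore and hdfs listings are passed in as plain lists of partition directory names, and reporting metastore entries with no directory on disk is an added assumption (the proposal only mentions directories missing from the metastore). the check only proposes fixes; nothing is selected for applying until a confirm callback says yes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fix:
    """One proposed repair step (hypothetical structure, for illustration)."""
    action: str     # "add_to_metastore" or "report_missing_dir"
    partition: str  # partition directory name, e.g. "ds=2008-12-08"


def check_partitions(metastore_parts, hdfs_dirs):
    """Compare the two views and return proposed fixes without applying them."""
    metastore_parts = set(metastore_parts)
    hdfs_dirs = set(hdfs_dirs)
    fixes = []
    # Directories on disk that the metastore does not know about.
    for part in sorted(hdfs_dirs - metastore_parts):
        fixes.append(Fix("add_to_metastore", part))
    # Assumption beyond the proposal: also flag metastore entries with no directory.
    for part in sorted(metastore_parts - hdfs_dirs):
        fixes.append(Fix("report_missing_dir", part))
    return fixes


def apply_fixes(fixes, confirm):
    """Return only the fixes the user confirmed; a real tool would apply them here."""
    return [fix for fix in fixes if confirm(fix)]
```

a caller could pass something like `lambda fix: input(f"{fix.action} {fix.partition}? (y/n) ") == "y"` as the confirm callback to get the y/n prompt.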

Java interfaces. We have been pretty cavalier with Java interfaces. right now most of the
Hive public methods (other than the SerDe stuff) are not accessed by any codebase outside Hive.
So i would say just remove them for now - as we go through the code module by module - we
can identify those modules that we actually want to expose publicly.

> Don't fetch information on Partitions from HDFS instead of MetaStore
> --------------------------------------------------------------------
>                 Key: HIVE-126
>                 URL:
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Metastore
>    Affects Versions: 0.19.0
>            Reporter: Johan Oskarsson
>            Assignee: Johan Oskarsson
>             Fix For: 0.19.0
>         Attachments: HIVE-126.patch
> When investigating HIVE-91 an issue came up where the information on what partitions
> a table contains is loaded by listing the directories in the table directory on HDFS. This
> is then used to overrule what is in the MetaStore if any difference is found.
> * Would it not be preferable if MetaStore is the one authority on what the table contains?
> * It will also be a major hassle (or impossible?) to retrieve this information from HDFS
> with external tables that have non standard partition names (HIVE-91), such as: table/2008/01/08/portugal
> where "2008/01/08" is one partition value and "portugal" is another.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
