hive-user mailing list archives

From Edward Capriolo <>
Subject Re: Proposal: File based metastore
Date Tue, 30 Jan 2018 19:09:24 GMT
On Tue, Jan 30, 2018 at 1:16 PM, Ryan Blue <> wrote:

> Thanks, Owen.
> I agree, Iceberg addresses a lot of the problems that you're hitting here.
> It doesn't quite go as far as moving all metadata into the file system. You
> can do that in HDFS and on implementations that support atomic rename, but
> not in S3 (Iceberg has an implementation of this strategy for HDFS). For S3
> you need some way of making commits atomic, for which we are using a
> metastore that is far more lightweight. You could also use a ZooKeeper
> cluster for write-side locking, or maybe there are other clever ideas out
> there.
> Even if Iceberg is agnostic to the commit mechanism, it does almost all of
> what you're suggesting and does it in a way that's faster than the current
> metastore while providing snapshot isolation.
> rb
> On Mon, Jan 29, 2018 at 9:10 AM, Owen O'Malley <>
> wrote:
>> You should really look at what the Netflix guys are doing on Iceberg.
>> They have put a lot of thought into how to efficiently handle tabular
>> data in S3. They put all of the metadata in S3 except for a single link to
>> the name of the table's root metadata file.
>> Other advantages of their design:
>>    - Efficient atomic addition and removal of files in S3.
>>    - Consistent schema evolution across formats
>>    - More flexible partitioning and bucketing.
>> .. Owen
>> On Sun, Jan 28, 2018 at 12:02 PM, Edward Capriolo <>
>> wrote:
>>> All,
>>> I have been bouncing around the earth for a while and have had the
>>> privilege of working at 4-5 places. On arrival, each place was at a
>>> different stage of its Hadoop journey.
>>> One large company that I was at had a ~200 TB Hadoop cluster. They
>>> actually ran Pig, and their ops group REFUSED to support Hive, even though
>>> they had written thousands of lines of Pig macros to deal with selecting
>>> from a partition, or a Pig script file you would import so you would know
>>> what the columns of the data at location /x/y/z were.
>>> In another lifetime I was at a shop that used SCALDING. Again, lots
>>> of custom effort there with Avro and Parquet, all to do things that Hive
>>> does out of the box. Again, the biggest challenge is the thrift service
>>> and metastore.
>>> In the cloud, many people will use a bootstrap script or 'msck repair'.
>>> The "rise of the cloud" has changed us all the metastore is being a
>>> database is a hard paradigm to support. Imagine for example I created data
>>> to an s3 bucket with hive, and another group in my company requires read
>>> only access to this data for an ephemeral request. Sharing the data is
>>> easy, S3 access can be granted, sharing the metastore and thrift services
>>> are much more complicated.
>>> So let's think outside the box:
>>> Datastax was able to build a platform where the filesystem and the
>>> metastore were baked into Cassandra. Even though an HBase user would not
>>> want that, the novel thing about that approach is that the metastore was
>>> not "some extra thing in a database" that you had to deal with.
>>> What I am thinking is that, for the user of S3, the metastore should be
>>> in S3, probably in hidden files inside the warehouse/table directory(ies).
>>> Think of it as "msck repair" on the fly.
>>> The implementation could be something like this: on startup, read
>>> hive.warehouse.dir and look for "_warehouse". That would help us locate
>>> the databases; in the databases we can locate tables, and with the
>>> tables we can locate partitions.
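
A minimal sketch of that discovery walk. The "_warehouse" marker and the
database/table directory layout are hypothetical names from this proposal;
only the Hadoop FileSystem API calls are real:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WarehouseScan {
  // Walks warehouseDir/<database>/<table>, gated on a "_warehouse" marker.
  public static List<Path> findTables(Configuration conf, Path warehouseDir)
      throws IOException {
    FileSystem fs = warehouseDir.getFileSystem(conf);
    List<Path> tables = new ArrayList<>();
    if (!fs.exists(new Path(warehouseDir, "_warehouse"))) {
      return tables; // not a managed warehouse root
    }
    for (FileStatus db : fs.listStatus(warehouseDir)) {
      if (!db.isDirectory()) continue;
      for (FileStatus table : fs.listStatus(db.getPath())) {
        if (table.isDirectory()) tables.add(table.getPath());
      }
    }
    return tables;
  }
}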
>>> This will, of course, scale horribly across tables with 90000000
>>> partitions, but that would not be our use case. For all the people with
>>> "msck repair" in their bootstrap, this is a much cleaner way of using Hive.
>>> The implementations could even be "stacked": files first, metastore
>>> lookback second.
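
A sketch of that stacking, under hypothetical names; neither TableLookup nor
these classes exist in Hive:

import java.util.Optional;

// Hypothetical carrier for a table definition (columns, partitions, ...).
class TableDef {
  final String name;
  TableDef(String name) { this.name = name; }
}

interface TableLookup {
  Optional<TableDef> find(String db, String table);
}

class StackedLookup implements TableLookup {
  private final TableLookup files;      // hidden files in the warehouse dir
  private final TableLookup metastore;  // classic thrift/RDBMS metastore

  StackedLookup(TableLookup files, TableLookup metastore) {
    this.files = files;
    this.metastore = metastore;
  }

  @Override
  public Optional<TableDef> find(String db, String table) {
    // Files first, metastore lookback second.
    Optional<TableDef> def = files.find(db, table);
    return def.isPresent() ? def : metastore.find(db, table);
  }
}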
>>> It would also be wise to have a tool available in the CLI, "metastore
>>> <table> toJson", making it drop-dead simple to export the schema
>>> definitions.
>>> Thoughts?
> --
> Ryan Blue


Super great work, by the way. Some of the mechanisms are things that Hive
could do and in some cases already does. For a long time we have had
something close to the "implementations that support atomic rename" idea in
SymlinkTextInputFormat:

* This input split wraps the FileSplit generated from
* TextInputFormat.getSplits(), while setting the original link file path
* as job input path. This is needed because MapOperator relies on the
* job input path to lookup correct child operators. The target data file
* is encapsulated in the wrapped FileSplit.

We already, in some cases, intercept FileInputFormat.getSplits().
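
A minimal sketch of the link-file mechanism those wrapped splits rely on: a
small manifest whose lines name the real data files. The helper below is
illustrative, not Hive's actual implementation:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LinkFileReader {
  // Reads one target data-file path per line from a link file.
  public static List<Path> targets(Configuration conf, Path linkFile)
      throws IOException {
    FileSystem fs = linkFile.getFileSystem(conf);
    List<Path> paths = new ArrayList<>();
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(linkFile), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        if (!line.isEmpty()) paths.add(new Path(line));
      }
    }
    return paths;
  }
}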

Impala also maintains its own file metadata, which falls out of sync if it
is edited outside of Impala. I.e., if I "insert into" a partition from Hive,
Impala is unaware and you have to issue "refresh".

Things like write-side locking using ZK are more of an implementation
detail. I agree that it is non-trivial, but if there are scores of people
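
For what it is worth, a minimal write-side lock sketch using Apache Curator;
the ZooKeeper address and the /hive/locks/<db>/<table> path layout are
assumptions, not an existing convention:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class WriteLockExample {
  public static void main(String[] args) throws Exception {
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        "zk1:2181", new ExponentialBackoffRetry(1000, 3));
    zk.start();
    InterProcessMutex lock = new InterProcessMutex(zk, "/hive/locks/db/t");
    lock.acquire();               // block until we hold the write lock
    try {
      // commit new table metadata here
    } finally {
      lock.release();
    }
    zk.close();
  }
}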

On snapshot isolation: it is like a "super tight" version of
SymlinkTextInputFormat. I.e., I can "atomically" compose a file that
describes which files should be in the table, and during the planning phase
I do not need Hadoop to calculate splits.
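
A sketch of that atomic composition: write the new file list to a temp path,
then rename it into place. The rename is atomic on HDFS but not on S3, which
is exactly why S3 needs an external commit mechanism; the class and naming
below are illustrative:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ManifestCommit {
  public static void commit(Configuration conf, Path manifest,
                            List<String> dataFiles) throws IOException {
    FileSystem fs = manifest.getFileSystem(conf);
    Path tmp = new Path(manifest.getParent(), "." + manifest.getName() + ".tmp");
    try (FSDataOutputStream out = fs.create(tmp, true)) {
      for (String f : dataFiles) {
        out.write((f + "\n").getBytes(StandardCharsets.UTF_8));
      }
    }
    if (!fs.rename(tmp, manifest)) {      // the atomic "commit" step
      throw new IOException("lost the race committing " + manifest);
    }
  }
}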
