hive-issues mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-11500) implement file footer / splits cache in HBase metastore
Date Sat, 08 Aug 2015 00:51:45 GMT

     [ https://issues.apache.org/jira/browse/HIVE-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated HIVE-11500:
------------------------------------
    Description: 
We need to cache file metadata (e.g. ORC file footers) for split generation (which, on FSes
that support fileId, will be valid permanently and only needs to be removed lazily when ORC
file is erased or compacted), and potentially even some information about splits (e.g. grouping
based on location that would be good for some short time), in HBase metastore.
It should be queryable by table. Partition predicate pushdown should be supported. If bucket
pruning is added, that too. 

In later phases, it would be nice to save the (first category above) results of expensive
work done by jobs, e.g. data size after decompression/decoding per column, etc. to avoid surprises
when ORC encoding is very good, or very bad. Perhaps it can even be lazily generated. Here's
a pony: 🐴

  was:
We need to cache footer data for split generation (which, on FSes that support fileId, will
be valid permanently and only needs to be removed lazily when ORC file is erased or compacted),
and potentially even some information about splits (e.g. grouping based on location that would
be good for some short time), in HBase metastore.
It should be queryable by table. Partition predicate pushdown should be supported. If bucket
pruning is added, that too. 

In later phases, it would be nice to save the (first category above) results of expensive
work done by jobs, e.g. data size after decompression/decoding per column, etc. to avoid surprises
when ORC encoding is very good, or very bad. Perhaps it can even be lazily generated. Here's
a pony: 🐴


> implement file footer / splits cache in HBase metastore
> -------------------------------------------------------
>
>                 Key: HIVE-11500
>                 URL: https://issues.apache.org/jira/browse/HIVE-11500
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Metastore
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>
> We need to cache file metadata (e.g. ORC file footers) for split generation (which, on
> FSes that support fileId, will be valid permanently and only needs to be removed lazily when
> ORC file is erased or compacted), and potentially even some information about splits (e.g.
> grouping based on location that would be good for some short time), in HBase metastore.
> It should be queryable by table. Partition predicate pushdown should be supported. If
> bucket pruning is added, that too. 
> In later phases, it would be nice to save the (first category above) results of expensive
> work done by jobs, e.g. data size after decompression/decoding per column, etc. to avoid surprises
> when ORC encoding is very good, or very bad. Perhaps it can even be lazily generated. Here's
> a pony: 🐴
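
As a rough illustration of the idea in the description (footers cached by fileId, queryable by table, stale entries removed lazily), here is a minimal in-memory sketch. It is not the HBase metastore implementation and uses no Hive or HBase API: the class FooterCacheSketch, its method names, and the fileStillExists callback are all hypothetical, and it assumes a single file system so that a fileId alone identifies a file.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.LongPredicate;

/** Hypothetical in-memory stand-in for the proposed cache; not Hive or HBase API. */
final class FooterCacheSketch {
    // fileId -> serialized footer bytes (assumes fileIds are unique within one FS).
    private final Map<Long, byte[]> footersByFileId = new HashMap<>();
    // table name -> fileIds cached for that table, so the cache is queryable by table.
    private final Map<String, Set<Long>> fileIdsByTable = new HashMap<>();

    /** Caches a footer for a file that belongs to the given table. */
    void put(String table, long fileId, byte[] footer) {
        footersByFileId.put(fileId, footer.clone());
        fileIdsByTable.computeIfAbsent(table, t -> new HashSet<>()).add(fileId);
    }

    /** Returns a copy of the cached footer, or null on a cache miss. */
    byte[] get(long fileId) {
        byte[] footer = footersByFileId.get(fileId);
        return footer == null ? null : footer.clone();
    }

    /**
     * Lists the cached fileIds for a table in ascending order. Entries whose
     * file no longer exists (erased or compacted away) are dropped here, on
     * access, rather than eagerly when the file disappears.
     */
    List<Long> fileIdsForTable(String table, LongPredicate fileStillExists) {
        List<Long> live = new ArrayList<>();
        Set<Long> ids = fileIdsByTable.get(table);
        if (ids == null) {
            return live;
        }
        Iterator<Long> it = ids.iterator();
        while (it.hasNext()) {
            long id = it.next();
            if (fileStillExists.test(id)) {
                live.add(id);
            } else {
                it.remove();                 // lazy removal from the table index
                footersByFileId.remove(id);  // and from the footer store
            }
        }
        Collections.sort(live);
        return live;
    }
}
```

Because a fileId never changes for a given file on file systems that support it, a cached footer never needs refreshing, which is why the sketch has no expiry logic and only cleans up when a lookup finds that the file is gone.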



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
