hive-dev mailing list archives

From "Julian Hyde (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-8467) Table Copy - Background, incremental data load
Date Wed, 15 Oct 2014 18:51:37 GMT

    [ https://issues.apache.org/jira/browse/HIVE-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172767#comment-14172767 ]

Julian Hyde commented on HIVE-8467:
-----------------------------------

I see this as a particular kind of materialized view. In general, a materialized view is a
table whose contents are guaranteed to be the same as executing a particular query. In this
case, that query is simply 'select * from t'.
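
To make that guarantee concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than Hive. The tables t and c, their columns, and the helper names are all hypothetical; the only point is that the copy is "correct" exactly when it holds the same rows as the defining query.

```python
import sqlite3

# Hypothetical base table t; the schema and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k1 INTEGER, k2 INTEGER, v TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 1, "a"), (1, 2, "b"), (2, 1, "c")],
)

DEFINING_QUERY = "SELECT * FROM t"


def refresh_copy(conn):
    """Rebuild the copy c so that it matches the defining query again."""
    conn.execute("DROP TABLE IF EXISTS c")
    conn.execute("CREATE TABLE c AS " + DEFINING_QUERY)


def copy_is_consistent(conn):
    """Return True when c holds exactly the rows the defining query returns."""
    expected = sorted(conn.execute(DEFINING_QUERY).fetchall())
    actual = sorted(conn.execute("SELECT * FROM c").fetchall())
    return expected == actual


refresh_copy(conn)
fresh = copy_is_consistent(conn)

# A new row lands in t; the copy has not been refreshed yet.
conn.execute("INSERT INTO t VALUES (3, 1, 'd')")
stale = copy_is_consistent(conn)

refresh_copy(conn)
refreshed = copy_is_consistent(conn)
```

A full rebuild is the crudest possible refresh; an incremental load, as the issue proposes, would need to restore the same invariant without rescanning all of t.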

We don't have materialized view support yet, but I have been working on lattices in Calcite
(formerly known as Optiq) (see OPTIQ-344) and there is a lot of interest in adding them to
Hive. Each materialized "tile" in a lattice is a materialized view of the form 'select d1,
d2, sum(m1), count(m2) from t group by d1, d2'.
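
As a rough illustration of why such a tile is useful, the sketch below (again sqlite3, not Hive or Calcite; the data and the names tile, sum_m1 and count_m2 are made up) materializes one tile and then answers a coarser query, grouped by d1 only, from the tile instead of from t. Sums roll up by summing, and counts roll up by summing the stored counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (d1 TEXT, d2 TEXT, m1 INTEGER, m2 INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        ("x", "p", 10, 1),
        ("x", "p", 5, None),  # NULL m2 is not counted by count(m2)
        ("x", "q", 7, 3),
        ("y", "p", 2, 4),
    ],
)

# Materialize one tile: the (d1, d2) aggregate of t.
conn.execute(
    "CREATE TABLE tile AS "
    "SELECT d1, d2, sum(m1) AS sum_m1, count(m2) AS count_m2 "
    "FROM t GROUP BY d1, d2"
)

# The same coarser question, asked of the base table and of the tile.
from_base = conn.execute(
    "SELECT d1, sum(m1), count(m2) FROM t GROUP BY d1 ORDER BY d1"
).fetchall()
from_tile = conn.execute(
    "SELECT d1, sum(sum_m1), sum(count_m2) FROM tile GROUP BY d1 ORDER BY d1"
).fetchall()
```

Both queries return the same rows, but the second reads the (usually much smaller) tile.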

So, let's talk about whether we could change the syntax to 'create materialized view' and
still deliver the functionality you need. Of course, if the user enters anything other than
'select * from t order by k1, k2', they would get an error.

In terms of query planning, I strongly recommend that you build on the CBO work powered by
Calcite. Let's suppose there is a table T and a copy C. After translating the query to a Calcite
RelNode tree, there will be a TableAccessRel(T). After reading the metadata, we should create
a TableAccessRel(C) and tell Calcite that it is equivalent.

That's all you need to do. Calcite will take it from there. Assuming the stats indicate that
C is better (and they should, right, because the ORC representation will be smaller?) then
the query will end up using C. But if, say, T has a partitioning scheme which is more suitable
for a particular query, then Calcite will choose T.
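
The choice described above can be pictured with a toy model. This is not Calcite's API: the Scan class, the cost function and the 1/10 pruning factor are all invented for illustration. It only shows the idea that once T and C are registered as equivalent, the cheaper one wins for each query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    """A hypothetical table scan with just enough stats for a toy cost model."""
    table: str
    bytes_on_disk: int
    partition_columns: tuple = ()


def estimated_cost(scan, filter_column=None):
    """Toy cost: bytes read. A filter on a partition column is assumed
    (arbitrarily) to prune the scan down to one tenth of the data."""
    if filter_column in scan.partition_columns:
        return scan.bytes_on_disk / 10
    return scan.bytes_on_disk


def choose_scan(equivalent_scans, filter_column=None):
    """Pick the cheapest scan among scans known to return the same rows."""
    return min(equivalent_scans, key=lambda s: estimated_cost(s, filter_column))


# T is large but partitioned by a made-up column ds; C is a smaller, unpartitioned copy.
t = Scan("T", bytes_on_disk=1000, partition_columns=("ds",))
c = Scan("C", bytes_on_disk=300)
equivalent = [t, c]

full_scan_choice = choose_scan(equivalent)        # no filter: the smaller copy wins
filtered_choice = choose_scan(equivalent, "ds")   # filter on ds: the partitioned table wins
```

In the real planner the costs would come from table statistics, and the equivalence would be declared once when the metadata is read.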

> Table Copy - Background, incremental data load
> ----------------------------------------------
>
>                 Key: HIVE-8467
>                 URL: https://issues.apache.org/jira/browse/HIVE-8467
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Rajat Venkatesh
>         Attachments: Table Copies.pdf
>
>
> Traditionally, Hive and other tools in the Hadoop ecosystem haven't required a load stage.
> However, with recent developments, Hive is much more performant when data is stored in specific
> formats like ORC, Parquet, Avro etc. Technologies like Presto also work much better with
> certain data formats. At the same time, data is generated or obtained from 3rd parties in
> non-optimal formats such as CSV, tab-delimited or JSON. Often, it's not an option to change
> the data format at the source. We've found that users either use sub-optimal formats or spend
> a large amount of effort creating and maintaining copies. We want to propose a new construct,
> Table Copy, to help “load” data into an optimal storage format.
> I am going to attach a PDF document with a lot more details, especially addressing how
> this is different from bulk loads in relational DBs or materialized views.
> Looking forward to hearing whether others see a similar need to formalize conversion of data
> to different storage formats. If yes, are the details in the PDF document a good start?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
