hive-dev mailing list archives

From "Adam Kramer (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-2363) Implicitly CLUSTER BY when dynamically partitioning
Date Tue, 09 Aug 2011 22:53:27 GMT
Implicitly CLUSTER BY when dynamically partitioning
---------------------------------------------------

                 Key: HIVE-2363
                 URL: https://issues.apache.org/jira/browse/HIVE-2363
             Project: Hive
          Issue Type: Improvement
          Components: Query Processor
            Reporter: Adam Kramer
            Priority: Critical


When a query creates partitions dynamically, the underlying implementation reads the output
rows in order, writes them to a single file for as long as the partition column values stay
the same from one row to the next, and closes that file and opens a new one whenever a
partition column value changes. If rows belonging to the same partition are not contiguous,
this can generate far more files than necessary.

The solution is to ensure that all rows sharing a partition column value arrive contiguously
and on the same reducer, i.e., to cluster by the partitioning columns on the way out.
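
As a rough illustration of the effect described above, the following Python sketch simulates
a writer that starts a new file whenever the partition value changes between consecutive
rows. It is a toy model and not Hive's implementation: count_files and the sample rows are
hypothetical, and sorting stands in for clustering on a single reducer.

```python
from itertools import groupby


def count_files(partition_values):
    """Count the files a writer would open if it starts a new file
    each time the partition value changes between consecutive rows."""
    # groupby yields one group per contiguous run of equal values,
    # so the number of groups equals the number of files opened.
    return sum(1 for _ in groupby(partition_values))


# Hypothetical partition column values reaching one reducer, interleaved.
rows = ["2011-08-01", "2011-08-02", "2011-08-01", "2011-08-02", "2011-08-01"]

unclustered = count_files(rows)        # one file per contiguous run
clustered = count_files(sorted(rows))  # sorting stands in for CLUSTER BY

print(unclustered, clustered)
```

In this model the interleaved input opens one file per run of equal values, while the
clustered input opens one file per distinct partition value.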

This improvement is to detect whether a query already clusters by the eventual partition
columns and, if it does not, to do so as an additional step at the end of the query. This
could save a lot of space by reducing the number of files produced.
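
The detection step might look something like the sketch below. This is only one possible
reading of the proposal, not code from Hive: needs_implicit_cluster is a hypothetical helper,
and it assumes that clustering counts as sufficient only when the CLUSTER BY columns are
exactly the dynamic partition columns (in any order, compared case-insensitively).

```python
def needs_implicit_cluster(cluster_by_cols, partition_cols):
    """Return True if an extra clustering step would be added.

    Assumption: the existing clustering is sufficient only when its
    columns are exactly the dynamic partition columns, in any order.
    """
    if not partition_cols:
        # No dynamic partition columns, so nothing to cluster by.
        return False
    # Compare column names case-insensitively, ignoring order.
    existing = {c.lower() for c in cluster_by_cols}
    wanted = {c.lower() for c in partition_cols}
    return existing != wanted


# Hypothetical queries: (CLUSTER BY columns, dynamic partition columns).
print(needs_implicit_cluster([], ["ds"]))
print(needs_implicit_cluster(["country", "ds"], ["ds", "country"]))
```

The first call models a query with no CLUSTER BY, which would get the extra step; the second
models a query that already clusters by both partition columns, which would not.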

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
