hive-user mailing list archives

From Raghu Murthy <>
Subject Re: first tests with hive
Date Thu, 26 Feb 2009 11:44:26 GMT
Hi Arijit,

Hive uses HDFS as its underlying storage. When a table is partitioned by
some column, say 'ds', Hive stores each partition in a separate HDFS
directory under the table's directory. So, if there are two partitions with
ds values '2009-02-20' and '2009-02-21', their data is stored in two
corresponding HDFS directories. This allows Hive to scan only the HDFS
files necessary for a given query (via input pruning). For more information
about the Hive data model, see the data model section of the Hive
documentation.
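As a rough sketch of what this looks like in practice (the table name and columns here are made up, and the warehouse path shown is only the usual default, which depends on your configuration):

```sql
-- A table partitioned by the string column 'ds'.
CREATE TABLE page_views (
  userid BIGINT,
  url    STRING
)
PARTITIONED BY (ds STRING);

-- Each partition becomes its own directory under the table's
-- directory in the HDFS warehouse, e.g.:
--   /user/hive/warehouse/page_views/ds=2009-02-20
--   /user/hive/warehouse/page_views/ds=2009-02-21

-- A query that filters on the partition column only reads the
-- matching directory (input pruning); the other partitions are
-- never scanned:
SELECT url FROM page_views WHERE ds = '2009-02-20';
```

Note that 'ds' is a virtual column in the sense that its value is encoded in the directory name rather than stored inside the data files themselves.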

The actual distribution of the data files across nodes is determined by
HDFS, which replicates each block across a configurable number of nodes.
See the HDFS architecture documentation for more details.
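Replication is an HDFS setting rather than a Hive one. For example, the default number of replicas per block can be set in hdfs-site.xml (3 below is just the usual default, not something Hive requires):

```xml
<!-- hdfs-site.xml: default number of replicas per HDFS block -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```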

Hope this helps.

On 2/26/09 3:30 AM, "Arijit Mukherjee" <> wrote:

> Hi All
> I'm a newbie to hadoop and hive and am trying to set it up on a cluster. I am
> trying to find out more about the partitioning as done in Hive. If I use a
> create table statement with a "partitioned by" clause, which as per the
> documentation is a virtual column, is the data physically partitioned on
> multiple nodes (meaning would the different nodes have different subsets of
> the actual data)? Is it possible to check the content of each partition?
> Actually, I'm trying to compare the concepts of Hive with some other
> frameworks such as Greenplum where the data is distributed across nodes.
> Any help/pointers are appreciated. Thanx in advance.
> Cheers
> Arijit
