hive-user mailing list archives

From Nishant Kelkar <nishant....@gmail.com>
Subject Re: Data in Hive
Date Thu, 14 Aug 2014 20:06:18 GMT
Hi there!

So Hive stores its data in HDFS, which means it is distributed by default.
How a file is divided up is controlled by HDFS parameters, specifically the
block size (dfs.block.size, named dfs.blocksize in newer Hadoop releases).
The blocks are also replicated (dfs.replication, 3 by default), meaning that
if a data node were to fail, replicas of its blocks on other nodes would be
available to serve the content lost on the failed data node. Here's a good
StackOverflow thread that discusses the file split issue:
http://stackoverflow.com/questions/9678180/change-file-split-size-in-hadoop
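
To make the block arithmetic concrete, here is a rough sketch in Python (not
Hadoop code). The helper name block_layout is made up for illustration, and
the 128 MB block size and replication factor of 3 are assumed values; your
cluster may be configured differently.

```python
def block_layout(file_size_bytes, block_size_bytes=128 * 1024 * 1024, replication=3):
    """Return (number of blocks, total block replicas) for a single file.

    Hypothetical helper: the default block size and replication factor are
    assumptions, not values read from any cluster configuration.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0 or replication <= 0:
        raise ValueError("sizes and replication must be positive")
    # Integer ceiling division: a partially filled last block still counts.
    blocks = -(-file_size_bytes // block_size_bytes)
    return blocks, blocks * replication


# A 1 GB file with the assumed settings.
blocks, replicas = block_layout(1024 ** 3)
print(blocks, replicas)  # 8 24
```

So under these assumptions a 1 GB file becomes 8 blocks, stored as 24 block
replicas spread over the data nodes.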

When Hive creates a table, it basically just records metadata in the Hive
metastore: schema information, comments about each column, HDFS file
location, input formats, etc. The data itself stays in files on HDFS, so a
table is, in essence, a layer of abstraction over raw HDFS files.
Here's a good StackOverflow question that attempts to provide some details:
http://stackoverflow.com/questions/17065672/what-does-the-hive-metastore-and-name-node-do-in-a-cluster

So to answer your question, no, Hive does not move all data to one location
and create a single table. The whole point of using MapReduce as a
framework is to take the compute to the data, not vice versa.
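
As a toy illustration of "compute to the data" (plain Python, not Hive or
MapReduce code): each partition below stands in for the rows held on one
data node, the filter runs against each partition separately, and only the
matching rows are gathered. The function names and the sample rows are made
up for the example.

```python
def local_select(rows, predicate):
    """Stand-in for the work done on one node: filter only that node's rows."""
    return [row for row in rows if predicate(row)]


def distributed_select(partitions, predicate):
    """Filter every partition independently and collect only the matches."""
    results = []
    # On a real cluster these per-partition filters would run in parallel;
    # here they run one after another to keep the sketch simple.
    for rows in partitions:
        results.extend(local_select(rows, predicate))
    return results


# Three hypothetical "data nodes", each holding some (col1, col2) rows of table1.
partitions = [
    [(1, "a"), (-2, "b")],
    [(0, "c"), (5, "d")],
    [(-7, "e"), (3, "f")],
]

# Rough equivalent of: select * from table1 where col1>0
matches = distributed_select(partitions, lambda row: row[0] > 0)
print(matches)  # [(1, 'a'), (5, 'd'), (3, 'f')]
```

The rows that fail the filter never leave their partition; only the results
are brought together.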

Hope that helps!

Thanks and Regards,
Nishant Kelkar


On Thu, Aug 14, 2014 at 7:23 AM, CHEBARO Abdallah <
Abdallah.CHEBARO@murex.com> wrote:

> My goal is to perform a SELECT query using Hive.
>
>
>
> When I have a small amount of data on a single machine (namenode), I start by:
>
> 1-Creating a table that contains this data: create table table1 (col1 int,
> col2 string)
>
> 2-Loading the data from a file path: load data local inpath 'path' into
> table table1;
>
> 3-Performing my SELECT query: select * from table1 where col1>0
>
>
>
> I have a huge amount of data, 10 million rows, that doesn't fit on a single
> machine. Let's assume Hadoop divided my data across, for example, 10
> datanodes and each datanode contains 1 million rows.
>
>
>
> Retrieving the data to a single computer is impossible due to its huge
> size, or would take a lot of time if it were possible.
>
>
>
> Will Hive create a table at each datanode and perform the SELECT query there,
> or will Hive move all the data to one location (datanode) and create one
> table (which is inefficient)?
>
