hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Divya Gehlot <divya.htco...@gmail.com>
Subject RE: how does Hive Partitioning works ?
Date Wed, 30 Dec 2015 14:11:08 GMT
My apologies for making you confused .
Steps which I followed are :
1.I created the external table  with partitioning and gave the hdfs
location of the data (definitely the data files resides there )
My external table location ( inputs data directory structure) looks like
below:
/tmp/TestDivya/HiveInput/"many csvs files"
2.As I want to implement partitioning  on daily basis .  I haves used the
Alter table add partition (year=2015 , month=01 ,day=01) set location
'/tmp/Hiveinput/2015/01/01'

Now my confusion arises here,when is when I execute the alter statement :
1.As I already gave the location while creating the external table . How
does the above alter table statement works ?
Which data will get loaded here as I have so many  CSV field inside my give
input directory.
Sorry for the naive question .
But I am trying to understand how the alter table statement works .
As I will be executing only above two steps
1.Creating the external table with location and  partitioning.
2.altering the table by adding partition and giving location .

I do understand the loading the input data for particular location into the
partition .

But so much Confused at alter table. statement .

Once again Thanks a lot for your help and guidance.
Thanks,
Divya
On Dec 30, 2015 8:20 PM, "Mich Talebzadeh" <mich@peridale.co.uk> wrote:

> Hi,
>
>
>
> I gave a generic reply to partitioning and bucketing in Hive back in April
> for a similar question have a check on this. Here we go. Hope it helps
>
>
>
>
>
> As you may know already in RDBMS partitioning (dividing a very large table
> into sub-tables conceptually) is deployed to address three areast.
>
>
>
> 1.     Availability -- each partition can reside on a different
> tablespace/device. Hence a problem with a tablespace/device will take out a
> slice of the table's data instead of the whole thing. This does not really
> ap[ply to Hive with 3 block replication as standard
>
> 2.     Manageability -- partitioning provides a mechanism for splitting
> whole table jobs into clear batches. Partition exchange can make it easier
> to bulk load data. Defragging, moving older partitions to lower tier
> storage, updating stats etc Most of these benefits apply to Hive as well.
> Please check the docs.
>
> 3.     Performance -- partition elimination
>
>
>
> In simplest form (excluding composite partitioning), Hive partitioning
> will be similar to “range partitioning” in RDBMS. One can partition a table
> (say *partitioned_table* as shown below which is batch loaded from
> *non_partitioned_table*) -- by country, year, month etc. Each partition
> will be stored in Hive under sub-directory *table/year/month* like below
>
>
>
> /user/hive/warehouse/scratchpad.db
> */partitioned_table/country=Italy/year=2014/month=Feb*
>
>
>
> Hive does not have the concept of indexes local or global as yet. So
> without partitioning a simple query in Hive will have to read the entire
> table even if it is filtering a smaller result set (WHERE CLAUSE). This
> becomes a bottleneck for running multiple MapReduce jobs over a large table. So
> partitioning will help localise the query by hitting the relevant
> sub-directory or sub-directories only. There is another important aspect
> with Hive as well. The locking granularity will be determined by the lowest
> slice in the filing system (sub-directory). So entering data into the above
> partition/file, will take an exclusive lock on that partition/file but
> crucially the rest of partitions will be available (assuming concurrency in
> Hive is enabled).
>
>
>
>
> +----------+-------------+------------------------+------------------------------------+-------------+--------------+-----------------+-----------------+----------------+---------+-----------+--+
>
> |  lockid  |  database   |         table          |
> partition              | lock_state  |  lock_type   | transaction_id  |
> last_heartbeat  |  acquired_at   |  user   | hostname  |
>
>
> +----------+-------------+------------------------+------------------------------------+-------------+--------------+-----------------+-----------------+----------------+---------+-----------+--+
>
> | Lock ID  | Database    | Table                  |
> Partition                          | State       | Type         |
> Transaction ID  | Last Hearbeat   | Acquired At    | User    | Hostname  |
>
> | 1711     | scratchpad  | non_partitioned_table  |
> NULL                               | ACQUIRED    | *SHARED_READ*  |
> NULL            | 1428862154670   | 1428862151904  | hduser  | rhes564   |
>
> | 1711     | scratchpad  | *partitioned_table      |
> country=Italy/year=2014/month=Feb*  | ACQUIRED    | *EXCLUSIVE *   |
> NULL            | 1428862154670   | 1428862151905  | hduser  | rhes564   |
>
>
> +----------+-------------+------------------------+------------------------------------+-------------+--------------+-----------------+-----------------+----------------+---------+-----------+--+
>
>
>
> Now your point 2, bucketing in Hive refers to hash partitioning where a
> hashing function is applied. Likewise an RDBMS, Hive will apply a linear
> hashing algorithm to prevent data from clustering within specific
> partitions. Hashing is very effective if the column selected for bucketing
> has very high selectivity like an ID column where selectivity (*select
> count(distinct(column))/count(column)* ) = 1.  In this case, the created
> partitions/ files will be as evenly sized as possible. In a nutshell
> bucketing is a method to get data evenly distributed over many
> partitions/files.  One should define the number of buckets by a power of
> two -- 2^n,  like 2, 4, 8, 16 etc to achieve best results. Again bucketing
> will help concurrency in Hive. It may even allow a *partition wise join*
> i.e. a join between two tables that are bucketed on the same column with
> the same number of buckets (anyone has tried this?)
>
>
>
> One more things. When one defines the number of buckets at table creation
> level in Hive, the number of partitions/files will be fixed. In contrast,
> with partitioning you do not have this limitation.
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
> *From:* Divya Gehlot [mailto:divya.htconex@gmail.com]
> *Sent:* 30 December 2015 10:44
> *To:* user@hive.apache.org
> *Subject:* how does Hive Partitioning works ?
>
>
>
> Hi,
>
> I am new bee to hive and trying to understand the hive partitioning .
> My files are in CSV format
> Steps which I followed
> CREATE EXTERNAL TABLE IF NOT EXISTS loan_depo_part(COLUMN1 String ,COLUMN2
> String ,COLUMN3 String ,
>                                                COLUMN4 String,COLUMN5
> String,COLUMN6 String,
>                                                COLUMN7 Int ,COLUMN8 Int
> ,COLUMN9 String ,
>                                                COLUMN10 String ,COLUMN11
> String ,COLUMN12 String,
>                                                COLUMN13 String ,COLUMN14
> String ,
>                                                COLUMN15 String ,COLUMN16
> String ,
>                                                COLUMN17 String ,COLUMN18
> String ,
>                                                COLUMN19 String ,COLUMN20
> String ,
>                                                COLUMN21 String ,COLUMN22
> String )
> COMMENT 'testing Partition'
> PARTITIONED BY (Year String,Month String ,Day String)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> STORED AS TEXTFILE
> TBLPROPERTIES ("skip.header.line.count"="1") ;
>
>
>
> ALTER TABLE loan_depo_part ADD IF NOT EXISTS PARTITION(
> Year=2015,Month=01,Day=01);
>
> ALTER TABLE loan_depo_part PARTITION(Year=2015,Month=01,Day=01)
>  SET LOCATION
> 'hdfs://namenode:8020/tmp/TestDivya/HiveInput/year=2015/month=01/day=01/';
>
>
>
> Whereas my HDFS data location is
> /TestDivya/HiveInput/year=2015/month=01/day=01/
>
> I have few queries regarding the above partioning :
>
> 1. It creates the table when run the second step and gives the select
> command it doesnt diplay any data
>
> 2. Do I need to create normal external table first and the partitioned one
> next
>
>  and then do the insert overwrite.
>
> Basically I am not able to understand the partioning things mentioned
> above
>
> I followed this link
> <http://deanwampler.github.io/polyglotprogramming/papers/Hive-SQLforHadoop.pdf>
>
> Would really appreciate the help/pointers.
>
> Thanks,
>
> Divya
>
>
>

Mime
View raw message