hive-dev mailing list archives

From "Edward Capriolo (JIRA)" <>
Subject [jira] Commented: (HIVE-493) automatically infer existing partitions of table from HDFS files.
Date Thu, 02 Jul 2009 15:33:47 GMT


Edward Capriolo commented on HIVE-493:

I have a USE CASE for something similar and I wanted to get people's opinions on it. My intake
process is a map/reduce job that takes a list of servers as input. I connect to these servers
via FTP and pull all the new files. We are doing 5-minute logs.

I have a map-only job that writes the files to a static HDFS folder. After the map process
completes I am presented with exactly this problem.

Do I assume the partition is created, and copy the files? I decided to let Hive handle this:

  // Build the LOAD DATA statement for today's partition and fork the Hive CLI to run it.
  String hql = "load data inpath '" + conf.get("") + "/user/ecapriolo/pull/raw_web_log/" + p.getName()
          + "' into table raw_web_data partition (log_date_part='" + dateFormat.format(today.getTime()) + "')";
  System.out.println("Running " + hql);
  String[] run = new String[] { "/opt/hive/bin/hive", "-e", hql };

  LoadThread lt = new LoadThread(run);  // LoadThread wraps the forked CLI invocation
  Thread t = new Thread(lt);

Personally, I do not think we should let users infer their way into Hive's table layout. Users
should have tools, whether these are API-based or HQL-based. I should not have to mix and
match hive -e 'something', map/reduce, and bash scripting to get a job accomplished (I
spent 4 hours trying to get the environment correct for my forked 'hive -e' query). (I probably
should learn more about the Thrift API.)
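For what it's worth, the forked 'hive -e' invocation above can be sketched with ProcessBuilder, which at least makes the child environment explicit instead of inherited by accident. The binary and HADOOP_HOME paths here are assumptions for illustration, not anything from this issue:

```java
import java.util.Arrays;

public class HiveExec {

    // Build the argv for a forked `hive -e <query>` call.
    // The /opt/hive path is an assumed install location.
    static String[] buildCommand(String hql) {
        return new String[] { "/opt/hive/bin/hive", "-e", hql };
    }

    public static void main(String[] args) throws Exception {
        String hql = "select count(1) from raw_web_data";
        String[] cmd = buildCommand(hql);
        System.out.println("Running " + Arrays.toString(cmd));

        // ProcessBuilder lets us pin the environment the CLI sees,
        // rather than debugging whatever the parent JVM inherited:
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.environment().put("HADOOP_HOME", "/opt/hadoop"); // assumed path
        pb.redirectErrorStream(true);
        // pb.start() would actually fork the CLI; omitted here since the
        // hive binary may not exist on the machine running this sketch.
    }
}
```

This is only a sketch of the fork-and-environment plumbing; the Thrift metastore API would avoid forking a CLI entirely.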

But that problem I have already solved. My next problem is also relevant to this discussion. I
now have too many files inside my directory. I am partitioned by day, but each server is dropping
5-minute log files. What I really need now is a COMPACT function to merge all these 5-minute
data files into one. What would be the proper way to handle this? I could take an all-query-based
approach: select all the data into a new table, then drop the partition
and select the data back into the original table. However, I could short-circuit the operations
(and save time) by building the new partition first, deleting the old data, and then moving
the new data back using 'dfs mv'.
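The query-based variant might look roughly like the following, reusing the table and partition names from the example above; the column list is a placeholder, and this is an untested sketch rather than a recommended recipe:

```sql
-- Re-reading and overwriting the same partition rewrites it as the
-- (much smaller) set of output files from the job:
INSERT OVERWRITE TABLE raw_web_data PARTITION (log_date_part='2009-07-02')
SELECT col1, col2, col3   -- list the non-partition columns explicitly
FROM raw_web_data
WHERE log_date_part='2009-07-02';
```

The cost is a full read-and-rewrite of the partition's data, which is exactly what the dfs-level short-circuit above tries to avoid.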

Should this be done through HQL, e.g. "COMPACT TABLE x PARTITION y"? Or should it be a service-style
command, e.g. bin/hive --service compact table X partition Y? Doing it all through HQL is
possible now, but not optimized in some cases, unless I am missing something.

I also think we need easier insight into the metastore from HQL, like MySQL provides. 'show
tables' is a good step, but we need something like a virtual read-only schema table to query.
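As a stopgap, a few statements already expose slices of the metastore from HQL, though nothing like a queryable schema table (syntax assumed from the HQL of this era):

```sql
SHOW TABLES;
SHOW PARTITIONS raw_web_data;
DESCRIBE EXTENDED raw_web_data;
```

None of these can be filtered or joined the way a real information-schema table could be.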

Sorry to be all over the place on this post.

> automatically infer existing partitions of table from HDFS files.
> -----------------------------------------------------------------
>                 Key: HIVE-493
>                 URL:
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor
>    Affects Versions: 0.3.0, 0.3.1, 0.4.0
>            Reporter: Prasad Chakka
> Initially the partition list for a table was inferred from the HDFS directory structure instead
of looking into the metastore (partitions are created using 'alter table ... add partition').
But this automatic inference was removed in favor of the latter approach while checking in the
metastore checker feature, and also to facilitate external partitions.
> Joydeep and Frederick mentioned that it would be simpler for users to create the HDFS directory
and let Hive infer the partition rather than explicitly add one. But doing that raises the following...
> 1) External partitions -- we would have to mix both approaches, so the partition list becomes
a merged list of inferred partitions and registered partitions, and duplicates have to be resolved.
> 2) Partition-level schemas can't be supported. Which schema do we choose for an inferred partition?
The table schema at the time the inferred partition was created, or the latest table schema? How do
we even know the table schema at the time the inferred partition was created?
> 3) If partitions have to be registered, a partition can be disabled without actually
deleting the data. This feature is not supported and may not be that useful, but nevertheless
it can't be supported with inferred partitions.
> 4) Indexes are being added, so if partitions are not registered then indexes for such
partitions cannot be maintained automatically.
> I would like to know what the general thinking about this is among users of Hive. If
inferred partitions are preferred then can we live with the restricted functionality that this

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
