hive-dev mailing list archives

From "Edward Capriolo (JIRA)" <>
Subject [jira] Commented: (HIVE-599) Embedded Hive SQL into Python
Date Tue, 14 Jul 2009 18:13:15 GMT


Edward Capriolo commented on HIVE-599:

Speaking on behalf of the Python-illiterate, it would be nice to have a simple java overlay.
This could be used by thrift, jdbc, and java command line applications.
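For context, such an overlay might take a shape like the following sketch. These names (HiveRunner, RecordingHiveRunner) are hypothetical, not an existing Hive API; the point is a single small surface instead of touching OptionsProcessor, CliSessionState, and Driver directly:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical overlay interface: one entry point that hides the
// OptionsProcessor / CliSessionState / Driver plumbing. Not a real
// Hive API; just a sketch of the shape such an overlay could take.
interface HiveRunner {
  int execute(String hiveQl);   // run one statement, return its exit code
  List<String> fetchResults();  // rows produced by the last statement
}

// Toy in-memory stand-in so the interface can be exercised; a real
// implementation would delegate to Driver the way the code below does.
class RecordingHiveRunner implements HiveRunner {
  private final List<String> executed = new ArrayList<String>();

  public int execute(String hiveQl) {
    executed.add(hiveQl);
    return 0; // pretend every statement succeeds
  }

  public List<String> fetchResults() {
    return new ArrayList<String>(executed);
  }
}
```

A caller would then only need `execute()` and `fetchResults()`, with exceptions and return codes handled in one place.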

I opened HIVE-617 around a need like this.
Here is my use case:
I have data being pulled and written to a table raw_web_log. Each hour I need to generate
some summary data. My date and hour fields are dynamic. 

Bash-style scripting was going to be complicated, since the date operations were going to be
complex, if not impossible, in pure sh/bash. I had to do things like get the current hour and
figure out the previous hour. From my experience with the HWI I knew how to chop apart the
CLIDriver and get what I wanted from it. This is what I came up with.

import java.io.PrintStream;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Vector;

import org.apache.hadoop.hive.cli.CliSessionState;
import org.apache.hadoop.hive.cli.OptionsProcessor;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.Driver;
import org.apache.hadoop.hive.ql.processors.AddResourceProcessor;
import org.apache.hadoop.hive.ql.processors.SetProcessor;
import org.apache.hadoop.hive.ql.session.SessionState;

public class StatusHourBuilder {

  public static void main(String[] args) throws Exception {

    OptionsProcessor oproc = new OptionsProcessor();
    if (!oproc.process_stage1(args)) {
      System.out.println("Problem processing args");
      return;
    }

    CliSessionState ss = new CliSessionState(new HiveConf(SessionState.class));
    ss.out = new PrintStream(System.out, true, "UTF-8");
    ss.err = new PrintStream(System.err, true, "UTF-8");

    if (!oproc.process_stage2(ss)) {
      System.out.println("Problem with stage2");
      return;
    }

    SetProcessor sp = new SetProcessor();
    AddResourceProcessor ap = new AddResourceProcessor();

    Driver qp = new Driver();
    int ret = -1;
    int sret = -1;

    sret ="");
    if (sret != 0) {
      System.err.println("set processor failed");
    }

    sret =" mapred.reduce.tasks=1");
    if (sret != 0) {
      System.err.println("set processor failed");
    }

    ret ="CREATE TABLE IF NOT EXISTS raw_web_data_hour_status (status int, count int) " +
        " PARTITIONED BY ( log_date_part STRING, log_hour_part STRING )" +
        " ROW FORMAT DELIMITED" +
        " FIELDS TERMINATED BY '\037' " +
        " LINES TERMINATED BY '\012' ");
    if (ret != 0) {
      System.err.println("Create table problem");
    }

    DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd");
    DateFormat hourFormat = new SimpleDateFormat("kk");

    // roll the clock back one hour to build last hour's partition
    GregorianCalendar today = new GregorianCalendar();
    today.add(Calendar.HOUR_OF_DAY, -1);

    String theDate = dateFormat.format(today.getTime());
    String theHour = hourFormat.format(today.getTime());
    System.out.println("Generating Run For " + theDate + " " + theHour);

    ret =
        " insert overwrite table raw_web_data_hour_status " +
        " partition (log_date_part='" + theDate + "', log_hour_part='" + theHour + "') " +
        " select http_status, count(1) from raw_web_data_hour where " +
        " log_date_part='" + theDate + "' and log_hour_part='" + theHour + "' " +
        " group by (http_status)   ");
    Vector<String> res = new Vector<String>();
    while (qp.getResults(res)) {
      System.out.println("ResSize:" + res.size());
      for (String row : res) {
        System.out.println(row);
      }
      res.clear();
    }
  } // end main
} // end StatusHourBuilder

Looking at what I did, there are some upsides and some downsides.

Downsides:
1) I need a prepared-statement-like feature. Hand-building strings like
     log_date_part='"+theDate+"', log_hour_part='"+theHour+"'
   is error-prone.
2) I need more 'user friendly' features: some type of overlay so I need fewer imports and touch
less of the API.

Upsides:
1) All java
2) Can handle exceptions and return codes

Does anyone else see the need for a 'direct' java API? Or maybe we should document the components
to be used, in the style of my example?
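The prepared-statement-like feature could be sketched as a small helper that binds named parameters into a HiveQL template, so quoting and escaping live in one place instead of in string concatenation. HiveStatement and its ${name} placeholder syntax are hypothetical, not part of any Hive API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical prepared-statement-style helper: replaces each ${name}
// placeholder with a single-quoted, escaped value. A sketch of the idea
// only, not an existing Hive class.
public class HiveStatement {
  private final String template;

  public HiveStatement(String template) {
    this.template = template;
  }

  /** Replace every ${name} placeholder with its quoted, escaped value. */
  public String bind(Map<String, String> params) {
    String sql = template;
    for (Map.Entry<String, String> e : params.entrySet()) {
      String quoted = "'" + e.getValue().replace("'", "\\'") + "'";
      sql = sql.replace("${" + e.getKey() + "}", quoted);
    }
    return sql;
  }

  public static void main(String[] args) {
    HiveStatement stmt = new HiveStatement(
        "select http_status, count(1) from raw_web_data_hour "
        + "where log_date_part=${d} and log_hour_part=${h} "
        + "group by http_status");
    Map<String, String> p = new HashMap<String, String>();
    p.put("d", "2009-07-14");
    p.put("h", "13");
    System.out.println(stmt.bind(p));
  }
}
```

The caller never touches quote characters, which removes the error-prone "')" juggling from the query-building code above.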

> Embedded Hive SQL into Python
> -----------------------------
>                 Key: HIVE-599
>                 URL:
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
> While Hive does SQL, it would be very powerful to be able to embed that SQL in languages
like python in such a way that the hive query is also able to invoke python functions seamlessly.
One possibility is to explore integration with Dumbo. Another is to see if the internal
tool can be open sourced as a Hive contrib.
> Other thoughts?

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
