hadoop-hive-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HIVE-467) Scratch data location should be on different filesystems for different types of intermediate data
Date Sat, 16 May 2009 04:14:45 GMT

     [ https://issues.apache.org/jira/browse/HIVE-467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma updated HIVE-467:
-----------------------------------

    Attachment: hive-467.4.patch

It would be great if someone could review the rest of the patch. Some javadocs were missing
on the new public APIs - attaching another one.

The changes are relatively trivial - they just consolidate all the tmp-file logic in one place
and provide four public API calls from Context.java:

  public boolean isMRTmpFileURI(String uriStr)
  public String getMRTmpFileURI() 
  public String getLocalTmpFileURI()   
  public String getExternalTmpFileURI(URI extURI)
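
For illustration, a hedged usage sketch - the Context constructor and the variable names
here are assumptions, not taken from the patch:

  // Hypothetical usage; constructor signature assumed, not from the patch.
  Context ctx = new Context(conf);
  String localTmp = ctx.getLocalTmpFileURI();          // local FS, e.g. for DDL output
  String mrTmp    = ctx.getMRTmpFileURI();             // default FS (typically HDFS)
  String moveTmp  = ctx.getExternalTmpFileURI(tblURI); // same FS as the table location
  assert ctx.isMRTmpFileURI(mrTmp);                    // recognizes its own MR tmp URIs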

The semantics are obvious from the names. Context.java is also simplified so that it is no
longer a reusable object. There are a small number of indent-only changes in
BaseSemanticAnalyzer.java, and quite a bit of code reduction across the different parts of
the compiler that previously had their own tmp-file allocation logic.



> Scratch data location should be on different filesystems for different types of intermediate data
> --------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-467
>                 URL: https://issues.apache.org/jira/browse/HIVE-467
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>         Environment: S3/EC2
>            Reporter: Joydeep Sen Sarma
>            Assignee: Joydeep Sen Sarma
>         Attachments: hive-467.3.patch, hive-467.4.patch, hive-467.patch.1, hive-467.patch.2
>
>
> Currently Hive uses the same scratch directory/path for all sorts of temporary and intermediate data. This is problematic:
> 1. The temporary location for writing out DDL output should just be a temp file on the local file system. This removes the dependence of metadata and browsing operations on a functioning Hadoop cluster.
> 2. The temporary location for intermediate map-reduce data should be the default file system (which is typically the HDFS instance on the compute cluster).
> 3. The temporary location for data that needs to be 'moved' into tables should be on the same file system as the table's location (which may not be the same as the HDFS instance of the processing cluster).
> I.e., local storage, map-reduce intermediate storage, and table storage should be distinguished. Without this distinction, using Hive in environments like S3/EC2 causes problems. In such an environment, I would like to be able to:
> - do metadata operations without a provisioned Hadoop cluster (using data stored in S3 and a metastore on local disk)
> - attach to a provisioned Hadoop cluster and run queries
> - store data back into tables that are created over the S3 file system
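
To make the three-way distinction concrete, here is a minimal sketch of how the locations
might be resolved with the Hadoop FileSystem API - the class, method names, and the
scratch-dir parameter are assumptions for illustration, not Hive's actual implementation:

  import java.io.File;
  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ScratchLocations {
    // 1. DDL output: a plain local temp file; no Hadoop cluster involved.
    static File localTmp() throws Exception {
      return File.createTempFile("hive-ddl-", ".out");
    }

    // 2. Map-reduce intermediates: the default filesystem (typically HDFS).
    static Path mrTmp(Configuration conf, String scratchDir) throws Exception {
      FileSystem fs = FileSystem.get(conf);
      return fs.makeQualified(new Path(scratchDir, "mr-" + System.nanoTime()));
    }

    // 3. Data to be moved into a table: the filesystem of the table's own
    //    location (e.g. S3), so the final move stays within one filesystem.
    static Path externalTmp(Configuration conf, URI tableLocation) throws Exception {
      FileSystem fs = FileSystem.get(tableLocation, conf);
      return fs.makeQualified(new Path(new Path(tableLocation), ".tmp-" + System.nanoTime()));
    }
  }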

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

