hive-user mailing list archives

From Stephen Boesch <>
Subject Re: Is there a mechanism similar to hadoop -archive in hive (add archive is not apparently)
Date Thu, 20 Jun 2013 15:57:44 GMT
To demonstrate that this is not necessarily a path issue, but rather that
the archive is not being unpacked, I have created a zip file containing
a python script in its root directory.  The archive is added to hive and
then an attempt is made to invoke the python script within a transform
query. But we get a "file not found" from the map task, indicating that
the archive is not being exploded.

Show that the python script "" is resident in the *root* directory
of the zip file:
e$ jar -tvf | grep
 11241 Tue Jun 18 19:37:02 UTC 2013
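
The same check can be scripted. The following is a minimal sketch using only Python's standard zipfile module; the function name root_entries is made up for illustration and is not part of hive or hadoop.

```python
import zipfile


def root_entries(zip_path):
    """Return the names of entries stored at the top level of the zip.

    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(zip_path) as zf:
        # An entry is at the top level if its name has no "/" once a
        # trailing "/" (used for directory entries) is ignored.
        return [n for n in zf.namelist() if "/" not in n.rstrip("/")]
```

If the script name shows up in the returned list, it sits in the root of the archive rather than under a subdirectory.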

Add the archive to hive:
   hive> add archive /opt/am/ver/1.0/hive/;
   Added resource: /opt/am/ver/1.0/hive/

Run a transform query:

  hive>    from (select transform (aappname,qappname) using
'' as (aappname2 string, qappname2 string) from eqx ) o
insert overwrite table c select o.aappname2, o.qappname2;

Get an error:   ;)

Check the logs:

Caused by: Cannot run program "": error=2, No such file or directory
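
One way to see what the task actually has on disk, along the lines of the PATH/PWD experiment suggested in the quoted thread below, is a small transform script shipped with "add file". This is only a sketch: the function names and the limit of 50 entries are arbitrary choices for illustration. It writes PATH, the working directory and a listing of that directory to stderr, and passes the input rows through unchanged.

```python
import os
import sys


def describe_environment(max_entries=50):
    """Return lines describing PATH, the working directory and its entries."""
    cwd = os.getcwd()
    lines = ["PATH=" + os.environ.get("PATH", ""), "PWD=" + cwd]
    # List what is present next to the running script, so an unpacked
    # archive (or its absence) is visible in the output.
    for name in sorted(os.listdir(cwd))[:max_entries]:
        kind = "dir" if os.path.isdir(os.path.join(cwd, name)) else "file"
        lines.append("%s\t%s" % (kind, name))
    return lines


def main(stdin=sys.stdin, stdout=sys.stdout, stderr=sys.stderr):
    # Diagnostics go to stderr so they do not mix with the row output.
    for line in describe_environment():
        stderr.write(line + "\n")
    # Pass the input rows through unchanged so the query still gets output.
    for row in stdin:
        stdout.write(row)


if __name__ == "__main__":
    main()
```

Saved under a hypothetical name such as show_env.py, it could be registered with "add file show_env.py;" and invoked with using 'python show_env.py' in place of the failing script. The diagnostic lines should then appear in the stderr log of the map task, which would show whether the archive contents are present and under which directory name.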

2013/6/20 Stephen Boesch <>

> @Stephen:  given the  'relative' path for hive is from a local downloads
> directory on each local tasktracker in the cluster,  it was my thought that
> if the archive were actually being expanded then
> somedir/somefileinthearchive  should work.  I will go ahead and test this
> assumption.
> In the meantime, is there any facility available in hive for making
> archived files available to hive jobs?  archive or hadoop archive ("har")
> etc?
> 2013/6/20 Stephen Sprague <>
>> what would be interesting would be to run a little experiment and find
>> out what the default PATH is on your data nodes.  How much of a pain would
>> it be to run a little python script to print to stderr the value of the
>> environmental variable $PATH and $PWD (or the shell command 'pwd') ?
>> that's of course going through normal channels of "add file".
>> the thing is given you're using a relative path "hive/"  you
>> need to know what the "current directory" is when the process runs on the
>> data nodes.
>> On Thu, Jun 20, 2013 at 5:32 AM, Stephen Boesch <> wrote:
>>> We have a few dozen files that need to be made available to all
>>> mappers/reducers in the cluster while running  hive transformation steps .
>>> It seems the "add archive"  does not make the entries unarchived and
>>> thus available directly on the default file path - and that is what we are
>>> looking for.
>>> To illustrate:
>>>    add file modelfile.1;
>>>    add file modelfile.2;
>>>    ..
>>>     add file modelfile.N;
>>>   Then, our model that is invoked during the transformation step *does* have
>>> correct access to its model files in the default path.
>>> But .. those model files take a few *minutes* to all load..
>>> instead when we try:
>>>    add archive  modelArchive.tgz.
>>> The problem is the archive does not get exploded apparently ..
>>> I have an archive for example that contains shell scripts under the
>>> "hive" directory stored inside.  I am *not* able to access
>>> hive/  after adding the archive. Specifically the
>>> following fails:
>>> $ tar -tvf appm*.tar.gz | grep launch-quixey_to_xml
>>> -rwxrwxr-x stephenb/stephenb    664 2013-06-18 17:46
>>> appminer/bin/
>>> from (select transform (aappname,qappname)
>>> *using* '*hive/*' as (aappname2 string, qappname2 string)
>>> from eqx ) o insert overwrite table c select o.aappname2, o.qappname2;
>>> Cannot run program "hive/": error=2, No such file or directory
