hive-user mailing list archives

From Ray Navarette <>
Subject RE: Resources/Distributed Cache on Spark
Date Wed, 14 Feb 2018 19:21:29 GMT
Sorry for the resend, but does anyone know who I might best talk to about this?  Would it be
worthwhile to bring this question to the dev list?

Thanks again for the help,

From: Ray Navarette []
Sent: Thursday, February 8, 2018 6:33 PM
Subject: RE: Resources/Distributed Cache on Spark

Without using add files, we’d have to make sure these resources exist on every node, and
would configure a Hive session like this:
set myCustomProperty=/path/to/directory/someSubDir/;
select myCustomUDF('param1', 'param2');

With the shared resources, we can do this instead, at least with the MR engine:
add files file:///path/to/directory;
set myCustomProperty=someSubDir/;
select myCustomUDF('param1', 'param2');

In both cases, the property myCustomProperty is accessed inside the custom UDF, interpreted
as a path, and used to read the content of a file within “someSubDir”.  This works fine
whenever we have the full path, or with the relative path on the MR engine when using add
resources.  I’m wondering if perhaps I’m getting lucky in that the MR engine is downloading
the files to the task working directory, so the relative path resolves properly there, but
some different behavior is happening in Spark?  I could give a full path if I knew ahead
of time where the file will be available on the remote node, ideally exposed via a property.
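
To make that concrete, the shape of the UDF is roughly the sketch below (the class name,
the "resource.txt" file name, and the exact error handling are illustrative placeholders,
not our actual code; only myCustomUDF and myCustomProperty come from the session examples
above):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.hive.ql.exec.MapredContext;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

// Illustrative sketch only, not our production UDF.
public class MyCustomUDF extends GenericUDF {
  private String resourceDir;

  @Override
  public void configure(MapredContext context) {
    // The session-level "set myCustomProperty=..." value arrives via the job conf.
    // It may be absolute, or relative to the task's working directory when the
    // files were shipped with "add files".
    resourceDir = context.getJobConf().get("myCustomProperty");
  }

  @Override
  public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
    return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] args) throws HiveException {
    // On MR the distributed cache symlinks added files into the working
    // directory, so a relative resourceDir resolves; under Spark this read
    // is where we see the file-unavailable error.
    try {
      return new String(Files.readAllBytes(Paths.get(resourceDir, "resource.txt")));
    } catch (IOException e) {
      throw new HiveException("Could not read resource under " + resourceDir, e);
    }
  }

  @Override
  public String getDisplayString(String[] children) {
    return "myCustomUDF(" + String.join(", ", children) + ")";
  }
}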

Thanks for the quick response and your help with this.


From: Sahil Takiar []
Sent: Thursday, February 8, 2018 12:45 PM
Subject: Re: Resources/Distributed Cache on Spark

It should work. We have tests, such as groupby_bigdata.q, that run on HoS and use the "add
file" command. What are the exact commands you are running? What error are you seeing?

On Thu, Feb 8, 2018 at 6:28 AM, Ray Navarette <> wrote:

I’m hoping to find some information about using “ADD FILES <PATH>” when using
the Spark execution engine.  I’ve seen some JIRA tickets reference this functionality, but
little else.  We have written some custom UDFs which require some external resources.  When
using the MR execution engine, we can reference the file paths using a relative path and they
are properly distributed and resolved.  When I try to do the same under the Spark engine, I
receive an error saying the file is unavailable.

Does “ADD FILES <PATH>” work on Spark, and if so, how should I properly reference
those files in order to read them in the executors?
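
One idea I’ve considered, purely as a sketch, is a fallback lookup on the executor side.  It
assumes (and I have not verified) that Hive on Spark distributes “add files” resources via
SparkContext.addFile, in which case they would land under the SparkFiles root rather than the
task working directory that MR’s distributed cache uses:

import java.io.File;
import org.apache.spark.SparkFiles;

// Hypothetical fallback resolution; the SparkFiles branch rests on the
// unverified assumption that HoS ships "add files" resources through
// SparkContext.addFile.
public final class ResourceLocator {
  private ResourceLocator() {}

  public static File locate(String relativePath) {
    // 1) MR behavior: the distributed cache symlinks added files into the
    //    task working directory, so the relative path resolves directly.
    File local = new File(relativePath);
    if (local.exists()) {
      return local;
    }
    // 2) Possible Spark behavior: files staged under the SparkFiles root.
    File sparkLocal = new File(SparkFiles.getRootDirectory(), relativePath);
    if (sparkLocal.exists()) {
      return sparkLocal;
    }
    throw new IllegalStateException("Resource not found: " + relativePath);
  }
}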

Thanks much for your help,

Sahil Takiar
Software Engineer<> | (510) 673-0309