spark-user mailing list archives

From Silvio Fiorito <silvio.fior...@granturing.com>
Subject Re: Using HiveContext.set in multiple threads
Date Tue, 24 May 2016 12:11:51 GMT
If you’re using the DataFrame API, you can achieve that by simply calling (or not calling) the "partitionBy"
method on the DataFrameWriter:

val originalDf = ….

val df1 = originalDf….
val df2 = originalDf…

df1.write.partitionBy("col1").save(…)

df2.write.save(…)
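
For context, a minimal, self-contained sketch of this approach against the Spark 1.6-era HiveContext API; the source table, output paths, and transformations are placeholders, not taken from the thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext

// Placeholder setup; in the thread the DataFrame comes from a long chain of transformations.
val sc = new SparkContext(new SparkConf().setAppName("partitioned-writes"))
val hiveContext = new HiveContext(sc)

val originalDf: DataFrame = hiveContext.table("source_table")   // stand-in for the real pipeline
originalDf.cache()                                               // computed once, reused by both writes

val df1 = originalDf.filter("col1 IS NOT NULL")                  // placeholder transformation
val df2 = originalDf.groupBy("col2").count()                     // placeholder transformation

// partitionBy declares the partition columns on the write itself, so neither
// write depends on the session-wide hive.exec.dynamic.partition setting.
df1.write.partitionBy("col1").mode("overwrite").parquet("/tmp/out/partitioned")
df2.write.mode("overwrite").parquet("/tmp/out/unpartitioned")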

From: Amir Gershman <amirg@fb.com>
Date: Tuesday, May 24, 2016 at 7:01 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Using HiveContext.set in multipul threads

Hi,

I have a DataFrame I compute from a long chain of transformations.
I cache it, and then perform two additional transformations on it.
I use two Futures - each Future inserts the content of one of the resulting DataFrames into a
different Hive table.
One Future must SET hive.exec.dynamic.partition=true and the other must set it to false.

How can I run both INSERT commands in parallel, but guarantee each runs with its own settings?

If I don't use the same HiveContext, then the cached result of the initial long chain of
transformations is not reusable between HiveContexts. If I use the same HiveContext, race conditions
between threads may cause one INSERT to execute with the wrong config.
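
For illustration, a rough sketch of the pattern described above, with placeholder table and view names. Because both futures share the one HiveContext, the two setConf calls race, and whichever runs last determines the value both INSERTs actually see:

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Placeholder setup and names; only the shape of the problem matters here.
val sc = new SparkContext(new SparkConf().setAppName("parallel-inserts"))
val hiveContext = new HiveContext(sc)

val insertA = Future {
  // Intended to run with dynamic partitioning enabled...
  hiveContext.setConf("hive.exec.dynamic.partition", "true")
  hiveContext.sql("INSERT INTO TABLE db.table_a PARTITION (col1) SELECT * FROM cached_view_a")
}
val insertB = Future {
  // ...but this call can overwrite the shared setting before insertA's query starts.
  hiveContext.setConf("hive.exec.dynamic.partition", "false")
  hiveContext.sql("INSERT INTO TABLE db.table_b SELECT * FROM cached_view_b")
}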
