spark-user mailing list archives

From swetha kasireddy <swethakasire...@gmail.com>
Subject Re: How to join an RDD with a hive table?
Date Mon, 15 Feb 2016 17:27:03 GMT
OK. Would it query Hive only for the records I want, as per the filter, or
would it load the entire table? My user table will have millions of records,
and I do not want to cause OOM errors by loading the entire table into memory.
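
For what it is worth, here is a minimal sketch of the DataFrame approach
discussed in this thread, assuming Spark 1.5 or later. The `Event` case class,
the `id` join column, and the `RddHiveJoin` object are hypothetical names
chosen for illustration; none of them comes from the thread. The join returns
only the rows whose ids match, and nothing is collected on the driver unless
you ask for it. The executors still scan the Hive table, though, unless a
filter or partition pruning narrows the read; calling `explain(true)` on the
result shows the plan Spark actually chose.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.broadcast

// Hypothetical record type for the small RDD side; the thread gives no schema.
case class Event(id: Long, action: String)

object RddHiveJoin {
  // Inner-joins a small RDD against a (possibly large) table DataFrame on "id".
  // Assumption: both sides have a column named "id".
  def joinSmallRddWithTable(sqlContext: SQLContext,
                            events: RDD[Event],
                            table: DataFrame): DataFrame = {
    import sqlContext.implicits._ // enables events.toDF()
    val eventsDF = events.toDF()
    // broadcast() hints that the small side should be shipped to every
    // executor, so the large table does not have to be shuffled for the join.
    table.join(broadcast(eventsDF), "id")
  }
}
```

With a HiveContext (which extends SQLContext in Spark 1.x) this could be
called as `RddHiveJoin.joinSmallRddWithTable(hiveContext, eventsRdd,
hiveContext.table("users"))`, where `users` is a placeholder table name.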

On Mon, Feb 15, 2016 at 12:51 AM, Mich Talebzadeh <mich@peridale.co.uk>
wrote:

> Also worthwhile using temporary tables for the join query.
>
> I can join a Hive table with any other JDBC-accessed table from any other
> database using DataFrames and temporary tables.
>
>
>
> //
> // Get the FACT table from Hive
> //
> val s = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM oraclehadoop.sales")
>
> //
> // Get the Dimension table from Oracle via JDBC
> //
> val c = HiveContext.load("jdbc",
>   Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
>       "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC FROM sh.channels)",
>       "user" -> "sh",
>       "password" -> "xxx"))
>
> // Register both DataFrames as temporary tables so they can be queried with SQL
> s.registerTempTable("t_s")
> c.registerTempTable("t_c")
>
>
>
> And do the join:
>
> // t_t (the time dimension) is not defined in this message; it is assumed
> // to have been registered as a temporary table in the same way.
> val sqltext = """
> SELECT rs.Month, rs.SalesChannel, round(TotalSales,2)
> FROM
> (
> SELECT t_t.CALENDAR_MONTH_DESC AS Month, t_c.CHANNEL_DESC AS SalesChannel,
>        SUM(t_s.AMOUNT_SOLD) AS TotalSales
> FROM t_s, t_t, t_c
> WHERE t_s.TIME_ID = t_t.TIME_ID
> AND   t_s.CHANNEL_ID = t_c.CHANNEL_ID
> GROUP BY t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC
> ORDER BY t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC
> ) rs
> LIMIT 1000
> """
>
> HiveContext.sql(sqltext).collect.foreach(println)
>
>
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *From:* Ted Yu [mailto:yuzhihong@gmail.com]
> *Sent:* 15 February 2016 08:44
> *To:* SRK <swethakasireddy@gmail.com>
> *Cc:* user <user@spark.apache.org>
> *Subject:* Re: How to join an RDD with a hive table?
>
>
>
> Have you tried creating a DataFrame from the RDD and joining it with the
> DataFrame that corresponds to the Hive table?
>
>
>
> On Sun, Feb 14, 2016 at 9:53 PM, SRK <swethakasireddy@gmail.com> wrote:
>
> Hi,
>
> How can I join an RDD with a Hive table and retrieve only the records that
> I am interested in? Suppose I have an RDD with 1000 records and a Hive
> table with 100,000 records. I should be able to join the RDD with the Hive
> table by an Id and load only those 1000 records from the Hive table, so
> that there are no memory issues. Also, I was planning on storing the data
> in Hive in the form of Parquet files. Any help on this is greatly
> appreciated.
>
> Thanks!
>
>
>
