spark-user mailing list archives

From KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
Subject Re: Spark hive overwrite is very very slow
Date Sun, 20 Aug 2017 16:47:12 GMT
Hi,

I first created the Hive table in Impala, with Parquet as the storage format.
From a Spark DataFrame I am trying to insert into the same table with the
syntax below.

The table is partitioned by year, month, day:
ds.write.mode(SaveMode.Overwrite).insertInto("db.parqut_table")
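
For illustration, here is a minimal sketch of this kind of write; it is not the actual job. It assumes a Hive-enabled SparkSession, that dynamic partitioning has to be switched on for the session, and a hypothetical source table `db.source_table`. The helper `alignToTable` is also made up for this example. Since `insertInto` resolves columns by position rather than by name, the sketch first reorders the DataFrame to the target table's column order.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

object OverwritePartitionedTable {

  // Hypothetical helper: reorder df to the target table's column order.
  // insertInto ignores column names and matches by position, and the
  // table schema is expected to list the partition columns
  // (year, month, day) last.
  def alignToTable(spark: SparkSession, df: DataFrame, table: String): DataFrame = {
    val tableCols = spark.table(table).columns
    df.select(tableCols.map(col): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("overwrite-partitioned-table")
      .enableHiveSupport()
      .getOrCreate()

    // Assumption: dynamic partition inserts are not already enabled
    // in the cluster configuration.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    // Hypothetical input; stands in for the transformed dataset `ds`.
    val ds = spark.table("db.source_table")

    alignToTable(spark, ds, "db.parqut_table")
      .write
      .mode(SaveMode.Overwrite)
      .insertInto("db.parqut_table")

    spark.stop()
  }
}
```

Whether the column reordering is needed depends on how `ds` was built; it is included only because a positional mismatch is easy to miss.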

https://issues.apache.org/jira/browse/SPARK-20049

I saw something in the link above, but I am not sure whether it is the same
issue in my case.

Thanks,
Asmath

On Sun, Aug 20, 2017 at 11:42 AM, Jörn Franke <jornfranke@gmail.com> wrote:

> Have you made sure that saveAsTable stores them as Parquet?
>
> On 20. Aug 2017, at 18:07, KhajaAsmath Mohammed <mdkhajaasmath@gmail.com>
> wrote:
>
> We are using Parquet tables. Is that causing any performance issue?
>
> On Sun, Aug 20, 2017 at 9:09 AM, Jörn Franke <jornfranke@gmail.com> wrote:
>
>> Hive performance can also be improved by switching to Tez+LLAP as the
>> engine.
>> Aside from this: you need to check which default format it writes to
>> Hive. One cause of slow writes into a Hive table could be that it writes
>> to CSV/gzip or CSV/bzip2 by default.
>>
>> > On 20. Aug 2017, at 15:52, KhajaAsmath Mohammed <mdkhajaasmath@gmail.com> wrote:
>> >
>> > Yes, we tried Hive and want to migrate to Spark for better performance.
>> > I am using Parquet tables. Still no better performance while loading.
>> >
>> > Sent from my iPhone
>> >
>> >> On Aug 20, 2017, at 2:24 AM, Jörn Franke <jornfranke@gmail.com> wrote:
>> >>
>> >> Have you checked how it performs directly in Hive?
>> >>
>> >> In which format do you expect Hive to write? Have you made sure it is
>> >> in this format? It could be that you use an inefficient format (e.g.
>> >> CSV + bzip2).
>> >>
>> >>> On 20. Aug 2017, at 03:18, KhajaAsmath Mohammed <mdkhajaasmath@gmail.com> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> I have written a Spark SQL job on Spark 2.0 using Scala. It just
>> >>> pulls the data from a Hive table, adds extra columns, removes
>> >>> duplicates, and then writes it back to Hive.
>> >>>
>> >>> In the Spark UI, it is taking almost 40 minutes to write 400 GB of
>> >>> data. Is there anything I can do to improve performance?
>> >>>
>> >>> Spark.sql.partitions is 2000 in my case, with executor memory of 16 GB
>> >>> and dynamic allocation enabled.
>> >>>
>> >>> I am doing an insert overwrite by partition with:
>> >>> ds.write.mode(SaveMode.Overwrite).insertInto(table)
>> >>>
>> >>> Any suggestions, please?
>> >>>
>> >>> Sent from my iPhone
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>> >>>
>>
>
>
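
The storage-format question raised in the quoted replies can be checked from Spark itself. Below is a minimal sketch, assuming a Hive-enabled SparkSession and the table name used earlier in the thread. The exact layout of the output differs between Spark versions, so the sketch prints everything rather than filtering for specific rows.

```scala
import org.apache.spark.sql.SparkSession

object CheckTableFormat {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("check-table-format")
      .enableHiveSupport()
      .getOrCreate()

    // Print the metastore's description of the table without truncating
    // long values; the storage section should name the input/output
    // format and SerDe (Parquet classes for a Parquet table).
    spark.sql("DESCRIBE FORMATTED db.parqut_table")
      .show(200, truncate = false)

    spark.stop()
  }
}
```

If the storage section does not show Parquet classes, the table is not stored the way the thread assumes.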
