spark-user mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: Spark Sql behaves strangely with tables with a lot of partitions
Date Mon, 24 Aug 2015 18:12:38 GMT
I think we are mostly bottlenecked at this point by how fast we can make
listStatus calls to discover the folders.  That said, we are happy to
accept suggestions or PRs to make this faster.  Perhaps you can describe
how your home grown partitioning works?
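
To make the cost concrete, here is a minimal sketch of listing partition folders in parallel and caching the result. It is not Spark's implementation: `os.listdir` on a local directory stands in for a `listStatus` call, and `list_children`, `discover_leaf_dirs`, and `_listing_cache` are hypothetical names chosen for illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cache: table root -> sorted list of leaf partition directories.
_listing_cache = {}


def list_children(path):
    """One listStatus-style call: return the immediate subdirectories of path."""
    return sorted(
        os.path.join(path, name)
        for name in os.listdir(path)
        if os.path.isdir(os.path.join(path, name))
    )


def discover_leaf_dirs(root, max_workers=8):
    """Walk the tree level by level, listing each level's folders concurrently."""
    if root in _listing_cache:
        # Accessing the same table again reuses the first listing.
        return _listing_cache[root]
    leaves, frontier = [], [root]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            next_frontier = []
            # One listing call per directory in the current level.
            for path, children in zip(frontier, pool.map(list_children, frontier)):
                if children:
                    next_frontier.extend(children)
                else:
                    leaves.append(path)
            frontier = next_frontier
    _listing_cache[root] = sorted(leaves)
    return _listing_cache[root]
```

Even with the calls issued in parallel, the number of listing calls still grows with the number of partition folders, which is why the first discovery is the slow step and later accesses are not.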

On Sun, Aug 23, 2015 at 7:38 PM, Philip Weaver <philip.weaver@gmail.com>
wrote:

> 1 minute to discover 1000s of partitions -- yes, that is what I have
> observed. And I would assert that is very slow.
>
> On Sun, Aug 23, 2015 at 7:16 PM, Michael Armbrust <michael@databricks.com>
> wrote:
>
>> We should not be actually scanning all of the data of all of the
>> partitions, but we do need to at least list all of the available
>> directories so that we can apply your predicates to the actual values that
>> are present when we are deciding which files need to be read in a given
>> spark job.  While this is a somewhat expensive operation, we do it in
>> parallel and we cache this information when you access the same relation
>> more than once.
>>
>> Can you provide a little more detail about how exactly you are accessing
>> the parquet data (are you using sqlContext.read or creating persistent
>> tables in the metastore?), and how long it is taking?  It would also be
>> good to know how many partitions we are talking about and how much data is
>> in each.  Finally, I'd like to see the stacktrace where it is hanging to
>> make sure my above assertions are correct.
>>
>> We have several tables internally that have 1000s of partitions and while
>> it takes ~1 minute initially to discover the metadata, after that we are
>> able to query the data interactively.
>>
>>
>>
>> On Sun, Aug 23, 2015 at 2:00 AM, Jerrick Hoang <jerrickhoang@gmail.com>
>> wrote:
>>
>>> Does anybody have any suggestions?
>>>
>>> On Fri, Aug 21, 2015 at 3:14 PM, Jerrick Hoang <jerrickhoang@gmail.com>
>>> wrote:
>>>
>>>> Is there a workaround without updating Hadoop? Would really appreciate
>>>> if someone can explain what spark is trying to do here and what is an easy
>>>> way to turn this off. Thanks all!
>>>>
>>>> On Fri, Aug 21, 2015 at 11:09 AM, Raghavendra Pandey <
>>>> raghavendra.pandey@gmail.com> wrote:
>>>>
>>>>> Did you try Hadoop 2.7.1? s3a, which is available in 2.7, is known
>>>>> to work really well with Parquet. They fixed a lot of issues related
>>>>> to metadata reading there.
>>>>> On Aug 21, 2015 11:24 PM, "Jerrick Hoang" <jerrickhoang@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> @Cheng, Hao : Physical plans show that it got stuck on scanning S3!
>>>>>>
>>>>>> (table is partitioned by date_prefix and hour)
>>>>>> explain select count(*) from test_table where date_prefix='20150819' and hour='00';
>>>>>>
>>>>>> TungstenAggregate(key=[], value=[(count(1),mode=Final,isDistinct=false)]
>>>>>>  TungstenExchange SinglePartition
>>>>>>   TungstenAggregate(key=[], value=[(count(1),mode=Partial,isDistinct=false)]
>>>>>>    Scan ParquetRelation[ .. <about 1000 partition paths go here> ]
>>>>>>
>>>>>> Why does Spark have to scan all partitions when the query only
>>>>>> concerns one partition? Doesn't that defeat the purpose of partitioning?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Thu, Aug 20, 2015 at 4:12 PM, Philip Weaver <
>>>>>> philip.weaver@gmail.com> wrote:
>>>>>>
>>>>>>> I hadn't heard of spark.sql.sources.partitionDiscovery.enabled
>>>>>>> before, and I couldn't find much information about it online. What
>>>>>>> does it mean exactly to disable it? Are there any negative
>>>>>>> consequences to disabling it?
>>>>>>>
>>>>>>> On Wed, Aug 19, 2015 at 10:53 PM, Cheng, Hao <hao.cheng@intel.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Can you do some more profiling? I am wondering if the driver is
>>>>>>>> busy scanning HDFS / S3.
>>>>>>>>
>>>>>>>> For example: jstack <pid of driver process>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Also, it would be great if you could paste the physical plan for
>>>>>>>> the simple query.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *From:* Jerrick Hoang [mailto:jerrickhoang@gmail.com]
>>>>>>>> *Sent:* Thursday, August 20, 2015 1:46 PM
>>>>>>>> *To:* Cheng, Hao
>>>>>>>> *Cc:* Philip Weaver; user
>>>>>>>> *Subject:* Re: Spark Sql behaves strangely with tables with a lot
>>>>>>>> of partitions
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I cloned from TOT after the 1.5.0 cutoff. I noticed there were a
>>>>>>>> couple of CLs trying to speed up Spark SQL on tables with a huge
>>>>>>>> number of partitions. I've made sure that those CLs are included,
>>>>>>>> but it's still very slow.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao <hao.cheng@intel.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Yes, you can try setting
>>>>>>>> spark.sql.sources.partitionDiscovery.enabled to false.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> BTW, which version are you using?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hao
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *From:* Jerrick Hoang [mailto:jerrickhoang@gmail.com]
>>>>>>>> *Sent:* Thursday, August 20, 2015 12:16 PM
>>>>>>>> *To:* Philip Weaver
>>>>>>>> *Cc:* user
>>>>>>>> *Subject:* Re: Spark Sql behaves strangely with tables with a lot
>>>>>>>> of partitions
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I guess the question is why does Spark have to do partition
>>>>>>>> discovery with all partitions when the query only needs to look at
>>>>>>>> one partition? Is there a conf flag to turn this off?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver <
>>>>>>>> philip.weaver@gmail.com> wrote:
>>>>>>>>
>>>>>>>> I've had the same problem. It turns out that Spark (specifically
>>>>>>>> parquet) is very slow at partition discovery. It got better in 1.5
>>>>>>>> (not yet released), but was still unacceptably slow. Sadly, we ended
>>>>>>>> up reading parquet files manually in Python (via C++) and had to
>>>>>>>> abandon Spark SQL because of this problem.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang <
>>>>>>>> jerrickhoang@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I did a simple experiment with Spark SQL. I created a partitioned
>>>>>>>> parquet table with only one partition (date=20140701). A simple
>>>>>>>> `select count(*) from table where date=20140701` would run very fast
>>>>>>>> (0.1 seconds). However, as I added more partitions, the query took
>>>>>>>> longer and longer. When I added about 10,000 partitions, the query
>>>>>>>> took way too long. I feel like querying for a single partition
>>>>>>>> should not be affected by having more partitions. Is this a known
>>>>>>>> behaviour? What does Spark try to do here?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Jerrick
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>
>
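
The explanation quoted above (list every partition directory, then apply the query's predicates to the partition values found in the paths) can be sketched roughly as follows. This is an illustration under stated assumptions, not Spark's code: the bucket path is made up, and `parse_partition_values` and `prune_partitions` are hypothetical helper names.

```python
def parse_partition_values(root, leaf_dir):
    """Turn '<root>/date_prefix=20150819/hour=00' into a dict of column -> value."""
    relative = leaf_dir[len(root):].strip("/")
    values = {}
    for segment in relative.split("/"):
        if "=" in segment:
            column, value = segment.split("=", 1)
            values[column] = value
    return values


def prune_partitions(root, leaf_dirs, predicate):
    """Keep only the directories whose partition values satisfy the predicate."""
    return [d for d in leaf_dirs if predicate(parse_partition_values(root, d))]


# Hypothetical table location with two days of hourly partitions.
root = "s3a://bucket/test_table"
all_dirs = [
    f"{root}/date_prefix={day}/hour={hour:02d}"
    for day in ("20150818", "20150819")
    for hour in range(24)
]

# Equivalent of: where date_prefix='20150819' and hour='00'
selected = prune_partitions(
    root,
    all_dirs,
    lambda p: p["date_prefix"] == "20150819" and p["hour"] == "00",
)
print(selected)  # ['s3a://bucket/test_table/date_prefix=20150819/hour=00']
```

In this sketch, only the one selected directory would need its files read, but every directory still had to be listed first so that the predicate had values to test against. That listing step is the cost the thread is discussing.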
