spark-user mailing list archives

From ayan guha <guha.a...@gmail.com>
Subject Re: Documentation on "Automatic file coalescing for native data sources"?
Date Fri, 19 May 2017 23:15:59 GMT
I think that, like all other read operations, this is driven by the input
format used, and I think some variation of a combine file input format is used
by default. You can test this by forcing a particular input format that yields
one file per split; you should then end up with the same number of partitions
as you have data files.
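
To make the idea concrete, here is a minimal Python sketch of size-based
packing of small files into partitions. It is an assumption about the
mechanism, not a confirmed description: it is modelled on my understanding of
how Spark 2.x plans scans for its file-based sources using the settings
spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes, rather
than on a combine-style input format. plan_partitions and its parameters are
hypothetical names for illustration, not a Spark API.

```python
MB = 1024 * 1024


def plan_partitions(file_sizes, max_partition_bytes=128 * MB,
                    open_cost_bytes=4 * MB, default_parallelism=8):
    """Pack files (given as sizes in bytes) into partitions.

    Hypothetical helper: an approximation of size-based file packing,
    not code taken from Spark itself.
    """
    # Every file is charged a fixed "open cost" on top of its size.
    total_bytes = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    max_split_bytes = min(max_partition_bytes,
                          max(open_cost_bytes, bytes_per_core))

    # Cut files larger than the target into chunks of at most max_split_bytes.
    chunks = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            chunks.append(min(max_split_bytes, size - offset))
            offset += max_split_bytes

    # Largest chunks first, then fill each partition until the next chunk
    # would push it past the target size.
    chunks.sort(reverse=True)
    partitions, current, current_size = [], [], 0
    for chunk in chunks:
        if current and current_size + chunk > max_split_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(chunk)
        current_size += chunk + open_cost_bytes
    if current:
        partitions.append(current)
    return partitions


if __name__ == "__main__":
    small_files = [1 * MB] * 1000
    # Default settings: many small files share each partition.
    print(len(plan_partitions(small_files)))
    # Open cost as large as the target size: one file per partition.
    print(len(plan_partitions(small_files, open_cost_bytes=128 * MB)))
```

In this sketch, raising the open cost to the partition target size leaves no
room for a second file, so the partition count equals the file count. That is
the same "one file per split" check suggested above, done by tuning sizes
instead of swapping the input format.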

On Sat, 20 May 2017 at 5:12 am, Aakash Basu <aakash.spark.raj@gmail.com>
wrote:

> Hey all,
>
> A reply on this would be great!
>
> Thanks,
> A.B.
>
> On 17-May-2017 1:43 AM, "Daniel Siegmann" <dsiegmann@securityscorecard.io>
> wrote:
>
>> When using spark.read on a large number of small files, these are
>> automatically coalesced into fewer partitions. The only documentation I can
>> find on this is in the Spark 2.0.0 release notes, where it simply says (
>> http://spark.apache.org/releases/spark-release-2-0-0.html):
>>
>> "Automatic file coalescing for native data sources"
>>
>> Can anyone point me to documentation explaining what triggers this
>> feature, how it decides how many partitions to coalesce to, and what counts
>> as a "native data source"? I couldn't find any mention of this feature in
>> the SQL Programming Guide and Google was not helpful.
>>
>> --
>> Daniel Siegmann
>> Senior Software Engineer
>> *SecurityScorecard Inc.*
>> 214 W 29th Street, 5th Floor
>> New York, NY 10001
>>
>> --
Best Regards,
Ayan Guha
