spark-issues mailing list archives

From "koert kuipers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-11441) HadoopFsRelation is not scalable in number of files read/written
Date Tue, 17 Nov 2015 22:17:11 GMT

    [ https://issues.apache.org/jira/browse/SPARK-11441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15009690#comment-15009690 ]

koert kuipers commented on SPARK-11441:
---------------------------------------

One more place where the cache of FileStatus objects for all the part files is used is:

def sizeInBytes: Long

It seems that providing this information (the total size in bytes) forces the driver to
visit all the part files.
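
For illustration, roughly what that computation amounts to, given a cache of FileStatus
objects (names here are illustrative, not the actual HadoopFsRelation internals):

    import org.apache.hadoop.fs.FileStatus

    // Illustrative only: the total size can be answered from the cached
    // statuses, but filling that cache means stat-ing every part file
    // on the driver first.
    def sizeInBytes(cachedStatuses: Seq[FileStatus]): Long =
      cachedStatuses.map(_.getLen).sum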

> HadoopFsRelation is not scalable in number of files read/written
> ----------------------------------------------------------------
>
>                 Key: SPARK-11441
>                 URL: https://issues.apache.org/jira/browse/SPARK-11441
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.1
>            Reporter: koert kuipers
>
> HadoopFsRelation includes a fileStatusCache which holds information on all the data files
> (part files) for the data source in the driver program.
> It is not unusual to be reading from 100k+ or even 1 million part files, in which case
> filling up this cache will take a very long time (days?) and require a lot of memory. See
> for example:
> https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201510.mbox/%3CCAG+ckK-FvWK=1B2JqC4S+zaz2zVkQehvoS9Myo0ssMGPC5-3SQ@mail.gmail.com%3E
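
For a sense of the work involved, this is roughly the driver-side listing the cache
implies, sketched with plain Hadoop FileSystem calls (not the actual Spark code):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileStatus, Path}

    // Sketch of the driver-side listing the cache implies: a namenode RPC
    // per directory, and a FileStatus held in driver memory for every
    // single part file.
    def listAllPartFiles(root: Path, conf: Configuration): Seq[FileStatus] = {
      val fs = root.getFileSystem(conf)
      val (dirs, files) = fs.listStatus(root).toSeq.partition(_.isDirectory)
      files ++ dirs.flatMap(d => listAllPartFiles(d.getPath, conf))
    }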
> This is not the kind of behavior you would expect of a driver program. Also,
> HadoopFsRelation passes this large list of part files into:
> def buildScan(inputFiles: Array[FileStatus]): RDD[Row]
> Almost all implementations of HadoopFsRelation do the following inside buildScan:
> FileInputFormat.setInputPaths(job, inputFiles.map(_.getPath): _*)
> This means an array of potentially millions of items now gets stored in the JobConf, which
> will be broadcast. I have not found any errors about this on the mailing list, but I
> believe this is simply because nobody with a large number of inputFiles has gotten this far.
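
To make the scaling issue concrete, a sketch of what buildScan implementations effectively
do versus a per-directory alternative (plain Hadoop mapreduce API; illustrative, not the
actual implementations):

    import org.apache.hadoop.fs.{FileStatus, Path}
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

    // Today: one Path per part file is serialized into the job
    // configuration that gets broadcast.
    def setInputsPerFile(job: Job, inputFiles: Array[FileStatus]): Unit =
      FileInputFormat.setInputPaths(job, inputFiles.map(_.getPath): _*)

    // The scalable shape: pass only the data directories and let the
    // InputFormat enumerate part files when splits are computed.
    def setInputsPerDirectory(job: Job, dataPaths: Seq[Path]): Unit =
      FileInputFormat.setInputPaths(job, dataPaths: _*)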
> Generally, when using Hadoop InputFormats, there should never be a need to list all the
> part files driver side. It seems the reason it is done here is to facilitate a process in
> ParquetRelation, driver side, that creates a merged data schema. I wonder if it's really
> necessary to look at all the part files for this, or if some assumption can be made that
> at least all the part files in a directory have the same schema (which would reduce the
> size of the problem by a factor of 100 or so).
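
If that assumption holds, the merge only needs to read one part file per directory. A
sketch, with readSchema and merge as hypothetical stand-ins for whatever the data source
actually does per file (neither is a Spark API):

    import org.apache.hadoop.fs.FileStatus

    // readSchema and merge are hypothetical placeholders, not Spark APIs.
    def mergedSchema[S](files: Seq[FileStatus],
                        readSchema: FileStatus => S,
                        merge: (S, S) => S): S =
      files.groupBy(_.getPath.getParent)        // one group per data directory
           .values.map(g => readSchema(g.head)) // read a single file per group
           .reduce(merge)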
> At the very least, it seems to me that the caching of files is Parquet-specific and does
> not belong in HadoopFsRelation. And buildScan should just use the data paths (so
> directories, if one wants to read all part files in a directory) as it did before
> SPARK-7673 / PR #6225.
> I ran into this issue myself with spark-avro, which also does not handle the input of
> part files in buildScan well. spark-avro actually tries to create an RDD (and JobConf
> broadcast) per part file, which is not scalable even for 1k part files. Note that it is
> difficult for spark-avro to create an RDD per data directory (as it probably should),
> since the dataPaths have been lost now that inputFiles is passed into buildScan instead.
> To me this again confirms that the change in buildScan is troubling.
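
For contrast, the two shapes sketched with plain hadoopFile calls (TextInputFormat here is
just illustrative; this is not spark-avro's actual code):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // What spark-avro effectively does: one RDD (and one broadcast job
    // configuration) per part file, then a union over all of them.
    def rddPerFile(sc: SparkContext, files: Seq[String]): RDD[(LongWritable, Text)] =
      sc.union(files.map(f => sc.hadoopFile[LongWritable, Text, TextInputFormat](f)))

    // The scalable shape: one RDD over the data directories, letting the
    // InputFormat list the part files itself.
    def rddPerDirs(sc: SparkContext, dirs: Seq[String]): RDD[(LongWritable, Text)] =
      sc.hadoopFile[LongWritable, Text, TextInputFormat](dirs.mkString(","))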



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

