spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-25126) avoid creating OrcFile.Reader for all orc files
Date Thu, 23 Aug 2018 14:02:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-25126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Hyukjin Kwon reassigned SPARK-25126:
------------------------------------

    Assignee: Rao Fu

> avoid creating OrcFile.Reader for all orc files
> -----------------------------------------------
>
>                 Key: SPARK-25126
>                 URL: https://issues.apache.org/jira/browse/SPARK-25126
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.3.1
>            Reporter: Rao Fu
>            Assignee: Rao Fu
>            Priority: Minor
>             Fix For: 2.4.0
>
>
> We have a spark job that starts by reading orc files under an S3 directory and we noticed
the job consumes a lot of memory when both the number of orc files and the size of the file
are large. The memory bloat went away with the following workaround.
> 1) create a DataSet<Row> from a single orc file.
> Dataset<Row> rowsForFirstFile = spark.read().format("orc").load(oneFile);
> 2) when creating DataSet<Row> from all files under the directory, use the schema
from the previous DataSet.
> Dataset<Row> rows = spark.read().schema(rowsForFirstFile.schema()).format("orc").load(path);
> I believe the issue is due to the fact in order to infer the schema a FileReader is created
for each orc file under the directory although only the first one is used. The FileReader
creation loads the metadata of the orc file and the memory consumption is very high when there
are many files under the directory.
> The issue exists in both 2.0 and HEAD.
> In 2.0, OrcFileOperator.readSchema is used.
> [https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L95]
> In HEAD, OrcUtils.readSchema is used.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L82
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message