spark-issues mailing list archives

From "Liang-Chi Hsieh (JIRA)" <>
Subject [jira] [Commented] (SPARK-9067) Memory overflow and open file limit exhaustion for NewParquetRDD+CoalescedRDD
Date Thu, 16 Jul 2015 09:25:04 GMT


Liang-Chi Hsieh commented on SPARK-9067:

Thanks for reporting that.

I updated the PR. In addition to calling close(), it now also releases the reader. Could you
check whether this solves the problem? Thanks.

> Memory overflow and open file limit exhaustion for NewParquetRDD+CoalescedRDD
> -----------------------------------------------------------------------------
>                 Key: SPARK-9067
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 1.3.0, 1.4.0
>         Environment: Target system: Linux, 16 cores, 400Gb RAM
> Spark is started locally using the following command:
> {{
> spark-submit --master local[16] --driver-memory 64G --executor-cores 16 --num-executors 1 --executor-memory 64G
> }}
>            Reporter: konstantin knizhnik
> If a coalesce transformation with a small number of output partitions (16 in my case) is
> applied to a large Parquet file (in my case about 150Gb with 215k partitions), it causes
> OutOfMemory exceptions (250Gb is not enough) and open file limit exhaustion (with the limit
> set to 8k).
> The source of the problem is in the SqlNewHadoopRDD.compute method:
> {quote}
>       val reader = format.createRecordReader(
>         split.serializableHadoopSplit.value, hadoopAttemptContext)
>       reader.initialize(split.serializableHadoopSplit.value, hadoopAttemptContext)
>       // Register an on-task-completion callback to close the input stream.
>       context.addTaskCompletionListener(context => close())
> {quote}
> The created Parquet file reader is intended to be closed at task completion time. This reader
> holds many references to parquet.bytes.BytesInput objects, which in turn hold references
> to large byte arrays (some of them several megabytes).
> Because a CoalescedRDD task completes only after processing a large number of Parquet
> files, readers accumulate until then, which causes file handle exhaustion and memory overflow.
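
The approach mentioned in the comment (close the reader and release the reference to it as soon as its split is exhausted, instead of waiting for task completion) can be sketched roughly as follows. This is only a minimal, self-contained illustration and not the code from the PR: FakeReader, EagerCloseIterator and isReleased are hypothetical names, and FakeReader merely imitates the nextKeyValue/getCurrentValue/close shape of a Hadoop RecordReader.

```scala
// Hypothetical stand-in for a Hadoop RecordReader; not a Spark or Parquet class.
final class FakeReader(records: Seq[String]) {
  private var pos = 0
  var closed = false

  // Advance to the next record; returns false once the input is exhausted.
  def nextKeyValue(): Boolean = {
    val has = pos < records.length
    if (has) pos += 1
    has
  }

  def getCurrentValue: String = records(pos - 1)

  def close(): Unit = { closed = true }
}

// Iterator that closes its reader and drops the reference as soon as the
// input is exhausted, rather than holding it until the whole task ends.
final class EagerCloseIterator(private var reader: FakeReader) extends Iterator[String] {
  private var havePair = false
  private var finished = false

  override def hasNext: Boolean = {
    if (!finished && !havePair) {
      finished = !reader.nextKeyValue()
      if (finished) close() // release resources immediately
      havePair = !finished
    }
    !finished
  }

  override def next(): String = {
    if (!hasNext) throw new NoSuchElementException("End of stream")
    havePair = false
    reader.getCurrentValue
  }

  // Idempotent, so it can also be registered as a task-completion callback.
  def close(): Unit = {
    if (reader != null) {
      reader.close()
      reader = null // let the buffers held by the reader be garbage collected
    }
  }

  def isReleased: Boolean = reader == null
}
```

With this shape, a task that walks through many splits one after another holds at most one open reader at a time, and the task-completion callback remains only as a safety net for iterators that are abandoned before being fully consumed.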

This message was sent by Atlassian JIRA
