spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Nicholas Chammas (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-18367) limit() makes the lame walk again
Date Wed, 09 Nov 2016 23:46:58 GMT

    [ https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15652406#comment-15652406
] 

Nicholas Chammas commented on SPARK-18367:
------------------------------------------

I've spent the day trying to narrow down what is happening, but I haven't made much progress.
All I've really found is that adding a {{coalesce()}} at the "right" place in the query tree
(dunno if that's the right terminology) can reduce the number of open files enough that things
succeed, like how the {{limit()}} is helping things. Maybe Spark just needs that many files
and there is no issue here? I dunno.

Is there a rough rule of thumb I can use to determine how many files Spark should be opening?
Just a rough way to determine the order of magnitude of open files based on something I'm
doing. In my case, I have a DataFrame with no more than a few partitions that I'm applying
a UDF to and then joining to itself twice. The resulting DataFrame has no more than a dozen
or so partitions.

Is it conceivable that this would somehow make Spark spawn more than 10K files, which is the
maximum number of files macOS will allow open per process? Even if I coalesce the DataFrame
to 1 partition I still see Spark spawn up to 7K files. Is this normal?

> limit() makes the lame walk again
> ---------------------------------
>
>                 Key: SPARK-18367
>                 URL: https://issues.apache.org/jira/browse/SPARK-18367
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.1, 2.1.0
>         Environment: Python 3.5, Java 8
>            Reporter: Nicholas Chammas
>         Attachments: plan-with-limit.txt, plan-without-limit.txt
>
>
> I have a complex DataFrame query that fails to run normally but succeeds if I add a dummy
{{limit()}} upstream in the query tree.
> The failure presents itself like this:
> {code}
> ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file
/private/var/folders/f5/t48vxz555b51mr3g6jjhxv400000gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
> java.io.FileNotFoundException: /private/var/folders/f5/t48vxz555b51mr3g6jjhxv400000gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
(Too many open files in system)
> {code}
> My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on macOS. However,
I don't think that's the issue, since if I add a dummy {{limit()}} early on the query tree
-- dummy as in it does not actually reduce the number of rows queried -- then the same query
works.
> I've diffed the physical query plans to see what this {{limit()}} is actually doing,
and the diff is as follows:
> {code}
> diff plan-with-limit.txt plan-without-limit.txt
> 24,28c24
> <    :                          :     :     +- *GlobalLimit 1000000
> <    :                          :     :        +- Exchange SinglePartition
> <    :                          :     :           +- *LocalLimit 1000000
> <    :                          :     :              +- *Project [...]
> <    :                          :     :                 +- *Scan orc [...] Format:
ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
> ---
> >    :                          :     :     +- *Scan orc [...] Format: ORC, InputPaths:
file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
> 49,53c45
> <                               :     :     +- *GlobalLimit 1000000
> <                               :     :        +- Exchange SinglePartition
> <                               :     :           +- *LocalLimit 1000000
> <                               :     :              +- *Project [...]
> <                               :     :                 +- *Scan orc [...] Format:
ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
> ---
> >                               :     :     +- *Scan orc [] Format: ORC, InputPaths:
file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
> {code}
> Does this give any clues as to why this {{limit()}} is helping? Again, the 1000000 limit
you can see in the plan is much higher than the cardinality of the dataset I'm reading, so
there is no theoretical impact on the output. You can see the full query plans attached to
this ticket.
> Unfortunately, I don't have a minimal reproduction for this issue, but I can work towards
one with some clues.
> I'm seeing this behavior on 2.0.1 and on master at commit {{26e1c53aceee37e3687a372ff6c6f05463fd8a94}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message