spark-issues mailing list archives

From "Christian Homberg (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-27966) input_file_name empty when listing files in parallel
Date Thu, 06 Jun 2019 15:20:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Homberg updated SPARK-27966:
--------------------------------------
    Description: 
I ran into an issue similar to, and probably related to, SPARK-26128: _org.apache.spark.sql.functions.input_file_name_ sometimes returns an empty string.

 
{code:java}
df.select(input_file_name()).show(5,false)
{code}
 
{code:java}
+-----------------+
|input_file_name()|
+-----------------+
| |
| |
| |
| |
| |
+-----------------+
{code}
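For reference, a minimal sketch of a setup that reproduces this (the path and file count are hypothetical; any source with more than 32 leaf paths pushes InMemoryFileIndex over the default parallel-listing threshold):
{code:java}
import org.apache.spark.sql.functions.input_file_name

// Hypothetical path; a directory with more than 32 leaf paths triggers
// parallel listing (spark.sql.sources.parallelPartitionDiscovery.threshold
// defaults to 32)
val df = spark.read.parquet("/mnt/data/events")

// input_file_name() should return the source file of each row,
// but comes back empty when the files were listed in parallel
df.select(input_file_name()).show(5, false)
{code}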
My environment is Databricks, and debugging the Log4j output showed that the issue occurs when files are listed in parallel, e.g. when:
{code:java}
19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of
Paths: 127; threshold: 32
19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories in parallel under:{code}
 

Everything is fine as long as the number of paths stays below the threshold and files are listed sequentially:
{code:java}
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of
Paths: 6; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of
Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of
Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of
Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of
Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of
Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of
Paths: 0; threshold: 32
{code}
 

Setting {{spark.sql.sources.parallelPartitionDiscovery.threshold}} to 9999 resolves the issue for me.
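As a workaround sketch, the threshold can be raised at runtime before reading, so the listing never goes parallel (9999 is arbitrary, just larger than any expected path count):
{code:java}
// Raise the threshold above the number of input paths so that
// InMemoryFileIndex lists files on the driver instead of in parallel
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "9999")
{code}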

 

*edit: the problem is not exclusively linked to listing files in parallel. I've set up a larger cluster on which input_file_name did return the correct filename even after parallel file listing. After inspecting the Log4j output again, I assume it's linked to some kind of MetaStore being full. I've attached a section of the Log4j output that should indicate why it's failing. If you need more, please let me know.*

 

 

> input_file_name empty when listing files in parallel
> ----------------------------------------------------
>
>                 Key: SPARK-27966
>                 URL: https://issues.apache.org/jira/browse/SPARK-27966
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.4.0
>         Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
>  
> Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
> Workers: 3
> Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
>            Reporter: Christian Homberg
>            Priority: Minor
>         Attachments: input_file_name_bug
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

