spark-issues mailing list archives

From "Julien Baley (JIRA)" <>
Subject [jira] [Commented] (SPARK-13046) Partitioning looks broken in 1.6
Date Thu, 04 Feb 2016 00:04:39 GMT


Julien Baley commented on SPARK-13046:

Hi Davies,

I have no other file in the middle of the paths, we store everything after the fingerprint.

Could you perhaps try a structure closer to the one I described, with two key/value pairs and every file stored after those? The heterogeneity of your example may be what makes it work.

> Partitioning looks broken in 1.6
> --------------------------------
>                 Key: SPARK-13046
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Julien Baley
> Hello,
> I have a list of files in s3:
> {code}
> s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
> s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
> s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
> {code}
> Until 1.5.2, this all worked well: passing s3://bucket/some_path/ (the same for the three lines) would correctly identify 2 key/value pairs, one `date_received` and one `fingerprint`.
> From 1.6.0, I get the following exception:
> {code}
> assertion failed: Conflicting directory structures detected. Suspicious paths
> s3://bucket/some_path/date_received=2016-01-13
> s3://bucket/some_path/date_received=2016-01-14
> s3://bucket/some_path/date_received=2016-01-15
> {code}
> That is to say, the partitioning code now fails to identify date_received=2016-01-13 as a key/value pair.
> I can see that there has been some recent activity on spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala, so that seems related (especially the commits  and ).
> If I read the tests added in those commits correctly:
> - they don't seem to actually test the return value, only that it doesn't crash
> - they only test cases where the s3 path contains 1 key/value pair (which otherwise would catch the bug)
> This is problematic for us, as we're trying to migrate all of our Spark services to 1.6.0 and this bug is a real blocker. I know it's possible to force a 'union', but I'd rather not do that if the bug can be fixed.
> Any questions, please shoot.
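The behaviour described in the issue can be sketched as follows. This is a hypothetical Python illustration of key=value partition discovery, not Spark's actual implementation (which lives in PartitioningUtils.scala): each path's `key=value` segments are parsed into pairs, and discovery succeeds only if every path yields the same sequence of keys — the three paths above should all yield `date_received` and `fingerprint`.

```python
# Hypothetical sketch of key=value partition discovery. All function
# names here are illustrative, not Spark APIs.

def parse_partition_columns(path):
    """Extract (key, value) pairs from the 'key=value' segments of a path."""
    pairs = []
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            pairs.append((key, value))
    return pairs

def discover_partitions(paths):
    """Parse all paths; fail if they don't share one partition-key sequence."""
    parsed = [parse_partition_columns(p) for p in paths]
    key_sequences = {tuple(k for k, _ in pairs) for pairs in parsed}
    if len(key_sequences) != 1:
        raise AssertionError(
            "Conflicting directory structures detected. Suspicious paths:\n"
            + "\n".join(paths))
    return parsed

paths = [
    "s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d",
    "s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d",
    "s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d",
]
for pairs in discover_partitions(paths):
    print(pairs)
    # each path yields [('date_received', ...), ('fingerprint', ...)]
```

Under this reading, the bug report says that from 1.6.0 the real discovery code rejects even this homogeneous layout, raising the "Conflicting directory structures" assertion on paths that do share one key sequence.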

This message was sent by Atlassian JIRA
