spark-issues mailing list archives

From "Mitchell (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-23420) Datasource loading not handling paths with regex chars.
Date Fri, 16 Feb 2018 16:15:02 GMT

    [ https://issues.apache.org/jira/browse/SPARK-23420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367529#comment-16367529
] 

Mitchell commented on SPARK-23420:
----------------------------------

Yes, I agree there appears to be no way currently for a user to distinguish a path that should
be treated literally from one that should be treated as a glob. I think we need either two
separate methods for specifying the path, or an option that specifies how it should be treated.
Having files or paths with these characters in them probably isn't common, but it is possible,
and it should be supported.
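
Until such an option exists, one possible workaround is for the caller to backslash-escape
glob metacharacters before handing the path to the reader. The sketch below shows only the
escaping step. Hadoop's FileSystem.globStatus documents \c as removing the special meaning of
character c, but whether Spark 2.2.1 then resolves the escaped path end to end has not been
verified here. GlobEscape and escapeGlob are hypothetical names, and the set of metacharacters
is an assumption.

```java
final class GlobEscape {
    // Assumed set of characters with special meaning in Hadoop-style globs.
    // Parentheses are left out on purpose: the stack trace in this issue shows
    // them already escaped ("\(\)") by the time the regex is compiled.
    private static final String GLOB_META = "\\*?[]{}";

    // Prefix every glob metacharacter with a backslash so that a glob matcher
    // honoring backslash escapes treats the path as a literal string.
    static String escapeGlob(String path) {
        StringBuilder sb = new StringBuilder(path.length() + 8);
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (GLOB_META.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: dir_\[\{\]\}/file\*.csv
        System.out.println(escapeGlob("dir_[{]}/file*.csv"));
    }
}
```

This keeps the decision with the caller, which is the same idea as a separate method or an
explicit option, only applied outside Spark.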

> Datasource loading not handling paths with regex chars.
> -------------------------------------------------------
>
>                 Key: SPARK-23420
>                 URL: https://issues.apache.org/jira/browse/SPARK-23420
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.2.1
>            Reporter: Mitchell
>            Priority: Major
>
> Greetings, during some recent testing I ran across an issue attempting to load files
with regex chars like []()* etc. in them. The files are valid in the various storages and
the normal hadoop APIs all function properly accessing them.
> When my code is executed, I get the following stack trace.
> 18/02/14 04:52:46 ERROR yarn.ApplicationMaster: User class threw exception: java.io.IOException: Illegal file pattern: Unmatched closing ')' near index 130
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???????_???????????????????_??????????????
> ^
> java.io.IOException: Illegal file pattern: Unmatched closing ')' near index 130
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???????_???????????????????_??????????????
> ^
>     at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:71)
>     at org.apache.hadoop.fs.GlobFilter.<init>(GlobFilter.java:50)
>     at org.apache.hadoop.fs.Globber.doGlob(Globber.java:210)
>     at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
>     at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1955)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.globStatus(S3AFileSystem.java:2477)
>     at org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:234)
>     at org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:244)
>     at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:618)
>     at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
>     at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
>     at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>     at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>     at scala.collection.immutable.List.foreach(List.scala:381)
>     at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>     at scala.collection.immutable.List.flatMap(List.scala:344)
>     at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:349)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>     at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
>     at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
>     at com.sap.profile.SparkProfileTask.main(SparkProfileTask.java:95)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
> Caused by: java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 130
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???????_???????????????????_??????????????
> ^
>     at java.util.regex.Pattern.error(Pattern.java:1955)
>     at java.util.regex.Pattern.compile(Pattern.java:1700)
>     at java.util.regex.Pattern.<init>(Pattern.java:1351)
>     at java.util.regex.Pattern.compile(Pattern.java:1054)
>     at org.apache.hadoop.fs.GlobPattern.set(GlobPattern.java:156)
>     at org.apache.hadoop.fs.GlobPattern.<init>(GlobPattern.java:42)
>     at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:67)
>     ... 25 more
> 18/02/14 04:52:46 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.io.IOException: Illegal file pattern: Unmatched closing ')' near index 130
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???????_???????????????????_??????????????
> ^)
> 18/02/14 04:52:46 INFO spark.SparkContext: Invoking stop() from shutdown hook
>  
> Code is as follows ...
> Dataset<Row> input = sqlContext.read()
>     .option("header", "true")
>     .option("sep", ",")
>     .option("quote", "\"")
>     .option("charset", "utf8")
>     .option("escape", "\\")
>     .csv("s3a://myBucket/A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#$%^&()-_=+[{]};',._鞍鞍亜_Белебей_鞍鞍めぐみ林原ぐみ林原めぐみ솅ᄌ종대왕_ไชยแม็คอินमाधु/COLUMN_HEADER_PRESENT_a_longer_file_name_with_different_types_of_characters_including_numbers_upper_case_and_lower_case_鞍鞍亜_Белебей_鞍鞍_林原めぐみ林原めぐみ林原めぐみ솅_CSV_PIPE_DELIM.csv");

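The failing fragment in the trace is `[(?:])`, at the position where the path contains `[{]}`.
That suggests the glob-to-regex translation turned `{` into `(?:` and `}` into `)` while keeping
`[` ... `]` as a character class, which leaves the final `)` unmatched. This reading is inferred
from the trace, not confirmed against the Hadoop source. The sketch below reproduces only the
regex failure with plain java.util.regex; UnmatchedParenDemo and compileError are hypothetical
names.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class UnmatchedParenDemo {
    // Returns the regex compiler's error description,
    // or null if the pattern compiles without error.
    static String compileError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        // Fragment copied from the stack trace: "[(?:]" parses as a character
        // class containing '(', '?' and ':', so the trailing ')' closes nothing.
        System.out.println(compileError("[(?:])"));

        // With the brackets escaped, "(?:" opens a real group and ')' closes it.
        System.out.println(compileError("\\[(?:\\])"));
    }
}
```

The first call reports the same "Unmatched closing ')'" description seen in the trace, and the
second compiles cleanly, which is consistent with the failure coming from brackets and braces
being interpreted as glob syntax instead of literal characters.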


