Date: Sat, 11 Feb 2017 21:01:41 +0000 (UTC)
From: "Alex S (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Commented] (SPARK-19340) Opening a file in CSV format will result in an exception if the filename contains special characters

[ https://issues.apache.org/jira/browse/SPARK-19340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862532#comment-15862532 ]

Alex S commented on SPARK-19340:
--------------------------------

Looks like the error happens when Spark tries to infer a schema from the file, which it does by default in the CSV case. If you provide a user-defined schema instead, it should work. I haven't tried it with HDFS, though.

{code}
spark.read.option("header", "false").schema(customSchema).csv("/test*.txt")
{code}

customSchema is the schema you define for your CSV file.
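For reference, here is a minimal sketch of what such a user-defined schema could look like; the column names and types are assumptions for illustration, so adjust them to match your actual file:

{code}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical two-column layout; replace with the real columns of your CSV.
val customSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("value", StringType, nullable = true)
))

// With an explicit schema, the CSV reader skips schema inference,
// which is the step that trips over glob special characters in the path.
val df = spark.read
  .option("header", "false")
  .schema(customSchema)
  .csv("/test*.txt")
{code}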
> Opening a file in CSV format will result in an exception if the filename contains special characters
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19340
>                 URL: https://issues.apache.org/jira/browse/SPARK-19340
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.2.0
>            Reporter: Reza Safi
>            Priority: Minor
>
> If you want to open a file whose name looks like {noformat}"*{*}*.*"{noformat} or {noformat}"*[*]*.*"{noformat} (i.e. contains glob special characters) using the CSV format, you will get "org.apache.spark.sql.AnalysisException: Path does not exist", whether the file is local or on HDFS.
> This bug can be reproduced on master and all other Spark 2 branches.
> To reproduce:
> # Create a file named like "test{00-1}.txt" in a local directory (e.g. /Users/reza/test/test{00-1}.txt)
> # Run spark-shell
> # Execute this command:
> {noformat}
> val df=spark.read.option("header","false").csv("/Users/reza/test/*.txt")
> {noformat}
> You will see the following stack trace:
> {noformat}
> org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/reza/test/test\{00-01\}.txt;
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:367)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:360)
>   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:360)
>   at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.readText(CSVFileFormat.scala:208)
>   at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:63)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:174)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:174)
>   at scala.Option.orElse(Option.scala:289)
>   at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:173)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:158)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:423)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:360)
>   ...
48 elided
> {noformat}
> If you put the file on Hadoop (e.g. under /user/root) and try to run the following:
> {noformat}
> val df=spark.read.option("header", false).csv("/user/root/*.txt")
> {noformat}
> you will get the following exception:
> {noformat}
> org.apache.hadoop.mapred.InvalidInputException: Input Pattern hdfs://hosturl/user/root/test\{00-01\}.txt matches 0 files
>   at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
>   at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>   at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>   at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>   at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>   at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1297)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
>   at org.apache.spark.rdd.RDD.take(RDD.scala:1292)
>   at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1332)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
>   at org.apache.spark.rdd.RDD.first(RDD.scala:1331)
>   at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.findFirstLine(CSVFileFormat.scala:167)
>   at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:59)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:421)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:421)
>   at scala.Option.orElse(Option.scala:289)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:420)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:413)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:349)
>   ... 48 elided
> {noformat}