spark-issues mailing list archives

From "Antonio Piccolboni (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-10804) "LOCAL" in LOAD DATA LOCAL INPATH means "remote"
Date Fri, 25 Sep 2015 16:42:04 GMT

    [ https://issues.apache.org/jira/browse/SPARK-10804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908296#comment-14908296 ]

Antonio Piccolboni commented on SPARK-10804:
--------------------------------------------

Good suggestion

SPARK-10834

> "LOCAL" in LOAD DATA LOCAL INPATH means "remote"
> ------------------------------------------------
>
>                 Key: SPARK-10804
>                 URL: https://issues.apache.org/jira/browse/SPARK-10804
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Antonio Piccolboni
>
> When connecting to a remote thriftserver with a custom JDBC client or beeline, LOAD DATA
> LOCAL INPATH fails. The HiveServer2 docs explain in a brief comment that "local" now means
> local to the server. I think this is just a rationalization for a bug. When a user types
> "local":
> # it needs to be local to him, not to some server
> # Failing 1., one needs to have a way to determine what local means and to create a "local"
> item under the new definition.
> With the thriftserver, I have a host to connect to, but I don't have any way to create
> a file local to that host, at least in Spark. It may not be desirable to create user directories
> on the thriftserver host or to run file transfer services like scp. Moreover, it appears
> that this syntax is unique to Hive and Spark, but its origin can be traced to LOAD DATA LOCAL
> INFILE in Oracle, and it was adopted by MySQL. In the latter's docs we can read: "If LOCAL is
> specified, the file is read by the client program on the client host and sent to the server.
> The file can be given as a full path name to specify its exact location. If given as a relative
> path name, the name is interpreted relative to the directory in which the client program was
> started". This is not to say that the Spark or Hive teams are bound to what Oracle and MySQL
> do, but to support the idea that the meaning of LOCAL is settled. For instance, the Impala
> documentation says: "Currently, the Impala LOAD DATA statement only imports files from HDFS,
> not from the local filesystem. It does not support the LOCAL keyword of the Hive LOAD DATA
> statement." I think this is a better solution. The way things are in thriftserver, I developed
> a client under the assumption that I could use LOAD DATA LOCAL INPATH, and all tests were
> passing in standalone mode, only to find with the first distributed test that:
> # LOCAL means "local to server", a.k.a. "remote"
> # INSERT INTO ... VALUES is not supported
> # There is really no workaround unless one assumes access to whatever data store Spark is
> running against, such as HDFS, and that the user can upload data to it.
> In the space of workarounds it is not terrible, but if you are trying to write a self-contained
spark package, that's a defeat and makes writing tests particularly hard.
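
For illustration only, the HDFS workaround mentioned in the issue might look roughly like the sketch below. The helper load_data_statement, the table name "events", and both file paths are hypothetical and not part of Spark or Hive; the sketch only builds the statement text and assumes the file has already been copied to storage the server can read (for example with `hdfs dfs -put`).

```python
def load_data_statement(path, table, local=False):
    """Build a HiveQL LOAD DATA statement as a string.

    Hypothetical helper for illustration; it is not a Spark or Hive API.
    No escaping is attempted, so pass only trusted paths and table names.
    """
    keyword = "LOCAL " if local else ""
    return f"LOAD DATA {keyword}INPATH '{path}' INTO TABLE {table}"


# With a remote thriftserver, LOCAL refers to the server's filesystem,
# so this statement fails unless the file exists on the server host.
print(load_data_statement("/home/me/data.csv", "events", local=True))

# Workaround: upload the file to shared storage first, e.g.
#   hdfs dfs -put data.csv /tmp/staging/data.csv
# and then load it without the LOCAL keyword.
print(load_data_statement("/tmp/staging/data.csv", "events"))
```

This only works when the client is allowed to write to the data store the server reads from, which is exactly the assumption the issue objects to.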



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

