beam-commits mailing list archives

From "Luke Cwik (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (BEAM-2283) Consider using actual URIs instead of Strings/ResourceIds in relation to FileSystems
Date Fri, 12 May 2017 22:33:04 GMT

    [ https://issues.apache.org/jira/browse/BEAM-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008833#comment-16008833 ]

Luke Cwik edited comment on BEAM-2283 at 5/12/17 10:32 PM:
-----------------------------------------------------------

One proposal is to:
* read().from("string path") represents an unescaped URI with no query or fragment component,
potentially containing glob characters and '\' to escape glob characters
* add read().from(URI uri) for cases where users need to specify query/fragment components

Conversion from string to URI would be handled through a double escaping mechanism to support
glob expressions:
* file:/my/path* would represent file:/my/path followed by the glob expression '*'. This would
be converted to the string file:/my/path%2A and then passed to a URI. FileSystem implementations
would need to inspect the URI for escaped glob expressions.
* file:/my/path\* would represent file:/my/path* (note that this is a file named path* and
not a glob expression). This would be converted to the string file:/my/path%5C%2A and then
passed to a URI. FileSystem implementations would need to inspect the URI, notice that it
is not a glob expression, and treat the unescaped path segment as a literal.
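
A minimal sketch of the string-to-URI conversion described above, in Java. The class
GlobEscaping and its methods are hypothetical names and not part of Beam. The sketch
percent-encodes the escape character '\' and the glob characters *, ?, [ and ]; encoding
'%' as well is an added assumption so that decoding stays unambiguous. The escaped example
encodes to file:/my/path%5C%2A here, since a '#' would start a fragment component, which
the proposal excludes from string paths.

```java
import java.net.URI;

final class GlobEscaping {
    // Characters this sketch percent-encodes: the escape character, the glob
    // characters named in the proposal, and '%' (assumption, see lead-in).
    private static final String TO_ENCODE = "\\*?[]%";

    /** Percent-encodes the characters above and copies everything else unchanged. */
    public static String encode(String spec) {
        StringBuilder out = new StringBuilder();
        for (char c : spec.toCharArray()) {
            if (TO_ENCODE.indexOf(c) >= 0) {
                out.append('%').append(String.format("%02X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Converts a user-facing string path into a URI. */
    public static URI toUri(String spec) {
        return URI.create(encode(spec));
    }

    public static void main(String[] args) {
        System.out.println(toUri("file:/my/path*"));   // file:/my/path%2A
        System.out.println(toUri("file:/my/path\\*")); // file:/my/path%5C%2A
    }
}
```

Because '?' is encoded too, a string passed through this sketch cannot carry a query
component, which matches the split between from("string path") and from(URI uri).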

It would be important for FileSystem implementations to work on the URI, its components, and
its path segments individually, converting them to their own internal representation and failing
if necessary.

Glob characters *, [], and ? would be understood and used by the internals of Apache Beam,
and glob conversion from *, [], and ? to internal FileSystem glob representations would be
FileSystem-dependent.
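
As an illustration of the FileSystem side, the sketch below shows how one hypothetical
implementation might inspect the raw URI path and convert it to its own internal
representation, here a java.util.regex.Pattern. The class GlobUriMatcher is an assumed
name and not a Beam API. It assumes that only percent-encoded characters can be glob or
escape characters, handles ASCII only, and fails on character classes rather than
supporting them.

```java
import java.net.URI;
import java.util.regex.Pattern;

final class GlobUriMatcher {
    /** Converts the raw path of a URI into a regex, treating %2A and %3F as globs. */
    public static Pattern toRegex(URI uri) {
        String raw = uri.getRawPath();
        StringBuilder regex = new StringBuilder();
        boolean escaped = false; // true right after an encoded '\'
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            boolean encoded = false;
            if (c == '%') {
                // java.net.URI has already validated the two hex digits.
                c = (char) Integer.parseInt(raw.substring(i + 1, i + 3), 16);
                encoded = true;
                i += 3;
            } else {
                i++;
            }
            if (escaped) {
                appendLiteral(regex, c); // '\' followed by c means a literal c
                escaped = false;
            } else if (encoded && c == '\\') {
                escaped = true;
            } else if (encoded && c == '*') {
                regex.append("[^/]*");   // any run of characters within one segment
            } else if (encoded && c == '?') {
                regex.append("[^/]");    // exactly one character within one segment
            } else if (encoded && (c == '[' || c == ']')) {
                throw new IllegalArgumentException(
                    "character classes are not supported by this sketch: " + uri);
            } else {
                appendLiteral(regex, c);
            }
        }
        if (escaped) {
            throw new IllegalArgumentException("dangling escape in " + uri);
        }
        return Pattern.compile(regex.toString());
    }

    // A backslash before a non-alphanumeric character always yields a literal in Java regex.
    private static void appendLiteral(StringBuilder regex, char c) {
        if (!Character.isLetterOrDigit(c)) {
            regex.append('\\');
        }
        regex.append(c);
    }

    public static void main(String[] args) {
        Pattern glob = toRegex(URI.create("file:/my/path%2A"));
        System.out.println(glob.matcher("/my/path-00001").matches()); // true
        Pattern literal = toRegex(URI.create("file:/my/path%5C%2A"));
        System.out.println(literal.matcher("/my/path*").matches());   // true
    }
}
```

Another FileSystem could map the same tokens to its native glob syntax instead of a regex;
the point is only that the decision between glob and literal is made on the raw URI.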

This proposal has the benefits that:
* users have the minimal amount of escaping that they need to do (only escape the set of glob
characters when they want things named with *, [], and ?)
* file:/my/path* is a canonical representation that most users would expect to represent file:/my/path
followed by the glob *



> Consider using actual URIs instead of Strings/ResourceIds in relation to FileSystems
> ------------------------------------------------------------------------------------
>
>                 Key: BEAM-2283
>                 URL: https://issues.apache.org/jira/browse/BEAM-2283
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-core, sdk-java-extensions, sdk-java-gcp, sdk-py
>            Reporter: Luke Cwik
>
> We treat things like URIs because we expect them to have a scheme component and to be
> able to resolve a parent/child, but we fail to treat them as URIs in the internal implementation
> since our string versions don't go through URI normalization. This brings up a few issues:
> * The cost of implementing and maintaining ResourceIds instead of having users use a
> standard URI implementation. This would just require FileSystems to be able to take a string
> and give back a URI (to enable them to have custom implementations in case they extend the
> concept of URIs with scheme-specific extensions).
> * The myriad of bugs that will come up because of improper usage of URI-like strings
> and the assumptions associated with them (like https://issues.apache.org/jira/browse/BEAM-2277)
> Note that swapping to URIs adds complexity because:
> * Resolving URIs with glob expressions needs to be handled carefully
> * FileSystems may need to implement a complicated type instead of ResourceId.
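
The normalization and parent/child resolution mentioned in the issue description can be
seen with java.net.URI directly. The sketch below is illustrative only; the class
UriNormalizationDemo and its child helper are hypothetical and not taken from Beam.

```java
import java.net.URI;

final class UriNormalizationDemo {
    /** Resolves a child name against a directory URI (the directory should end in '/'). */
    public static URI child(URI directory, String name) {
        return directory.resolve(name);
    }

    public static void main(String[] args) {
        // Plain strings: two spellings of the same location compare as different.
        System.out.println("file:/my/./dir/f".equals("file:/my/dir/f")); // false

        // URIs: normalize() removes "." and ".." segments before the comparison.
        URI normalized = URI.create("file:/my/./dir/../dir/f").normalize();
        System.out.println(normalized.equals(URI.create("file:/my/dir/f"))); // true

        // Parent/child resolution comes with the URI type.
        System.out.println(child(URI.create("file:/my/dir/"), "f")); // file:/my/dir/f
    }
}
```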



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
