reef-dev mailing list archives

From Markus Weimer <mar...@weimo.de>
Subject Re: [REEF-1892] HDFS File Copy only uses local HDFS
Date Mon, 25 Sep 2017 19:57:24 GMT
Hi,

Maybe we can code around both requirements. Have a look at this code:

```
string host = "hOst";
string protocol = "http";
string path = "path/to/fIle.txt";
Uri foo = new Uri($"{protocol}://{host}/{path}");

Console.WriteLine($"Uri.ToString: {foo.ToString()}");
Console.WriteLine($"Uri.OriginalString: {foo.OriginalString}");
```

It prints:

```
Uri.ToString: http://host/path/to/fIle.txt
Uri.OriginalString: http://hOst/path/to/fIle.txt
```

Hence, we can use the `OriginalString` property to keep the capitalization
exactly as the user typed it. However, we would lose the benefit of the
`Uri` class normalizing that string. We can either add a configuration
parameter for this or fall back to `OriginalString` in an exception handler
when files can't be found.
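
A minimal sketch of the fallback option, under some assumptions: `UriFallback`, `Resolve` and the `tryCopy` callback are hypothetical names and not part of REEF. `tryCopy` stands in for whatever performs the file system call and reports whether the file was found. The sketch tries the normalized string first and only then the string as typed:

```csharp
using System;
using System.IO;

public static class UriFallback
{
    // Hypothetical helper, not REEF API. `tryCopy` is an assumed callback that
    // returns false when the file cannot be found at the given location.
    public static string Resolve(Uri uri, Func<string, bool> tryCopy)
    {
        // First attempt: the string as normalized by the Uri class.
        string normalized = uri.ToString();
        if (tryCopy(normalized))
        {
            return normalized;
        }

        // Fallback: the exact string the user typed, capitalization included.
        string original = uri.OriginalString;
        if (original != normalized && tryCopy(original))
        {
            return original;
        }

        throw new FileNotFoundException("File not found under either form of the URI.", original);
    }
}

public static class Program
{
    public static void Main()
    {
        var uri = new Uri("http://hOst/path/to/fIle.txt");

        // Simulates a cluster that only accepts the exact capitalization.
        string used = UriFallback.Resolve(uri, s => s == "http://hOst/path/to/fIle.txt");
        Console.WriteLine($"Resolved with: {used}");
    }
}
```

The configuration-parameter option would replace the first attempt with a check of that parameter, which avoids the extra file system call on clusters known to be case sensitive.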

WDYT?

Markus



On Mon, Sep 25, 2017 at 10:29 AM, Shouheng Yi <shouyi@microsoft.com.invalid>
wrote:

> Hi Markus and Rogan,
>
> I proposed REEF-1827 because some clusters have specific rules on
> their hostname configuration - each letter in the hostname must be
> correctly capitalized. Keeping or undoing REEF-1827 both have their pros
> and cons. Here's a non-exhaustive list that I can summarize:
>
> Keeping REEF-1827
> Pros:
> 1. It caters to the clusters with strange DNS configs, and the majority of
> distributed file systems are based on HDFS.
> 2. It clarifies the user's responsibility to specify the exact file path on
> the distributed file system, with no room for interpretation.
> Cons:
> 1. It will not be compatible with WASB or other distributed file systems.
>
> Undoing REEF-1827
> Pros:
> 1. It can infer the file system when doing "dfs."
> 2. It forces users to adopt the correct naming convention for hostnames.
> However, hostnames most likely come before applications, so it's difficult
> for applications to change cluster setups.
> Cons:
> 1. We need to make forks for those strange clusters and provide support
> for those forks.
>
> Fix:
> I think it's cluster users' responsibility to point correctly where the
> file is. I believe if we do exactly what they typed in their program, it
> will be easier for the users to debug. I think we can keep REEF-1827 but
> also let users specify which file system is being used. We can then
> construct a file path and check before "dfs" is called to make sure that
> the file path is valid.
>
> Best,
> Shouheng
>
> -----Original Message-----
> From: Rogan Carr [mailto:rogan.carr@gmail.com]
> Sent: Sunday, September 24, 2017 8:53 PM
> To: dev@reef.apache.org
> Subject: Re: [REEF-1892] HDFS File Copy only uses local HDFS
>
> Hi Markus,
>
> >> There is no pretty solution that comes to mind. From a principled
> >> standpoint, we should undo REEF-1827. Hostnames are supposed to be
> >> case insensitive. However, clusters which don't adhere to that
> >> standard exist.
> >> Hence, we might need some work-around for them.
>
> I think the best path forward is for us to put together a fix that
> provides the former functionality along with a workaround for the
> capitalization issue addressed in REEF-1827. I'd rather not roll back
> REEF-1827 unless this turns out to be a difficult undertaking.
>
> Best,
> Rogan
>
> On Sun, Sep 24, 2017 at 10:10 AM, Markus Weimer <markus@weimo.de> wrote:
>
> > This looks like a really nasty interaction between the cluster
> > infrastructure and our code:
> >
> > REEF-1827 became necessary because some clusters have odd DNS setups
> > where the capitalization of hostnames mattered.
> > `hdfs://MyFaNcyNaMeNode/some/path.txt` would not evaluate to the same
> > file as `hdfs://myfancynamenode/some/path.txt`. Stripping the protocol
> > and host from the URL fixes that.
> >
> > However, that assumes that the relative path given then is evaluated
> > with respect to the right host and protocol. This assumption is true,
> > if it references a file on the *default* protocol and host of the
> > cluster.
> > However, that default filesystem on HDI seems to be the local HDFS of
> > the cluster, not the WASB filesystem.
> >
> > There is no pretty solution that comes to mind. From a principled
> > standpoint, we should undo REEF-1827. Hostnames are supposed to be
> > case insensitive. However, clusters which don't adhere to that
> > standard exist.
> > Hence, we might need some work-around for them.
> >
> > Markus
> >
>
