reef-dev mailing list archives

From Markus Weimer <mar...@weimo.de>
Subject Re: [REEF-1892] HDFS File Copy only uses local HDFS
Date Wed, 27 Sep 2017 00:16:57 GMT
We seem to have converged. Who wants to take this on, coding wise?

Thanks,

Markus

On Tue, Sep 26, 2017 at 5:11 PM, Rogan Carr <rogan.carr@gmail.com> wrote:

> Hi Markus,
>
> I'd say an exception handler so that we don't have to bubble up the option
> to the client and change the IFileSystem interface.
>
> Best,
> Rogan
>
> On Mon, Sep 25, 2017 at 4:28 PM, Shouheng Yi <shouyi@microsoft.com.invalid>
> wrote:
>
> > +1
> >
> > -----Original Message-----
> > From: Markus Weimer [mailto:markus@weimo.de]
> > Sent: Monday, September 25, 2017 12:57 PM
> > To: REEF Developers Mailinglist <dev@reef.apache.org>
> > Subject: Re: [REEF-1892] HDFS File Copy only uses local HDFS
> >
> > Hi,
> >
> > maybe we can code around both requirements. Have a look at this code:
> >
> > ```
> > string host = "hOst";
> > string protocol = "http";
> > string path = "path/to/fIle.txt";
> > Uri foo = new Uri($"{protocol}://{host}/{path}");
> >
> > Console.WriteLine($"Uri.ToString: {foo.ToString()}");
> > Console.WriteLine($"Uri.OriginalString: {foo.OriginalString}");
> > ```
> >
> > It prints:
> >
> > ```
> > Uri.ToString: http://host/path/to/fIle.txt
> > Uri.OriginalString: http://hOst/path/to/fIle.txt
> > ```
> >
> > Hence, we can use the `OriginalString` property to fix this. However, we
> > would lose the benefit of the `Uri` class normalizing that string. We can
> > either add a configuration parameter for this or make its handling part of
> > an exception handler for when files can't be found.
> >
> > WDYT?
> >
> > Markus
> >
> >
> >
> > On Mon, Sep 25, 2017 at 10:29 AM, Shouheng Yi <shouyi@microsoft.com.invalid>
> > wrote:
> >
> > > Hi Markus and Rogan,
> > >
> > > I proposed REEF-1827 because some clusters have specific rules
> > > on their hostname configuration: each letter in the hostname must be
> > > correctly capitalized. Keeping or undoing REEF-1827 both have their
> > > pros and cons. Here is a non-exhaustive list:
> > >
> > > Keeping REEF-1827
> > > Pros:
> > > 1. It caters to clusters with strange DNS configs, and the majority of
> > > distributed file systems are based on HDFS.
> > > 2. It clarifies the user's responsibility to specify the exact file path
> > > on the distributed file system, with no room for interpretation.
> > > Cons:
> > > 1. It will not be compatible with wasb or other distributed file systems.
> > >
> > > Undoing REEF-1827
> > > Pros:
> > > 1. It can infer the file system when doing "dfs."
> > > 2. It forces users to adopt the correct naming convention for hostnames.
> > > However, hostnames most likely predate the applications, so it's
> > > difficult for applications to change cluster setups.
> > > Cons:
> > > 1. We need to make forks for those strange clusters and provide
> > > support for those forks.
> > >
> > > Fix:
> > > I think it's the cluster users' responsibility to point correctly to
> > > where the file is. I believe if we do exactly what they typed in their
> > > program, it will be easier for the users to debug. I think we can keep
> > > REEF-1827 but also let the user specify which file system is being used.
> > > We can then construct a file path and check it before "dfs" is called to
> > > make sure that the file path is valid.
> > >
> > > Best,
> > > Shouheng
> > >
> > > -----Original Message-----
> > > From: Rogan Carr [mailto:rogan.carr@gmail.com]
> > > Sent: Sunday, September 24, 2017 8:53 PM
> > > To: dev@reef.apache.org
> > > Subject: Re: [REEF-1892] HDFS File Copy only uses local HDFS
> > >
> > > Hi Markus,
> > >
> > > >> There is no pretty solution that comes to mind. From a principled
> > > >> standpoint, we should undo REEF-1827. Hostnames are supposed to be
> > > >> case insensitive. However, clusters which don't adhere to that
> > > >> standard exist.
> > > >> Hence, we might need some work-around for them.
> > >
> > > I think the best path forward is for us to put together a fix that
> > > provides the former functionality along with a workaround for the
> > > capitalization issue addressed in REEF-1827. I'd rather not roll back
> > > REEF-1827 unless this turns out to be a difficult undertaking.
> > >
> > > Best,
> > > Rogan
> > >
> > > On Sun, Sep 24, 2017 at 10:10 AM, Markus Weimer <markus@weimo.de> wrote:
> > >
> > > > This looks like a really nasty interaction between the cluster
> > > > infrastructure and our code:
> > > >
> > > > REEF-1827 became necessary because some clusters have odd DNS setups
> > > > where the capitalization of hostnames mattered.
> > > > `hdfs://MyFaNcyNaMeNode/some/path.txt` would not evaluate to the
> > > > same file as `hdfs://myfancynamenode/some/path.txt`. Stripping the
> > > > protocol and host from the URL fixes that.
> > > >
> > > > However, that assumes that the remaining relative path is then
> > > > evaluated with respect to the right host and protocol. This
> > > > assumption holds only if the path references a file on the *default*
> > > > protocol and host of the cluster. However, the default filesystem on
> > > > HDI seems to be the local HDFS of the cluster, not the WASB
> > > > filesystem.
> > > >
> > > > There is no pretty solution that comes to mind. From a principled
> > > > standpoint, we should undo REEF-1827. Hostnames are supposed to be
> > > > case insensitive. However, clusters which don't adhere to that
> > > > standard exist.
> > > > Hence, we might need some work-around for them.
> > > >
> > > > Markus
> > > >
> > >
> >
>
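
For reference, the exception-handler variant discussed above could look roughly like the sketch below. It is only an illustration under stated assumptions: `OriginalStringFallback`, `OpenWithFallback` and the `open` delegate are hypothetical names that do not exist in REEF, and `FileNotFoundException` stands in for whatever error the real file system call reports when a file is missing. The only `Uri` members used are `ToString()` and `OriginalString`, as in the snippet quoted in the thread.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of REEF.
public static class OriginalStringFallback
{
    // `open` is a stand-in for the call that actually touches the file
    // system (assumption: it throws FileNotFoundException on a miss).
    public static T OpenWithFallback<T>(Uri uri, Func<string, T> open)
    {
        // Normalized form, e.g. with the hostname lowercased by Uri.
        string normalized = uri.ToString();
        // Exactly what the user typed, capitalization included.
        string original = uri.OriginalString;

        try
        {
            // Prefer the normalized string so well-behaved clusters
            // keep the benefit of Uri normalization.
            return open(normalized);
        }
        catch (FileNotFoundException) when (normalized != original)
        {
            // Retry only if normalization changed the string; otherwise
            // the filter is false and the exception propagates as usual.
            return open(original);
        }
    }
}
```

With this shape, neither the client nor the `IFileSystem` interface needs a new option: callers pass the same `Uri` as before, and the retry happens only on a miss.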
