From: Ming Li
Date: Wed, 15 Mar 2017 18:18:03 +0800
Subject: Re: Questions about filesystem / filespace / tablespace
To: dev@hawq.incubator.apache.org

Hi Kyle,

If we keep all of these filesystems similar to HDFS, i.e. append-only, then the required changes should be much smaller. I think we can go ahead and implement a demo if we have the resources; we may run into problems along the way, but we can find solutions or workarounds for them.

--------------------

For your question about the relationship between the three source files, below is my understanding (the code was not written by me, so my reading may not be completely correct):

(1) bin/gpfilesystem/hdfs/gpfshdfs.c -- implements all of the APIs referenced by the hdfs tuple in the pg_filesystem catalog; it calls libhdfs3 directly to access the HDFS file system. The reason for making it a wrapper is that these APIs are defined as UDFs, so we can support a similar filesystem by adding another tuple to pg_filesystem and a file of similar wrapper code, without changing any of the call sites. And because they are UDFs, an already-installed HAWQ binary can be upgraded to add a new file system.

(2) backend/storage/file/filesystem.c -- because all of the APIs in (1) are UDFs, a conversion layer is needed in order to call them directly. This file is responsible for converting the normal HDFS calls made by the HAWQ kernel into UDF calls.

(3) backend/storage/file/fd.c -- because the OS limits the number of open file descriptors, PostgreSQL/HAWQ keeps an LRU cache of opened file handles. The HDFS functions in this file manage file handles in the same way as for native file systems, and they call the APIs in (2) to interact with HDFS.

In short, the calling stack is: (3) --> (2) --> (1) --> libhdfs3 API.
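To make the wrapper pattern in (1) a bit more concrete, here is a minimal sketch of what a UDF wrapper for an "open" call might look like. This is not the real gpfshdfs.c code: the function name, the argument-passing convention, and the error handling are illustrative assumptions, and it uses plain PostgreSQL fmgr conventions rather than HAWQ's internal filesystem-UDF interface; only the fmgr macros and the libhdfs3 calls (hdfsConnect / hdfsOpenFile) are real APIs.

    /*
     * Illustrative sketch only -- not the actual gpfshdfs.c. A real wrapper
     * receives its arguments through HAWQ's internal filesystem-UDF
     * convention; plain text/int4 arguments are assumed here purely to
     * show the "UDF wrapper around libhdfs3" pattern.
     */
    #include "postgres.h"
    #include "fmgr.h"
    #include "utils/builtins.h"

    #include "hdfs/hdfs.h"          /* libhdfs3 C client header (path assumed) */

    PG_MODULE_MAGIC;                /* needed only if built as a loadable module */

    PG_FUNCTION_INFO_V1(myfs_openfile);

    Datum
    myfs_openfile(PG_FUNCTION_ARGS)
    {
        /* hypothetical argument convention: (host text, port int4, path text, flags int4) */
        char   *host  = text_to_cstring(PG_GETARG_TEXT_P(0));
        int     port  = PG_GETARG_INT32(1);
        char   *path  = text_to_cstring(PG_GETARG_TEXT_P(2));
        int     flags = PG_GETARG_INT32(3);

        /* connect and open through libhdfs3; a real wrapper would cache the connection */
        hdfsFS   fs   = hdfsConnect(host, port);
        hdfsFile file = hdfsOpenFile(fs, path, flags,
                                     0 /* default buffer size */,
                                     0 /* default replication */,
                                     0 /* default block size */);

        if (file == NULL)
            ereport(ERROR, (errmsg("myfs: could not open \"%s\"", path)));

        /* hand the opaque handle back; in HAWQ it is fd.c that tracks it in the LRU cache */
        PG_RETURN_POINTER(file);
    }

A new filesystem would register wrappers like this in pg_filesystem, filesystem.c would invoke them through the normal UDF call path, and fd.c would never need to know which storage backend it is talking to.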
--------------------

On the last question about tablespaces: PostgreSQL introduced them so that users can map different tablespaces to different paths, and those paths can be mounted with different file systems on Linux. But all of those filesystems expose the same API and the same functionality (they support UPDATE in place), so we cannot directly use tablespaces to handle this scenario. I also can't estimate how much effort would be needed, because I did not participate in the original HDFS filesystem support in HAWQ.

That's my opinion; any corrections or suggestions are welcome. Hope it helps you. Thanks.

On Wed, Mar 15, 2017 at 11:07 AM, Paul Guo wrote:

> Hi Kyle,
>
> I'm not sure whether I understand your point correctly, but with FUSE,
> which allows userspace file system implementations on Linux, users use the
> filesystem (e.g. S3 in your example) as block storage and access it via
> standard syscalls like open, close, read and write, although some
> behaviours or syscalls may not be supported. That means that for queries
> over a FUSE fs you can probably access the files through the interfaces in
> fd.c directly (I'm not sure whether some hacking is needed), but for this
> kind of distributed file system, library access is usually preferred over
> FUSE access because of: 1) performance (look up how FUSE works to see the
> long call paths it adds for every file access) and 2) stability (you add
> the FUSE kernel component to your software stack, and in my experience it
> is really painful to handle some of the resulting exceptions). For such
> storage I'd really prefer some other solution: library access as in HAWQ,
> an external table, whatever.
>
> Actually, a long time ago I saw FUSE over HDFS in a real production
> environment, so I'm curious whether anyone has tried querying through this
> kind of setup before and compared it with HAWQ for performance, etc.
>
> 2017-03-15 1:26 GMT+08:00 Kyle Dunn :
>
> > Ming -
> >
> > Great points about append-only. One potential work-around is to split a
> > table over multiple backend storage objects (a new file for each append
> > operation), then, maybe as part of VACUUM, perform object compaction.
> > For GCP, the server-side compaction capability for objects is called
> > compose. For AWS, you can emulate this behavior using multipart upload -
> > demonstrated concretely with the Ruby SDK here
> > <...object-concatenation-using-the-aws-sdk-for-ruby/>. Azure actually
> > supports append blobs <...04/13/introducing-azure-storage-append-blob/>
> > natively.
> >
> > For the FUSE exploration, can you (or anyone else) help me understand
> > the relationship and/or call graph between these different
> > implementations?
> >
> > - backend/storage/file/filesystem.c
> > - bin/gpfilesystem/hdfs/gpfshdfs.c
> > - backend/storage/file/fd.c
> >
> > I feel confident that everything HDFS-related ultimately uses
> > libhdfs3/src/client/Hdfs.cpp, but it seems like a convoluted path for
> > getting there from the backend code.
> >
> > Also, it looks like normal Postgres allows tablespaces to be created
> > like this:
> >
> > CREATE TABLESPACE fastspace LOCATION '/mnt/sda1/postgresql/data';
> >
> > This is much simpler than wrapping glibc calls and is exactly what would
> > be necessary if using FUSE modules + mount points to handle a
> > "pluggable" backend. Maybe you (or someone) can advise how much effort
> > it would be to bring "local:// FS" tablespace support back? It is
> > potentially less than trying to unravel all the HDFS-specific
> > implementation scattered around the backend code.
> >
> > Thanks,
> > Kyle
> >
> > On Mon, Mar 13, 2017 at 8:35 PM Ming Li wrote:
> >
> > > Hi Kyle,
> > >
> > > Good investigation!
> > >
> > > I think we can first add a tuple similar to hdfs in pg_filesystem, and
> > > then implement all of the APIs referenced by that tuple to call the
> > > FUSE API.
> > >
> > > However, because HAWQ is designed for HDFS, which is an append-only
> > > file system, when we support other types of filesystem we should
> > > investigate how to handle the performance and transaction issues.
> > > Performance can be investigated after we implement a demo, but the
> > > transaction issue should be decided beforehand. An append-only file
> > > system doesn't support UPDATE in place, and the inserted data are
> > > tracked by file length in pg_aoseg.pg_aoseg_xxxxx or
> > > pg_parquet.pg_parquet_xxxxx.
> > >
> > > Thanks.
> > >
> > > On Tue, Mar 14, 2017 at 7:57 AM, Kyle Dunn wrote:
> > >
> > > > Hello devs -
> > > >
> > > > I'm doing some reading about HAWQ tablespaces here:
> > > > http://hdb.docs.pivotal.io/212/hawq/ddl/ddl-tablespace.html
> > > >
> > > > I want to understand the flow of things, so please correct me on the
> > > > following assumptions:
> > > >
> > > > 1) Create a filesystem (not *really* supported after HAWQ init) -
> > > > the default is obviously [lib]HDFS[3]:
> > > > SELECT * FROM pg_filesystem;
> > > >
> > > > 2) Create a filespace, referencing the above filesystem:
> > > > CREATE FILESPACE testfs ON hdfs
> > > > ('localhost:8020/fs/testfs') WITH (NUMREPLICA = 1);
> > > >
> > > > 3) Create a tablespace, referencing the above filespace:
> > > > CREATE TABLESPACE fastspace FILESPACE testfs;
> > > >
> > > > 4) Create objects referencing the above tablespace, or set it as the
> > > > database's default:
> > > > CREATE DATABASE testdb WITH TABLESPACE=testfs;
> > > >
> > > > Given this set of steps, is it true (*in theory*) that an arbitrary
> > > > filesystem (i.e. storage backend) could be added to HAWQ using
> > > > *existing* APIs?
> > > >
> > > > I realize the nuances of this are significant, but conceptually I'd
> > > > like to gather some details, mainly in support of this ongoing JIRA
> > > > discussion. I'm daydreaming about whether this neat tool:
> > > > https://github.com/s3fs-fuse/s3fs-fuse could be useful for an S3
> > > > spike (which also seems to kind of work on Google Cloud, when
> > > > interoperability
> > > > <https://github.com/s3fs-fuse/s3fs-fuse/issues/109#issuecomment-286222694>
> > > > mode is enabled). By its Linux FUSE nature, it implements the lion's
> > > > share of the required pg_filesystem functions; in fact, maybe we
> > > > could actually use the system calls from glibc (somewhat) directly
> > > > in this situation.
> > > >
> > > > Curious to get some feedback.
> > > >
> > > > Thanks,
> > > > Kyle
> > > > --
> > > > *Kyle Dunn | Data Engineering | Pivotal*
> > > > Direct: 303.905.3171 | Email: kdunn@pivotal.io
> >
> > --
> > *Kyle Dunn | Data Engineering | Pivotal*
> > Direct: 303.905.3171 | Email: kdunn@pivotal.io