hawq-dev mailing list archives

From Zhanwei Wang <apa...@wangzw.org>
Subject Re: Questions about filesystem / filespace / tablespace
Date Wed, 15 Mar 2017 12:58:01 GMT
Hi Kyle

Let me share some history about HAWQ. It goes back about six years…

When we started designing HAWQ, we first implemented a demo version. Of course, it was not
called HAWQ at that time; it was called GoH (Greenplum on HDFS). The first implementation
was quite simple: we mounted HDFS on the local filesystem with FUSE and ran GPDB on it. We
quickly found that the performance was unacceptable.

We then decided to replace the storage layer of GPDB to make it work with HDFS. We
implemented a “pluggable filesystem” layer and added the pg_filesystem object to GPDB. That
was HAWQ in about early 2012.

At first we wanted to adopt the HDFS C API exactly, because it is almost the de facto
standard, but we found that it could not meet our requirements. So we implemented a wrapper
around the HDFS C API as our API standard. Any dynamic library that implements this API can
be loaded into HAWQ and registered in the pg_filesystem catalog, and then used to access
files on the target filesystem without modifying HAWQ code.
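
To make the idea concrete, here is a minimal sketch in Python of a catalog of operation
tables keyed by filesystem name. It is not the HAWQ implementation: FsOps, FS_CATALOG,
register_filesystem, lookup and the toy "mem" backend are all hypothetical names made up
for illustration, and the real API is a set of C functions loaded from a dynamic library.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class FsOps:
    """Hypothetical operation table that one filesystem plugin supplies."""
    open: Callable[[str, str], object]
    write: Callable[[object, bytes], int]
    read: Callable[[object, int], bytes]
    close: Callable[[object], None]


# Stand-in for the pg_filesystem catalog: one entry per filesystem name.
FS_CATALOG: Dict[str, FsOps] = {}


def register_filesystem(name: str, ops: FsOps) -> None:
    """Add a plugin to the catalog; calling code does not change."""
    if name in FS_CATALOG:
        raise ValueError(f"filesystem {name!r} is already registered")
    FS_CATALOG[name] = ops


def lookup(url: str) -> FsOps:
    """Pick the operation table from the URL scheme, e.g. 'mem://a/b'."""
    scheme, sep, _ = url.partition("://")
    if not sep or scheme not in FS_CATALOG:
        raise LookupError(f"no filesystem registered for {url!r}")
    return FS_CATALOG[scheme]


# A toy in-memory plugin, standing in for a real backend such as HDFS.
_MEM_FILES: Dict[str, bytearray] = {}


class _MemHandle:
    def __init__(self, buf: bytearray) -> None:
        self.buf = buf
        self.pos = 0  # read offset; writes always go to the end


def _mem_open(url: str, mode: str) -> _MemHandle:
    if mode == "a":
        return _MemHandle(_MEM_FILES.setdefault(url, bytearray()))
    if mode == "r":
        return _MemHandle(_MEM_FILES[url])  # KeyError if the file is missing
    raise ValueError(f"unsupported mode {mode!r}")


def _mem_write(handle: _MemHandle, data: bytes) -> int:
    handle.buf.extend(data)  # append-only: data always lands at the end
    return len(data)


def _mem_read(handle: _MemHandle, size: int) -> bytes:
    chunk = bytes(handle.buf[handle.pos:handle.pos + size])
    handle.pos += len(chunk)
    return chunk


def _mem_close(handle: _MemHandle) -> None:
    handle.pos = 0


register_filesystem("mem", FsOps(_mem_open, _mem_write, _mem_read, _mem_close))
```

A caller only ever does lookup(url) and then uses the returned operations, so supporting
another backend would mean registering one more entry rather than touching the callers.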

But this “pluggable filesystem” was never officially marked as a feature of HAWQ. We never
tested it with any filesystem other than HDFS. And as far as I know, some newer APIs were
never added to the pg_filesystem catalog for historical reasons. So I do not think the
“pluggable filesystem” can work now without changes and bug fixes.

A pluggable filesystem is appealing, but unfortunately it never got enough priority, and the
previous design may not be suitable anymore. I think this is a good chance to rethink how we
can achieve this goal and make it happen.




Zhanwei Wang

HashData
http://www.hashdata.cn



> On Mar 15, 2017, at 6:18 PM, Ming Li <mli@pivotal.io> wrote:
> 
> Hi Kyle,
> 
> 
>      If we keep all these filesystems similar to hdfs, supporting append
> only, then the changes should be much smaller. I think we can go ahead and
> implement a demo if we have the resources. We may encounter problems, but
> we can find solutions or workarounds for them.
> --------------------
> 
>      For your question about the relationship between the 3 source code files,
> below is my understanding (the code was not written by me, so my
> opinion may not be completely correct).
> (1) bin/gpfilesystem/hdfs/gpfshdfs.c -- implements all the APIs used in the hdfs
> tuple in the pg_filesystem catalog; it directly calls the libhdfs3 API
> to access the hdfs file system. The reason for making it a wrapper is to define
> all these APIs as UDFs, so that we can easily support a similar filesystem by
> adding a similar tuple to pg_filesystem and adding code similar to this file,
> without changing any place that calls these APIs. Also, because they are UDFs, we
> can add a new file system to an existing hawq binary through an upgrade.
> (2) backend/storage/file/filesystem.c -- because all the APIs in (1) are in the form
> of UDFs, we need a conversion if we want to call them directly.
> This file is responsible for converting normal hdfs calls in the hawq kernel
> into UDF calls.
> (3) backend/storage/file/fd.c -- because the OS limits the number of open file
> descriptors, PostgreSQL/HAWQ uses an LRU buffer to cache all
> open file handles. All the hdfs APIs in this file manage file handles
> the same way as for native file systems. These functions call the APIs in (2) to
> interact with hdfs.
> 
>     In short, the call stack is:  (3) --> (2) --> (1) --> libhdfs3
> API.
> -------------------
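
The LRU handle cache described in (3) can be sketched roughly as follows. This is a
simplified illustration in Python, not the code in fd.c: HandleCache and RecordingHandle are
hypothetical names, and details such as remembering the file position of an evicted handle
are left out.

```python
from collections import OrderedDict
from typing import Callable, List


class RecordingHandle:
    """Toy handle that records which paths were closed (demonstration only)."""
    closed: List[str] = []

    def __init__(self, path: str) -> None:
        self.path = path

    def close(self) -> None:
        RecordingHandle.closed.append(self.path)


class HandleCache:
    """Keep at most max_open handles open; evict the least recently used."""

    def __init__(self, opener: Callable[[str], object], max_open: int) -> None:
        if max_open < 1:
            raise ValueError("max_open must be at least 1")
        self._opener = opener      # e.g. a call into the filesystem layer
        self._max_open = max_open  # stand-in for the OS descriptor limit
        self._handles: "OrderedDict[str, object]" = OrderedDict()

    def get(self, path: str):
        """Return an open handle for path, reopening it if it was evicted."""
        if path in self._handles:
            self._handles.move_to_end(path)  # mark as most recently used
            return self._handles[path]
        while len(self._handles) >= self._max_open:
            _, victim = self._handles.popitem(last=False)  # oldest entry
            victim.close()
        handle = self._opener(path)
        self._handles[path] = handle
        return handle

    def open_paths(self) -> List[str]:
        """Paths with a live handle, least recently used first."""
        return list(self._handles)
```

With max_open=2, asking for "a", "b", "a" and then "c" closes "b", because "a" was touched
more recently.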
> 
>    The last question is about tablespaces. PostgreSQL introduced them so that
> users can point different tablespaces at different paths, and these paths can
> be mounted with different file systems on linux. But all those filesystems share
> the same API and the same functionality (supporting UPDATE in place).
> So we cannot directly use tablespaces to handle this scenario. Also, I
> cannot estimate how much effort is needed because I did not participate in the
> hdfs file system support in the original hawq release.
> 
> 
> That's my opinion; any corrections or suggestions are welcome! Hope it
> helps! Thanks.
> 
> 
> On Wed, Mar 15, 2017 at 11:07 AM, Paul Guo <paulguo@gmail.com> wrote:
> 
>> Hi Kyle,
>> 
>> I'm not sure whether I understand your point correctly, but with FUSE, which
>> allows userspace file system implementations on Linux, users use the
>> filesystem (e.g. S3 in your example) as block storage and access it via
>> standard sys calls like open, close, read, and write, although some behaviours
>> or sys calls may not be supported. That means that for queries on a FUSE
>> fs, you can probably access the files using the interfaces in fd.c
>> directly (I'm not sure whether some hacking is needed). But for this kind of
>> distributed file system, lib access is usually preferred over FUSE access
>> because of: 1) performance (you could read up on how FUSE works to see
>> the long call paths it adds to file access) and 2) stability (you add the
>> FUSE kernel part to your software stack, and in my experience it is
>> really painful to handle some exceptions). For such storage, I'd really
>> prefer other solutions: lib access like hawq, external tables, whatever.
>> 
>> Actually, a long time ago I saw fuse over hdfs in a real production
>> environment, so I'm curious whether anyone has tried querying through
>> it and compared the performance with hawq.
>> 
>> 
>> 
>> 
>> 2017-03-15 1:26 GMT+08:00 Kyle Dunn <kdunn@pivotal.io>:
>> 
>>> Ming -
>>> 
>>> Great points about append-only. One potential work-around is to split a
>>> table over multiple backend storage objects (a new file for each append
>>> operation). Then, maybe as part of VACUUM, perform object compaction. For
>>> GCP, the server-side compaction capability for objects is called compose
>>> <https://cloud.google.com/storage/docs/gsutil/commands/compose>. For AWS,
>>> you can emulate this behavior using Multipart upload
>>> <http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadInitiate.html> -
>>> demonstrated concretely with the Ruby SDK here
>>> <https://aws.amazon.com/blogs/developer/efficient-amazon-s3-object-concatenation-using-the-aws-sdk-for-ruby/>.
>>> Azure actually supports append-blobs
>>> <https://blogs.msdn.microsoft.com/windowsazurestorage/2015/04/13/introducing-azure-storage-append-blob/>
>>> natively.
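
The "new object per append, compact later" work-around can be sketched like this in Python.
It is only an illustration under stated assumptions: a plain dict stands in for the object
store, SegmentedAppendFile and its key layout are hypothetical, the prefix is assumed to
start out empty, and no cloud SDK call is shown.

```python
from typing import Dict, List


class SegmentedAppendFile:
    """Emulate an append-only file as a sequence of immutable objects."""

    def __init__(self, store: Dict[str, bytes], prefix: str) -> None:
        self._store = store  # stand-in for a bucket: key -> object bytes
        self._prefix = prefix.rstrip("/") + "/"
        self._next_seq = 0   # assumes the prefix holds no objects yet

    def _part_keys(self) -> List[str]:
        # Zero-padded sequence numbers make key order equal append order.
        return sorted(k for k in self._store if k.startswith(self._prefix))

    def _new_key(self) -> str:
        key = f"{self._prefix}{self._next_seq:012d}"
        self._next_seq += 1
        return key

    def append(self, data: bytes) -> str:
        """Write one append as a brand-new object; nothing is rewritten."""
        key = self._new_key()
        self._store[key] = bytes(data)
        return key

    def read_all(self) -> bytes:
        """Logical file contents: all parts concatenated in append order."""
        return b"".join(self._store[k] for k in self._part_keys())

    def compact(self) -> str:
        """Merge all parts into one object, as a VACUUM-like step might."""
        old_keys = self._part_keys()
        merged = b"".join(self._store[k] for k in old_keys)
        new_key = self._new_key()
        self._store[new_key] = merged
        # Not atomic: a reader between the write above and these deletes
        # would see the data twice. A real design needs a manifest or a
        # similar mechanism to switch over atomically.
        for key in old_keys:
            del self._store[key]
        return new_key
```

In a real backend the merge would presumably use the server-side operations mentioned
above instead of reading and rewriting the bytes in the client.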
>>> 
>>> For the FUSE exploration, can you (or anyone else) help me understand the
>>> relationship and/or call graph between these different implementations?
>>> 
>>>   - backend/storage/file/filesystem.c
>>>   - bin/gpfilesystem/hdfs/gpfshdfs.c
>>>   - backend/storage/file/fd.c
>>> 
>>> I feel confident that everything HDFS-related ultimately uses
>>> libhdfs3/src/client/Hdfs.cpp but it seems like a convoluted path for
>>> getting there from the backend code.
>>> 
>>> Also, it looks like normal Postgres allows tablespaces to be created like
>>> this:
>>> 
>>>      CREATE TABLESPACE fastspace LOCATION '/mnt/sda1/postgresql/data';
>>> 
>>> This is much simpler than wrapping glibc calls and is exactly what would be
>>> necessary if using FUSE modules + mount points to handle a "pluggable"
>>> backend. Maybe you (or someone) can advise how much effort it would be to
>>> bring "local:// FS" tablespace support back? It is potentially less than
>>> trying to unravel all the HDFS-specific implementation scattered around the
>>> backend code.
>>> 
>>> 
>>> Thanks,
>>> Kyle
>>> 
>>> On Mon, Mar 13, 2017 at 8:35 PM Ming Li <mli@pivotal.io> wrote:
>>> 
>>>> Hi Kyle,
>>>> 
>>>> Good investigation!
>>>> 
>>>> I think we can add a tuple similar to the hdfs one in pg_filesystem at first,
>>>> then implement all the APIs introduced in this tuple so that they call the FUSE API.
>>>>
>>>> However, because HAWQ is designed for hdfs, which is an append-only
>>>> file system, when we support other types of filesystem we should
>>>> investigate how to handle the performance and transaction issues. The
>>>> performance can be investigated after we implement a demo, but the
>>>> transaction issue should be decided beforehand. An append-only file system
>>>> does not support UPDATE in place, and the inserted data is tracked by the file
>>>> length in pg_aoseg.pg_aoseg_xxxxx or pg_parquet.pg_parquet_xxxxx.
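
As a rough illustration of tracking inserted data by file length, here is a small Python
sketch. AppendOnlySegment and committed_eof are hypothetical names; the catalog is reduced
to a single integer, and truncating on abort is an assumption made for the sketch, not a
statement about how HAWQ handles aborted writes.

```python
class AppendOnlySegment:
    """Append-only data file whose visible length lives in a 'catalog'."""

    def __init__(self) -> None:
        self._data = bytearray()  # physical file contents
        self.committed_eof = 0    # stand-in for the length kept in pg_aoseg

    def append(self, payload: bytes) -> int:
        """Write past the current end; returns the new physical length."""
        self._data.extend(payload)
        return len(self._data)

    def commit(self) -> None:
        """Publish the new length; only now do readers see the new bytes."""
        self.committed_eof = len(self._data)

    def abort(self) -> None:
        """Drop uncommitted bytes (assumed behaviour for this sketch)."""
        del self._data[self.committed_eof:]

    def read_visible(self) -> bytes:
        """Readers never look beyond the committed length."""
        return bytes(self._data[:self.committed_eof])
```

Because visibility depends only on the recorded length, existing bytes never need to be
modified, which is why UPDATE in place is not required from the storage layer.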
>>>> 
>>>> Thanks.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Tue, Mar 14, 2017 at 7:57 AM, Kyle Dunn <kdunn@pivotal.io> wrote:
>>>> 
>>>>> Hello devs -
>>>>> 
>>>>> I'm doing some reading about HAWQ tablespaces here:
>>>>> http://hdb.docs.pivotal.io/212/hawq/ddl/ddl-tablespace.html
>>>>> 
>>>>> I want to understand the flow of things, please correct me on the following
>>>>> assumptions:
>>>>> 
>>>>> 1) Create a filesystem (not *really* supported after HAWQ init) - the
>>>>> default is obviously [lib]HDFS[3]:
>>>>>      SELECT * FROM pg_filesystem;
>>>>> 
>>>>> 2) Create a filespace, referencing the above file system:
>>>>>      CREATE FILESPACE testfs ON hdfs
>>>>>      ('localhost:8020/fs/testfs') WITH (NUMREPLICA = 1);
>>>>> 
>>>>> 3) Create a tablespace, reference the above filespace:
>>>>>      CREATE TABLESPACE fastspace FILESPACE testfs;
>>>>> 
>>>>> 4) Create objects referencing the above table space, or set it as the
>>>>> database's default:
>>>>>      CREATE DATABASE testdb WITH TABLESPACE=fastspace;
>>>>> 
>>>>> Given this set of steps, is it true (*in theory*) that an arbitrary filesystem
>>>>> (i.e. storage backend) could be added to HAWQ using *existing* APIs?
>>>>> 
>>>>> I realize the nuances of this are significant, but conceptually I'd like to
>>>>> gather some details, mainly in support of this
>>>>> <https://issues.apache.org/jira/browse/HAWQ-1270> ongoing JIRA discussion.
>>>>> I'm daydreaming about whether this neat tool:
>>>>> https://github.com/s3fs-fuse/s3fs-fuse could be useful for an S3 spike
>>>>> (which also seems to kind of work on Google Cloud, when interoperability
>>>>> <https://github.com/s3fs-fuse/s3fs-fuse/issues/109#issuecomment-286222694>
>>>>> mode is enabled). By its Linux FUSE nature, it implements the lion's share
>>>>> of required pg_filesystem functions; in fact, maybe we could actually use
>>>>> system calls from glibc (somewhat <http://www.linux-mag.com/id/7814/>)
>>>>> directly in this situation.
>>>>> 
>>>>> Curious to get some feedback.
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> Kyle
>>>>> --
>>>>> *Kyle Dunn | Data Engineering | Pivotal*
>>>>> Direct: 303.905.3171 | Email: kdunn@pivotal.io
>>>>> 
>>>> 
>>> --
>>> *Kyle Dunn | Data Engineering | Pivotal*
>>> Direct: 303.905.3171 | Email: kdunn@pivotal.io
>>> 
>> 

