hawq-user mailing list archives

From "Adaryl \"Bob\" Wakefield, MBA" <adaryl.wakefi...@hotmail.com>
Subject Re: Can distributed DBs replace HDFS?
Date Fri, 20 Nov 2015 03:01:54 GMT
“long term storage, reporting, analytics”

Don’t you get these things with Hawq or am I misunderstanding something?

Adaryl "Bob" Wakefield, MBA
Mass Street Analytics, LLC
Twitter: @BobLovesData

From: Rowland Gosling 
Sent: Thursday, November 19, 2015 8:16 PM
To: user@hawq.incubator.apache.org 
Subject: RE: Can distributed DBs replace HDFS?

There isn’t an either/or proposition in modern data systems, i.e. it’s not that you choose either
distributed databases or HDFS. If you have complex systems there’s a good chance you need both.


For example, there’s no way I want my banking system running on anything less than a traditional
RDBMS with all its guarantees. Conversely, I can’t see large financial institutions not
leveraging HDFS in some capacity: long term storage, reporting, analytics. 


Both is the right answer in many cases.


Rowland Gosling

Senior Consultant

Pragmatic Works, Inc.



From: Adaryl "Bob" Wakefield, MBA [mailto:adaryl.wakefield@hotmail.com] 
Sent: Thursday, November 19, 2015 8:00 PM
To: user@hawq.incubator.apache.org
Subject: Re: Can distributed DBs replace HDFS?


You had me right up to the last paragraph. I’m coming from the standpoint of engineering
big data systems. There is a bewildering array of technologies and techniques that are currently
being used. As a matter of fact, I think either Nathan Marz or Jay Kreps gave a speech literally
titled “Reducing complexity in big data systems”. We lived with relational databases for
years. We had to move away from them because the data got ridiculous and traditional databases
couldn’t keep up. 


Now there are more types of databases than you can shake a stick at: column stores, graph
dbs, document dbs, and each one requires a different modeling technique, which means each one
has a learning curve for anybody new to NoSQL.


If you’re going to design and implement a big data system, at a minimum you’re going to
need to know Java, Git, some flavor of Linux, and some build tool (Ant, Maven, Gradle, etc.), and
all that is before we even start storing the data. If you’re coming from a non-computer-science
background, the amount of stuff you need to put in your head can quite literally blow
some people out of the career field (because who wants to learn new stuff at 40?).


So I’ve been watching with excitement the rise of “NewSQL” databases like MemSQL
and VoltDB, because you can use familiar skills to build big data systems. Instead of having
to write code to serialize an object to a file format, you can go back to just executing an
insert statement. You can model data like you’re used to modeling. However, those tools
are built for a transactional use case. When I look at Hawq and Greenplum, what I see are
relational databases that can handle big data and an analytics use case.
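The appeal of going back to an insert statement can be sketched in plain SQL; the table and column names below are made up purely for illustration:

```sql
-- Hypothetical table: with a NewSQL or MPP SQL database, loading data
-- is ordinary DDL/DML rather than custom serialization code.
CREATE TABLE page_views (
    user_id   INT,
    url       TEXT,
    viewed_at TIMESTAMP
);

-- No Avro/Parquet writer, no file-format or compression decisions:
-- just an insert, same as it has been for decades.
INSERT INTO page_views (user_id, url, viewed_at)
VALUES (42, '/home', now());
```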


From an analytics standpoint, most analytic tools I’m aware of try to mimic SQL. There
is HiveQL, Drill, SparkSQL, and the sqldf package in R. Some of these tools aren’t fully SQL compliant
and have quirks (more stuff to learn). Hawq/Greenplum gets us back to legit SQL.


So when I talk about leaving HDFS for a distributed DB, what I’m talking about is simplifying
the work necessary to store data by not having to know Java/MapReduce, or having to worry
about file formats and compression schemes, or having to have another tool that lets analysts
query the data as if it were a relational database. Let’s just put the data in an actual
relational database.


The key is to have a distributed database that is up to the challenge of modern data management.
Are we there yet or is there more work to do?


Adaryl "Bob" Wakefield, MBA
Mass Street Analytics, LLC
Twitter: @BobLovesData


From: Caleb Welton 
Sent: Thursday, November 19, 2015 1:11 PM
To: user@hawq.incubator.apache.org 
Subject: Re: Can distributed DBs replace HDFS?


Q: Is anybody using Hawq in production?


Separate answers depending on context.


• Is anybody using HAWQ in production?


Yes. Prior to HAWQ’s incubation in Apache, HAWQ was sold by Pivotal, and there are Pivotal customers
that have the pre-Apache code line in production today.


• Is anybody using Apache HAWQ in production?  


HAWQ was incubated into Apache very recently and we have not yet had an official Apache release.
 The code in Apache is based on the next major release and is currently in a pre-beta state
of stability.  We have a release motion in process for a beta release that should be available
soon, but there is no one that I am aware of currently in production with code based off the
Apache HAWQ codeline.


Q: What would be faster, placing stuff in HDFS or inserting directly into a distributed database?


Generally speaking, HDFS does add some overhead compared to a distributed database sitting on bare
metal.  However, in both cases the data must be replicated so that the distributed
system has built-in mechanisms for fault tolerance, and so the primary cost will be a comparison
of the overhead of the replication mechanisms in HDFS against the special-purpose mechanisms
in the distributed RDBMS.  One of the clearest comparisons would be HAWQ against the Greenplum
database (also recently open sourced), as they are both based on the same fundamental RDBMS
architecture, but HAWQ has been adapted to the Hadoop ecosystem while Greenplum has been optimized
for maximum bare-metal speed.


That said, there are other advantages you get from a Hadoop-based system beyond pure speed.
 These include greater elasticity, better integration with other Hadoop components, and built-in
cross-system resource management through components such as YARN.  If these benefits are not
of interest and your only concern is speed, then the Greenplum database may be a better choice.


Q: Does HAWQ store data in plain text format?


No.  HAWQ supports multiple data formats for input, including its own built-in format, Parquet,
and a variety of other data formats accessed via external data access mechanisms such as
PXF.  Support for the built-in format and Parquet comes complete with MVCC
snapshot isolation, which is a significant advantage if you want to ensure
transactional data loading.
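As a rough sketch of what that looks like in practice (the table name is made up, and the exact storage options should be checked against the HAWQ documentation), a Parquet-backed table is declared with ordinary-looking DDL:

```sql
-- Illustrative only: HAWQ-style DDL for an append-only, Parquet-backed table.
CREATE TABLE events (
    id      INT,
    payload TEXT
)
WITH (APPENDONLY = true, ORIENTATION = parquet);
```

Loads into such a table go through the same MVCC snapshot machinery as the built-in format, which is what makes transactional loading possible.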


Q: Can we leave behind HDFS and design high speed BI systems without all the extra IQ points
required to deal with writing and reading to HDFS?


Fundamentally, one of the key advantages of a system designed from open components and
for a broader ecosystem is that SQL, while an extremely important capability, is just part of
the puzzle for many modern businesses.  There are things you can do with MapReduce/Pig/Spark/etc.
that are not well expressed in SQL, and having a shared data store and data formats that allow
multiple backend processing systems to share data, all managed by a single resource management
system, provides additional flexibility and enhanced capabilities.


Does that help?





On Thu, Nov 19, 2015 at 10:41 AM, Adaryl Wakefield <adaryl.wakefield@hotmail.com> wrote:

  Is anybody using Hawq in production? Today I was thinking about speed and what would be
faster. Placing stuff on HDFS or inserts into a distributed database. Coming from a structured
data background, I haven't entirely wrapped my head around storing data in plain text format.
I know you can use stuff like Avro and Parquet to enforce schema but it's still just binary
data on disk without all the guarantees that you've come to expect from relational databases
over the past 20 years. In a perfect world, I'd like to have all the awesomeness of HDFS but
the ease of use of relational databases. My question is, are we there yet? Can we leave behind
HDFS (or just be abstracted away from it) and design high speed BI systems without all the
extra IQ points required to deal with writing and reading to HDFS?



