hadoop-common-user mailing list archives

From Andy Liu <andyliu1...@gmail.com>
Subject Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks
Date Wed, 15 Apr 2009 13:44:30 GMT
I'm not sure that comparing Hadoop to databases is an apples-to-apples
comparison.  Hadoop is a complete job execution framework, which collocates
the data with the computation.  I suppose DBMS-X and Vertica do that to some
extent, by way of SQL, but you're restricted to that.  If you want to, say,
build a distributed web crawler or a complex data processing pipeline,
Hadoop will schedule those processes across a cluster for you, while
Vertica and DBMS-X only deal with the storage of the data.

The choice of experiments seemed skewed towards DBMS-X and Vertica.  I think
everybody is aware that Map-Reduce is inefficient for handling SQL-like
queries and joins.

It's also worth noting that, I think, 4 out of the 7 authors either currently
work with Vertica or did at one time (or with C-Store, the precursor to
Vertica).

Andy

On Tue, Apr 14, 2009 at 10:16 AM, Guilherme Germoglio
<germoglio@gmail.com> wrote:

> (Hadoop is used in the benchmarks)
>
> http://database.cs.brown.edu/sigmod09/
>
> There is currently considerable enthusiasm around the MapReduce
> (MR) paradigm for large-scale data analysis [17]. Although the
> basic control flow of this framework has existed in parallel SQL
> database management systems (DBMS) for over 20 years, some
> have called MR a dramatically new computing model [8, 17]. In
> this paper, we describe and compare both paradigms. Furthermore,
> we evaluate both kinds of systems in terms of performance and
> development complexity. To this end, we define a benchmark
> consisting of a collection of tasks that we have run on an open source
> version of MR as well as on two parallel DBMSs. For each task,
> we measure each system’s performance for various degrees of
> parallelism on a cluster of 100 nodes. Our results reveal some
> interesting trade-offs. Although the process to load data into and tune
> the execution of parallel DBMSs took much longer than the MR
> system, the observed performance of these DBMSs was strikingly
> better. We speculate about the causes of the dramatic performance
> difference and consider implementation concepts that future
> systems should take from both kinds of architectures.
>
>
> --
> Guilherme
>
> msn: guigermoglio@hotmail.com
> homepage: http://germoglio.googlepages.com
>
