hadoop-common-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Hadoop / MySQL
Date Wed, 29 Apr 2009 18:48:17 GMT
On Wed, Apr 29, 2009 at 7:19 AM, Stefan Podkowinski <spodxx@gmail.com> wrote:

> If you have trouble loading your data into MySQL using INSERTs or LOAD
> DATA, consider that MySQL supports CSV directly using the CSV storage
> engine. The only thing you have to do is copy your Hadoop-produced
> CSV file into the MySQL data directory and issue a "flush tables"
> command to have MySQL flush its caches and pick up the new file. It's
> very simple and you have the full set of SQL commands available, just
> as with InnoDB or MyISAM. What you don't get with the CSV engine are
> indexes and foreign keys. Can't have it all, can you?

The CSV storage engine is definitely an interesting option, but it has a
few downsides:

- Like you mentioned, you don't get indexes. This seems like a huge deal to
me - the reason you want to load data into MySQL instead of just keeping it
in Hadoop is so you can service real-time queries. Not having any indexing
kind of defeats the purpose there. This is especially true since MySQL only
supports nested-loop joins, and there's no way of attaching metadata to a
CSV table to say "hey look, this table is already in sorted order so you can
use a merge join".

- Since CSV is a text-based format, it's likely to be a lot less compact
than a proper table. For example, a Unix timestamp is likely to be ~10
characters as text vs 4 bytes in a packed table.

- I'm not aware of many people actually using CSV for anything except
tutorials and training. Since it's not in heavy use by big MySQL users, I
wouldn't build a production system around it.
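
To make the first two points concrete, here is a minimal Python sketch. It is
only an illustration, not how MySQL implements anything: the function names
are made up for this example, the merge join assumes both inputs are already
sorted with unique keys, and the "packed" size assumes a 4-byte unsigned
integer column.

```python
import struct


def text_vs_packed_size(unix_ts):
    """Return (bytes as CSV text, bytes as a packed unsigned 32-bit int)."""
    as_text = str(unix_ts).encode("ascii")
    as_packed = struct.pack("<I", unix_ts)  # assumed 4-byte column encoding
    return len(as_text), len(as_packed)


def nested_loop_join(left, right):
    """Join (key, value) rows by comparing every left row with every right row."""
    comparisons = 0
    out = []
    for lk, lv in left:
        for rk, rv in right:
            comparisons += 1
            if lk == rk:
                out.append((lk, lv, rv))
    return out, comparisons


def merge_join(left, right):
    """Join two inputs already sorted by key (unique keys assumed per side)."""
    i = j = comparisons = 0
    out = []
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i][0] == right[j][0]:
            out.append((left[i][0], left[i][1], right[j][1]))
            i += 1
            j += 1
        elif left[i][0] < right[j][0]:
            i += 1
        else:
            j += 1
    return out, comparisons


if __name__ == "__main__":
    print(text_vs_packed_size(1241030897))
    left = [(k, "l%d" % k) for k in range(100)]
    right = [(k, "r%d" % k) for k in range(100)]
    print(nested_loop_join(left, right)[1], merge_join(left, right)[1])
```

With 100 sorted rows on each side, the nested loop does one comparison per
pair of rows while the merge does roughly one per row, which is the gap an
index (or a usable sort order) is supposed to close.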

Here's a wacky idea that I might be interested in hacking up if anyone's
interested:

What if there were a MyISAMTableOutputFormat in Hadoop? You could use this
as a reducer output and have it actually write .frm and .MYD files onto
HDFS, then simply "hadoop fs -get" them onto DB servers for realtime serving.
Sounds like a fun hack I might be interested in if people would find it
useful. Building the .MYI indexes in Hadoop would be pretty killer as well,
but potentially more difficult.

