hadoop-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Peter Skomoroch <peter.skomor...@gmail.com>
Subject Re: Hadoop / MySQL
Date Tue, 28 Apr 2009 14:20:55 GMT
Thanks for sharing sounds like a nice system - I always advise people to
avoid direct SQL inserts for batch jobs / large amounts of data and use
MySQL's optimized LOAD utility like you did.  Same goes for Oracle...
Nothing brings a DB server to its knees like a ton of individual inserts on
indexed tables..

On Tue, Apr 28, 2009 at 6:46 AM, Ankur Goel <ankur.goel@corp.aol.com> wrote:

> hello hadoop users,
> Recently I had a chance to lead a team building a log-processing system
> that uses Hadoop and MySQL. The system's goal was to process the incoming
> information as quickly as possible (real time or near real time), and make
> it available for querying in MySQL. I thought it would be good to share the
> experience and the challenges with the community. Couldn't think of a better
> place than these mailing lists as I am not much of a blogger :-)
> The information flow in the system looks something like
> [Apache-Servers] -> [Hadoop] -> [MySQL-shards] -> [Query-Tools]
> Transferring from Apache-Servers to Hadoop was quite easy as we just had to
> organize the data in timely buckets (directories). Once that was running
> smooth we had to make sure that map-reduce jobs are fired at regular
> intervals and they pick up the right data. The jobs would then
> process/aggregate the date and dump the info into MySQL shards from the
> reducers [we have our own DB partioning set up]. This is where we hit major
> bottlenecks [any surprises? :-)]
> The table engine used was InnoDB as there was a need for fast replication
> and writes but only moderate reads (should eventually support high read
> rates). The data would take up quite a while to load completely far away
> from being near-real time. And so our optimization journey begin.
> 1. We tried to optimize/tune InnoDB parameters like increasing the buffer
> pool size to 75 % of available RAM. This helped but only till the time DBs
> were lightly loaded i.e. innoDB had sufficient buffer pool to host the data
> and indexes.
> 2. We also realized that InnoDB has considerable locking overhead because
> of which write concurrency is really bad when you have a large number of
> concurrent threads doing writes. The default thread concurrency for us was
> set to no_of_cpu * 2 = 8 which is what the official documentation advises as
> the optimal limit. So we limited the number of reduce tasks and consequently
> the number of concurrent writes and boy the performance improved 4x. We were
> almost there :-)
> 3. Next thing we tried is the standard DB optimzation techniques like
> de-normalizing the schema and dropping constraints. This gave only a minor
> performance improvement, nothing earth shattering. Note that we were already
> caching connections in reducers to each MySQL shard and partionining logic
> was embedded into reducers.
> 4. Falling still short of our performance objectives, we finally we decided
> to get rid of JDBC writes from reducers and work on an alternative that uses
> MySQLs LOAD utility.
> - The processing would partition the data into MySQL shard specific files
> resident in HDFS.
> - A script would then spawn processes via ssh on different physical
> machines to download this data.
> - Each spawned process just downloads the data for the shard it should
> upload to.
> - All the processes then start uploading data in parallel into their
> respective MySQL shards using LOAD DATA infile.
> This proved to be the fastest approach, even in the wake of increasing data
> loads. The enitre processing/loading would complete in less than 6 min. The
> system has been holding up quite well so far, even though we've had to limit
> the number of days for which we keep the data or else the MySQLs get
> overwhelmed.
> Hope this is helpful to people.
> Regards
> -Ankur

Peter N. Skomoroch

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message