hadoop-general mailing list archives

From Yi-Kai Tsai <yi...@yahoo-inc.com>
Subject Re: Hadoop / MySQL
Date Tue, 28 Apr 2009 11:45:59 GMT
Hi Ankur

Nice share. By the way, what does your query behavior look like?
I'm asking because if the queries are simple or could be
transformed/normalized, you could try writing the output to HBase directly?

Yi-Kai

> Hello Hadoop users,
> Recently I had a chance to lead a team building a log-processing system that uses Hadoop
> and MySQL. The system's goal was to process the incoming information as quickly as possible
> (real time or near real time) and make it available for querying in MySQL. I thought it would
> be good to share the experience and the challenges with the community. Couldn't think of a
> better place than these mailing lists, as I am not much of a blogger :-)
>
> The information flow in the system looks something like 
>
> [Apache-Servers] -> [Hadoop] -> [MySQL-shards] -> [Query-Tools] 
>
> Transferring from the Apache servers to Hadoop was quite easy, as we just had to organize
> the data into timely buckets (directories). Once that was running smoothly, we had to make
> sure that the map-reduce jobs were fired at regular intervals and picked up the right data.
> The jobs would then process/aggregate the data and dump the results into MySQL shards from
> the reducers [we have our own DB partitioning set up]. This is where we hit major
> bottlenecks [any surprises? :-)]
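
For readers following along, here is a minimal sketch of what such a JDBC-writing reducer can
look like on the old mapred API. The class, table, column, and host names are all made up; this
is not Ankur's actual code, just the general shape of the initial write path.

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical reducer: sums counts per key and INSERTs one row per key
// into MySQL over JDBC.
public class JdbcWriteReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, NullWritable, NullWritable> {

  private Connection conn;           // opened once per reduce task, reused
  private PreparedStatement insert;

  public void configure(JobConf job) {
    try {
      Class.forName("com.mysql.jdbc.Driver");
      conn = DriverManager.getConnection(
          "jdbc:mysql://db-host:3306/logs", "user", "pass");
      insert = conn.prepareStatement(
          "INSERT INTO hourly_counts (log_key, cnt) VALUES (?, ?)");
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  public void reduce(Text key, Iterator<LongWritable> values,
      OutputCollector<NullWritable, NullWritable> out, Reporter reporter)
      throws IOException {
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    try {
      insert.setString(1, key.toString());
      insert.setLong(2, sum);
      insert.executeUpdate();        // many small writes: the bottleneck described above
    } catch (Exception e) {
      throw new IOException(e.toString());
    }
  }

  public void close() throws IOException {
    try {
      conn.close();
    } catch (Exception e) {
      throw new IOException(e.toString());
    }
  }
}
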
>
> The table engine used was InnoDB, as there was a need for fast replication and writes
> but only moderate reads (it should eventually support high read rates). The data would take
> quite a while to load completely, far from being near real time. And so our optimization
> journey began.
>
> 1. We tried to optimize/tune InnoDB parameters, like increasing the buffer pool size to
> 75% of available RAM. This helped, but only as long as the DBs were lightly loaded, i.e.
> InnoDB had a sufficient buffer pool to host the data and indexes.
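
For reference, that tuning is essentially a one-line my.cnf change along these lines. The figure
below assumes a hypothetical 32 GB DB box; it is not the team's actual value.

[mysqld]
# roughly 75% of RAM on a (hypothetical) 32 GB box
innodb_buffer_pool_size = 24G
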
>
> 2. We also realized that InnoDB has considerable locking overhead, because of which write
> concurrency gets really bad when you have a large number of concurrent threads doing writes.
> The default thread concurrency for us was set to no_of_cpu * 2 = 8, which is what the
> official documentation advises as the optimal limit. So we limited the number of reduce
> tasks, and consequently the number of concurrent writes, and boy, the performance improved
> 4x. We were almost there :-)
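
On the Hadoop side that cap is a single knob in the job driver. A fragment, with a made-up job
class and an illustrative reducer count (needs org.apache.hadoop.mapred.JobConf):

// Capping reduce tasks caps the number of concurrent JDBC writers that can
// hit any one shard; keep it at or below innodb_thread_concurrency (8 above).
JobConf job = new JobConf(LogAggregationJob.class);  // LogAggregationJob is hypothetical
job.setNumReduceTasks(4);                            // illustrative value only
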
>
> 3. The next thing we tried was standard DB optimization techniques, like de-normalizing
> the schema and dropping constraints. This gave only a minor performance improvement, nothing
> earth-shattering. Note that we were already caching connections to each MySQL shard in the
> reducers, and the partitioning logic was embedded in the reducers.
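
The connection caching / shard selection mentioned in point 3 can be as simple as the fragment
below inside a reducer like the earlier sketch (hypothetical names; needs java.sql.* and
java.util.* imports):

// One cached JDBC connection per shard, kept for the lifetime of the reduce task.
private Map<Integer, Connection> shardConns = new HashMap<Integer, Connection>();
private static final int NUM_SHARDS = 4;   // hypothetical shard count

private Connection shardFor(String key) throws SQLException {
  int shard = (key.hashCode() & Integer.MAX_VALUE) % NUM_SHARDS;
  Connection c = shardConns.get(shard);
  if (c == null) {
    c = DriverManager.getConnection(
        "jdbc:mysql://db-shard-" + shard + ":3306/logs", "user", "pass");
    shardConns.put(shard, c);
  }
  return c;
}
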
>
> 4. Still falling short of our performance objectives, we finally decided to get rid of
> JDBC writes from the reducers and work on an alternative that uses MySQL's LOAD utility.
> - The processing would partition the data into MySQL-shard-specific files resident in HDFS.
> - A script would then spawn processes via ssh on different physical machines to download
> this data.
> - Each spawned process downloads just the data for the shard it should upload to.
> - All the processes then start uploading data in parallel into their respective MySQL
> shards using LOAD DATA INFILE.
>
> This proved to be the fastest approach, even in the face of increasing data loads. The
> entire processing/loading would complete in less than 6 minutes. The system has been holding
> up quite well so far, even though we've had to limit the number of days for which we keep
> the data, or else the MySQL shards get overwhelmed.
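
One way to produce those shard-specific files from the job (not necessarily how Ankur's team did
it) is a custom Partitioner, so that the reduce output files line up one-to-one with the shards.
A minimal sketch on the old mapred API, with a hypothetical class name:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes every record to the reduce task for its MySQL shard, so each
// reducer's output file (part-00000, part-00001, ...) holds exactly one
// shard's data, ready to be pulled to the DB host and bulk-loaded with
// LOAD DATA INFILE.
public class ShardPartitioner implements Partitioner<Text, Text> {

  public void configure(JobConf job) {
    // nothing to configure
  }

  public int getPartition(Text key, Text value, int numPartitions) {
    // numPartitions == number of reduce tasks == number of shards
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be wired in with job.setPartitionerClass(ShardPartitioner.class) and
job.setNumReduceTasks(numberOfShards).
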
>
> Hope this is helpful to people. 
>
> Regards 
> -Ankur 
>   


-- 
Yi-Kai Tsai (cuma) <yikai@yahoo-inc.com>, Asia Search Engineering.

