From: Peter Skomoroch
To: core-user@hadoop.apache.org
Cc: general@hadoop.apache.org
Date: Tue, 28 Apr 2009 10:20:55 -0400
Subject: Re: Hadoop / MySQL

Thanks for sharing, sounds like a nice system. I always advise people to avoid direct SQL inserts for batch jobs / large amounts of data and to use MySQL's optimized LOAD utility like you did. The same goes for Oracle... nothing brings a DB server to its knees like a ton of individual inserts on indexed tables.
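For anyone on the list who wants to make the same switch, here is a rough sketch of what it can look like from the Java side (the table, columns, host, and file names below are all invented, and LOAD DATA LOCAL INFILE has to be enabled on both the MySQL server and the Connector/J driver):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ShardBulkLoad {
    public static void main(String[] args) throws Exception {
        // Stage the rows as a tab-separated file instead of issuing one
        // INSERT per record from the reducer.
        String tsv = "/tmp/shard_00.tsv";
        BufferedWriter out = new BufferedWriter(new FileWriter(tsv));
        out.write("2009-04-28 10:20:55\t/index.html\t200\n");
        out.write("2009-04-28 10:20:56\t/about.html\t404\n");
        out.close();

        // Hand the whole file to MySQL in a single statement; the server
        // applies it in bulk instead of paying per-INSERT overhead.
        Class.forName("com.mysql.jdbc.Driver");
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://shard-00.example.com:3306/logs", "loader", "secret");
        Statement st = con.createStatement();
        st.execute("LOAD DATA LOCAL INFILE '" + tsv + "'"
                + " INTO TABLE page_hits"
                + " FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'"
                + " (hit_time, url, status)");
        st.close();
        con.close();
    }
}

Batching INSERTs with addBatch()/executeBatch() is already a step up from row-at-a-time writes, but the LOAD path avoids most of the per-statement overhead, which is why it holds up so much better for bulk loads into indexed InnoDB tables.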
On Tue, Apr 28, 2009 at 6:46 AM, Ankur Goel wrote:

> hello hadoop users,
> Recently I had a chance to lead a team building a log-processing system that uses Hadoop and MySQL. The system's goal was to process the incoming information as quickly as possible (real time or near real time) and make it available for querying in MySQL. I thought it would be good to share the experience and the challenges with the community. Couldn't think of a better place than these mailing lists, as I am not much of a blogger :-)
>
> The information flow in the system looks something like this:
>
> [Apache-Servers] -> [Hadoop] -> [MySQL-shards] -> [Query-Tools]
>
> Transferring from the Apache servers to Hadoop was quite easy, as we just had to organize the data into timely buckets (directories). Once that was running smoothly, we had to make sure that map-reduce jobs were fired at regular intervals and picked up the right data. The jobs would then process/aggregate the data and dump the results into MySQL shards from the reducers [we have our own DB partitioning set up]. This is where we hit major bottlenecks [any surprises? :-)]
>
> The table engine used was InnoDB, as there was a need for fast replication and writes but only moderate reads (it should eventually support high read rates). The data would take quite a while to load completely, far from being near real time. And so our optimization journey began.
>
> 1. We tried to optimize/tune InnoDB parameters, like increasing the buffer pool size to 75% of available RAM. This helped, but only as long as the DBs were lightly loaded, i.e. InnoDB had a sufficient buffer pool to hold the data and indexes.
>
> 2. We also realized that InnoDB has considerable locking overhead, because of which write concurrency is really bad when you have a large number of concurrent threads doing writes. The default thread concurrency for us was set to no_of_cpu * 2 = 8, which is what the official documentation advises as the optimal limit. So we limited the number of reduce tasks, and consequently the number of concurrent writes, and boy, the performance improved 4x. We were almost there :-)
>
> 3. The next thing we tried was standard DB optimization techniques like de-normalizing the schema and dropping constraints. This gave only a minor performance improvement, nothing earth-shattering. Note that we were already caching connections to each MySQL shard in the reducers, and the partitioning logic was embedded in the reducers.
>
> 4. Still falling short of our performance objectives, we finally decided to get rid of JDBC writes from the reducers and work on an alternative that uses MySQL's LOAD utility:
> - The processing would partition the data into MySQL-shard-specific files resident in HDFS.
> - A script would then spawn processes via ssh on different physical machines to download this data.
> - Each spawned process downloads only the data for the shard it should upload to.
> - All the processes then start uploading data in parallel into their respective MySQL shards using LOAD DATA INFILE.
>
> This proved to be the fastest approach, even in the face of increasing data loads. The entire processing/loading would complete in less than 6 minutes. The system has been holding up quite well so far, even though we've had to limit the number of days for which we keep the data, or else the MySQLs get overwhelmed.
>
> Hope this is helpful to people.
>
> Regards
> -Ankur

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch