incubator-flume-user mailing list archives

From Felix Giguere Villegas <felix.gigu...@mate1inc.com>
Subject Best way to extract mysql bin logs?
Date Tue, 09 Aug 2011 18:42:37 GMT
Hi :) !

I have a use case where I want to keep a historical record of all the
changes (insert/update/delete) happening on a MySQL DB.

I am able to tail the bin logs and record them in HDFS, but they are not
easy to parse because one operation is split across many lines. There are
also comment lines that include the timestamp, the total time it took to
execute the query, and other metadata. Most of this extra info is not
relevant, but the timestamp is important to me, and I figured I might as
well keep the rest, since having the raw data gives me the option of going
back for those other fields if I determine later on that I need them.
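
For context, the flow right now is basically just a tail source feeding a
collector sink, roughly like this (the path is a placeholder and I am going
from memory of the 0.9.x flow-spec syntax, so the exact arguments may be
slightly off):

  binlogNode : tail("/var/log/mysql/binlog-text.log") | collectorSink("hdfs://namenode/flume/binlogs/", "binlog-", 300000);

The last argument is the roll time in milliseconds, which is where the
arbitrary file boundaries come from.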

Now, the fact that it's split over many lines makes it harder to use with
Map/Reduce.

I have thought of using a custom M/R RecordReader, but that still leaves the
problem that some of the lines related to one operation can end up at the end
of one HDFS file while the rest land at the beginning of the next one, since
I am opening and closing those files at an arbitrary roll time.

I think the easiest way would be to do some minimal ETL at the source,
probably with a custom decorator. Basically, that decorator would group onto
a single line all the bin log lines that relate to a single DB operation,
with the original lines separated by semi-colons or some other character in
the final output.
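
In case it helps, here is a rough sketch of the kind of decorator I have in
mind. I am writing this against the 0.9.x EventSinkDecorator API from memory,
so class names and method signatures may not be exact, and the "# at "
boundary test assumes the text output of mysqlbinlog:

  import java.io.IOException;

  import com.cloudera.flume.core.Event;
  import com.cloudera.flume.core.EventImpl;
  import com.cloudera.flume.core.EventSink;
  import com.cloudera.flume.core.EventSinkDecorator;

  /**
   * Groups all the lines of one bin log entry into a single event.
   * Assumes each tailed line arrives as its own event and that a new
   * entry starts with a "# at <offset>" comment line.
   */
  public class BinLogGroupingDecorator<S extends EventSink>
      extends EventSinkDecorator<S> {

    private static final String RECORD_START = "# at ";
    private static final String JOIN_CHAR = ";"; // could be any separator

    private final StringBuilder buffer = new StringBuilder();

    public BinLogGroupingDecorator(S sink) {
      super(sink);
    }

    @Override
    public void append(Event e) throws IOException {
      String line = new String(e.getBody());
      if (line.startsWith(RECORD_START) && buffer.length() > 0) {
        // A new entry begins: emit the previous one as a single event.
        super.append(new EventImpl(buffer.toString().getBytes()));
        buffer.setLength(0);
      }
      if (buffer.length() > 0) {
        buffer.append(JOIN_CHAR);
      }
      buffer.append(line);
    }

    @Override
    public void close() throws IOException {
      // Emit whatever is still buffered before closing the downstream sink.
      if (buffer.length() > 0) {
        super.append(new EventImpl(buffer.toString().getBytes()));
        buffer.setLength(0);
      }
      super.close();
    }
  }

The timestamp comments would simply travel along inside the grouped event, so
nothing is lost. I have left out the plugin/builder registration needed to
reference the decorator from a config, and one obvious gap is that the latest
entry sits in the buffer until the next one starts, so a quiet DB would delay
it.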

I wanted to check with you guys to see if that approach made sense. If you
have better suggestions, then I'm all ears, of course. Also, if you think
there is an easier way than reading the bin logs to accomplish my original
goal, then I'd like to hear about it as well :)

Thanks :) !

--
Felix
