incubator-flume-user mailing list archives

From Eric Sammer <esam...@cloudera.com>
Subject Re: Best way to extract mysql bin logs?
Date Tue, 09 Aug 2011 20:06:51 GMT
Felix:

You definitely need to implement a custom source that knows how to
read the bin logs and pack each transaction into an event, rather than
just tailing the file. This gives you a discrete event for each
transaction that can be treated as a single unit and makes downstream
processing MUCH easier.
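For concreteness, here is a rough sketch of what such a source could look
like against the 0.9.x plugin API (EventSource.Base / EventImpl, as in the
HelloWorldSource plugin example in the docs). The parsing is a placeholder
that just buffers mysqlbinlog text output until it sees a trailing ';', and
the class name and constructor are made up for illustration:

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;

  import com.cloudera.flume.core.Event;
  import com.cloudera.flume.core.EventImpl;
  import com.cloudera.flume.core.EventSource;

  /**
   * Sketch: reads a text-format mysqlbinlog dump and emits one event per
   * statement instead of one per line. The end-of-statement detection is
   * deliberately naive (a trailing ';'); a real source would understand
   * the binlog format properly and follow the live file.
   */
  public class BinLogSource extends EventSource.Base {
    private final String path;
    private BufferedReader reader;

    public BinLogSource(String path) {
      this.path = path;
    }

    @Override
    public void open() throws IOException {
      reader = new BufferedReader(new FileReader(path));
    }

    @Override
    public Event next() throws IOException {
      StringBuilder stmt = new StringBuilder();
      String line;
      while ((line = reader.readLine()) != null) {
        stmt.append(line).append(' ');
        if (line.trim().endsWith(";")) {
          // one complete statement -> one event
          return new EventImpl(stmt.toString().getBytes());
        }
      }
      return null; // nothing more to read
    }

    @Override
    public void close() throws IOException {
      if (reader != null) {
        reader.close();
      }
    }
  }

You would still need the usual plugin registration (a builder() factory and
a flume.plugin.classes entry) to use it from a node's source spec, but the
core of it is just "accumulate lines, emit whole transactions".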

Things to keep in mind:
* Flume does NOT guarantee ordering, so make sure each event carries a
timestamp or transaction ID that you can order by.
* Flume does NOT guarantee that you won't get duplicates, so make sure
each event also carries a globally unique transaction ID that you can
deduplicate on (see the sketch below).
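
To make those two points concrete, here is one way the source could tag each
event before returning it. This is a sketch assuming the Event.set(attr,
byte[]) attribute API; using the binlog file name plus byte offset as the
unique ID is just one option (the "# at <offset>" comments in the mysqlbinlog
output carry the offset), and the helper class is invented for illustration:

  import com.cloudera.flume.core.Event;
  import com.cloudera.flume.core.EventImpl;

  /**
   * Sketch: wrap one assembled statement in an event and attach the
   * attributes downstream consumers need to re-order and deduplicate.
   */
  public final class BinLogEvents {
    private BinLogEvents() {}

    public static Event tagged(String statement, long timestampMillis,
                               String binlogFile, long offset) {
      Event e = new EventImpl(statement.getBytes());
      // downstream consumers order by this
      e.set("binlog.timestamp", Long.toString(timestampMillis).getBytes());
      // and deduplicate on this; file + offset should be unique per server
      e.set("binlog.txnId", (binlogFile + ":" + offset).getBytes());
      return e;
    }
  }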

This would be interesting functionality to get back into Flume. If you
can / want to contribute it back in the form of a custom source, feel
free to open a JIRA so others can help / watch progress.

Thanks!

On Tue, Aug 9, 2011 at 11:42 AM, Felix Giguere Villegas
<felix.giguere@mate1inc.com> wrote:
> Hi :) !
>
> I have a use case where I want to keep a historical record of all the
> changes (insert/update/delete) happening on a MySQL DB.
>
> I am able to tail the bin logs and record them in HDFS, but they are not
> easy to parse because one operation is split across many lines. There are
> some comments that include the timestamp, the total time it took to execute
> the query, and other things. A lot of this extra info is not relevant, but
> the timestamp is important to me, and I figured I might as well keep the
> rest, since the raw data gives me the option of going back to look for
> these other fields if I determine later on that I need them.
>
> Now, the fact that it's split over many lines makes it harder to use with
> Map/Reduce.
>
> I have thought of using a custom M/R RecordReader but I still have the
> problem that some of the lines related to one operation will be at the end
> of one HDFS file and the rest will be at the beginning of the next HDFS
> file, since I am opening and closing those files at an arbitrary roll time.
>
> I think the easiest way would be to do some minimal ETL at the source. I
> think I could use a custom decorator for this. Basically, that decorator
> would group together on a single line all the bin log lines that relate to a
> single DB operation. The original lines would be separated by semi-colons or
> some other character in the final output.
>
> I wanted to check with you guys to see if that approach made sense. If you
> have better suggestions, then I'm all ears, of course. Also, if you think
> there is an easier way than reading the bin logs to accomplish my original
> goal, then I'd like to hear about it as well :)
>
> Thanks :) !
>
> --
> Felix
>
>



-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com
