Subject: Re: Best way to extract mysql bin logs?
From: Eric Sammer <esammer@cloudera.com>
To: flume-user@incubator.apache.org
Date: Tue, 9 Aug 2011 13:06:51 -0700

Felix:

You definitely need to implement a custom source that knows how to read
the bin logs and pack each transaction into an event, rather than just
tailing the log. This will give you a discrete event for each transaction
that can be treated as a single unit, and it makes downstream processing
MUCH easier. Things to keep in mind:

* Flume does NOT guarantee ordering, so make sure each event carries a
timestamp or transaction ID that you can order by.
* Flume does NOT guarantee that you won't get duplicates, so make sure
each event carries a globally unique transaction ID that you can
deduplicate on.
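To make that concrete, below is roughly the shape of such a source
against the plugin API (EventSource.Base, same pattern as the plugin
example in the user guide). It's an untested sketch typed from memory,
so double-check it against the plugin docs. BinLogReader and Transaction
are placeholders for whatever binlog parsing code you write (e.g.
something wrapping mysqlbinlog output) -- they are NOT Flume classes:

import java.io.IOException;

import com.cloudera.flume.core.Event;
import com.cloudera.flume.core.EventImpl;
import com.cloudera.flume.core.EventSource;

public class MysqlBinLogSource extends EventSource.Base {

  // Hypothetical parser interface: a stand-in for your own binlog
  // parsing code, not something Flume provides.
  public interface BinLogReader {
    Transaction nextTransaction() throws IOException;
    void close() throws IOException;
  }

  // Hypothetical parsed-transaction type, also yours to define.
  public interface Transaction {
    byte[] toBytes();   // the whole transaction, serialized as one body
    String globalId();  // globally unique transaction ID (for dedup)
    long timestamp();   // commit timestamp (for ordering)
  }

  private final String binLogPath;
  private BinLogReader reader;

  public MysqlBinLogSource(String binLogPath) {
    this.binLogPath = binLogPath;
  }

  @Override
  public void open() throws IOException {
    reader = openReader(binLogPath); // however you construct your parser
  }

  @Override
  public Event next() throws IOException {
    // Block until a complete transaction has been parsed out of the
    // log, then ship it as a single event.
    Transaction tx = reader.nextTransaction();
    Event e = new EventImpl(tx.toBytes());
    // Stamp the ordering / dedup metadata onto the event as attributes
    // so downstream consumers can sort and deduplicate:
    e.set("mysql.txid", tx.globalId().getBytes());
    e.set("mysql.timestamp", Long.toString(tx.timestamp()).getBytes());
    return e;
  }

  @Override
  public void close() throws IOException {
    reader.close();
  }

  private BinLogReader openReader(String path) throws IOException {
    throw new IOException("binlog parsing left as an exercise: " + path);
  }
}

You'd register it like any other plugin (a builder() hook plus a
flume.plugin.classes entry in flume-site.xml); the sources that ship
with Flume are good examples to crib from.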
This would be interesting functionality to get into Flume. If you can /
want to contribute it back in the form of a custom source, feel free to
open a JIRA so others can help / watch progress.

Thanks!

On Tue, Aug 9, 2011 at 11:42 AM, Felix Giguere Villegas wrote:

> Hi :) !
>
> I have a use case where I want to keep a historical record of all the
> changes (insert/update/delete) happening on a MySQL DB.
>
> I am able to tail the bin logs and record them in HDFS, but they are not
> easy to parse because one operation is split across many lines. There are
> some comments that include the timestamp, the total time it took to
> execute the query, and other stuff. A lot of this extra info is not
> relevant, but the timestamp is important to me, and I thought I might as
> well keep the rest of the info too, since the raw data gives me the
> option of going back to look for those other fields if I determine later
> on that I need them.
>
> Now, the fact that it's split across many lines makes it harder to use
> with Map/Reduce.
>
> I have thought of using a custom M/R RecordReader, but I still have the
> problem that some of the lines related to one operation will be at the
> end of one HDFS file and the rest will be at the beginning of the next,
> since I am opening and closing those files at an arbitrary roll time.
>
> I think the easiest way would be to do some minimal ETL at the source. I
> think I could use a custom decorator for this. Basically, that decorator
> would group together on a single line all the bin log lines that relate
> to a single DB operation. The original lines would be separated by
> semicolons or some other character in the final output.
>
> I wanted to check with you guys to see if that approach made sense. If
> you have better suggestions, then I'm all ears, of course. Also, if you
> think there is an easier way than reading the bin logs to accomplish my
> original goal, I'd like to hear about it as well :)
>
> Thanks :) !
>
> --
> Felix
>

--
Eric Sammer
twitter: esammer
data: www.cloudera.com