incubator-flume-user mailing list archives

From Felix Giguere Villegas <felix.gigu...@mate1inc.com>
Subject Re: Best way to extract mysql bin logs?
Date Tue, 09 Aug 2011 20:34:09 GMT
Ah, so a source would make more sense than a decorator then. I see...

We are definitely open to the idea of contributing back, and this is indeed
probably something that a lot of people could use...

We are still evaluating what we will do, as we have a lot of stuff going on
at once, but if we do decide to develop a custom source, then I'll do as you
suggest and open a JIRA issue.

Thanks for your time and info :)

--
Felix



On Tue, Aug 9, 2011 at 4:06 PM, Eric Sammer <esammer@cloudera.com> wrote:

> Felix:
>
> You definitely need to implement a custom source that knows how to
> read the bin logs and pack each transaction into an event, rather than
> just tailing them. This will give you a discrete event for each
> transaction that can be treated as a single unit, which makes downstream
> processing MUCH easier.
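>
> As a rough sketch (written against the Flume NG PollableSource API, so
> treat it as illustrative rather than final; BinlogReader and
> BinlogTransaction are made-up stand-ins for whatever binlog parsing you
> end up writing), the source could look like this:
>
>     import java.nio.charset.Charset;
>     import java.util.HashMap;
>     import java.util.Map;
>
>     import org.apache.flume.Context;
>     import org.apache.flume.Event;
>     import org.apache.flume.EventDeliveryException;
>     import org.apache.flume.PollableSource;
>     import org.apache.flume.conf.Configurable;
>     import org.apache.flume.event.EventBuilder;
>     import org.apache.flume.source.AbstractSource;
>
>     public class BinlogSource extends AbstractSource
>             implements PollableSource, Configurable {
>
>         private BinlogReader reader; // hypothetical binlog parser
>
>         @Override
>         public void configure(Context context) {
>             // e.g. where to find the binlog file(s) to follow
>             reader = new BinlogReader(context.getString("binlog.path"));
>         }
>
>         @Override
>         public Status process() throws EventDeliveryException {
>             // One fully assembled transaction per call, or null if
>             // nothing new has been written yet.
>             BinlogTransaction txn = reader.nextTransaction();
>             if (txn == null) {
>                 return Status.BACKOFF;
>             }
>             // Headers carry the ordering / dedup keys mentioned below.
>             Map<String, String> headers = new HashMap<String, String>();
>             headers.put("timestamp", Long.toString(txn.getTimestamp()));
>             headers.put("txnId", txn.getGlobalId());
>             Event event = EventBuilder.withBody(
>                     txn.getStatements().getBytes(Charset.forName("UTF-8")),
>                     headers);
>             getChannelProcessor().processEvent(event);
>             return Status.READY;
>         }
>     }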
>
> Things to keep in mind:
> * Flume does NOT guarantee order, so make sure each event has a
> timestamp or transaction ID that you can order by.
> * Flume does NOT guarantee that you won't get duplicates, so make sure
> you have a globally unique transaction ID that lets you deduplicate
> transactions.
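>
> Downstream, a per-event timestamp and a globally unique transaction ID
> make that cleanup mechanical. As a toy illustration (plain Java;
> TxnRecord is a made-up holder for the transaction ID, timestamp, and
> body pulled out of each event), dedup-then-reorder is just:
>
>     import java.util.ArrayList;
>     import java.util.Collections;
>     import java.util.Comparator;
>     import java.util.HashMap;
>     import java.util.List;
>     import java.util.Map;
>
>     public class TxnCleanup {
>         // Keep the first copy seen for each transaction ID, then
>         // restore commit order by sorting on the timestamp.
>         public static List<TxnRecord> dedupAndSort(List<TxnRecord> raw) {
>             Map<String, TxnRecord> byId = new HashMap<String, TxnRecord>();
>             for (TxnRecord r : raw) {
>                 if (!byId.containsKey(r.getTxnId())) {
>                     byId.put(r.getTxnId(), r);
>                 }
>             }
>             List<TxnRecord> out = new ArrayList<TxnRecord>(byId.values());
>             Collections.sort(out, new Comparator<TxnRecord>() {
>                 public int compare(TxnRecord a, TxnRecord b) {
>                     long d = a.getTimestamp() - b.getTimestamp();
>                     return d < 0 ? -1 : (d > 0 ? 1 : 0);
>                 }
>             });
>             return out;
>         }
>     }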
>
> This would be an interesting piece of functionality to get back into
> Flume. If you can / want to contribute it back in the form of a custom
> source, feel free to open a JIRA so others can help / watch progress.
>
> Thanks!
>
> On Tue, Aug 9, 2011 at 11:42 AM, Felix Giguere Villegas
> <felix.giguere@mate1inc.com> wrote:
> > Hi :) !
> >
> > I have a use case where I want to keep a historical record of all the
> > changes (insert/update/delete) happening on a MySQL DB.
> >
> > I am able to tail the bin logs and record them in HDFS, but they are not
> > easy to parse because one operation is split across many lines. There are
> > some comments that include the timestamp, the total time it took to
> > execute the query, and other stuff. A lot of this extra info is not
> > relevant, but the timestamp is important for me, and I thought I might as
> > well keep the rest, since the raw data gives me the option of going back
> > to look for these other fields if I determine later on that I need them.
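> >
> > For reference, a single entry in the mysqlbinlog output looks roughly
> > like this (abridged and from memory, so don't hold me to the exact
> > fields):
> >
> >     # at 1274
> >     #110809 16:34:09 server id 1  end_log_pos 1401  Query  thread_id=12  exec_time=0  error_code=0
> >     SET TIMESTAMP=1312922049/*!*/;
> >     INSERT INTO users (name) VALUES ('felix')
> >     /*!*/;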
> >
> > Now, the fact that it's split over many lines makes it harder to use with
> > Map/Reduce.
> >
> > I have thought of using a custom M/R RecordReader, but I still have the
> > problem that some of the lines related to one operation will be at the
> > end of one HDFS file and the rest will be at the beginning of the next
> > HDFS file, since I am opening and closing those files at an arbitrary
> > roll time.
> >
> > I think the easiest way would be to do some minimal ETL at the source,
> > and I think I could use a custom decorator for this. Basically, that
> > decorator would group all the bin log lines that relate to a single DB
> > operation onto a single line. The original lines would be separated by
> > semi-colons or some other character in the final output.
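> >
> > Concretely, the grouping I have in mind is something like this
> > (illustrative Java; it keys off the "# at <offset>" comment that
> > mysqlbinlog prints at the start of each entry):
> >
> >     import java.util.ArrayList;
> >     import java.util.List;
> >
> >     public class BinlogLineGrouper {
> >         // Every "# at <offset>" comment starts a new binlog entry, so
> >         // flush the previous group when we see one, joining its lines
> >         // with ';' so each entry becomes a single output line.
> >         public static List<String> group(List<String> rawLines) {
> >             List<String> grouped = new ArrayList<String>();
> >             StringBuilder current = new StringBuilder();
> >             for (String line : rawLines) {
> >                 if (line.startsWith("# at ") && current.length() > 0) {
> >                     grouped.add(current.toString());
> >                     current.setLength(0);
> >                 }
> >                 if (current.length() > 0) {
> >                     current.append(';');
> >                 }
> >                 current.append(line);
> >             }
> >             if (current.length() > 0) {
> >                 grouped.add(current.toString());
> >             }
> >             return grouped;
> >         }
> >     }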
> >
> > I wanted to check with you guys to see if that approach made sense. If
> > you have better suggestions, then I'm all ears, of course. Also, if you
> > think there is an easier way than reading the bin logs to accomplish my
> > original goal, then I'd like to hear about it as well :)
> >
> > Thanks :) !
> >
> > --
> > Felix
> >
> >
>
>
>
> --
> Eric Sammer
> twitter: esammer
> data: www.cloudera.com
>
