incubator-flume-user mailing list archives

From Jonathan Hsieh <...@cloudera.com>
Subject Re: Best way to extract mysql bin logs?
Date Thu, 11 Aug 2011 23:25:36 GMT
Good to hear.  I think it would be great if you wrote up a little blog
post about your use case.  Sounds interesting!

Jon.

On Thu, Aug 11, 2011 at 2:00 PM, Felix Giguere Villegas <
felix.giguere@mate1inc.com> wrote:

> Quick update.
>
> Sometimes the simplest solutions are the best hehe... We decided to just
> use a Perl one-liner to modify the data as it's being output by mysqlbinlog:
>
> perl -e 'while (<STDIN>) {s/\r\n|\r|\n/ /g;s/# at/\n# at/g;print;}'
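>
> It collapses everything onto one line by replacing newlines with spaces,
> then re-breaks before each "# at" offset marker, so we get one binlog
> entry per line. For example (offsets and timestamps made up for
> illustration), raw mysqlbinlog output like:
>
> # at 106
> #110811 14:00:00 server id 1  end_log_pos 193  Query  thread_id=5  exec_time=0
> SET TIMESTAMP=1313089200/*!*/;
> insert into t values (1)/*!*/;
>
> comes out as a single line:
>
> # at 106 #110811 14:00:00 server id 1  end_log_pos 193  Query  thread_id=5  exec_time=0 SET TIMESTAMP=1313089200/*!*/; insert into t values (1)/*!*/;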
>
> We write this modified binlog to a file and tail it with Flume. This way,
> MySQL events are grouped one per line, and even if they reach their destination
> in the wrong order, it's easy to sort them correctly downstream :)
>
> Eventually, we will probably wrap it all up in a script and use the execStream
> Flume source. That way, we won't need to write an intermediate/temporary
> file to disk, which will leave more disk I/O throughput available for DFO
> or E2E reliability...
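>
> Roughly like this (the script path, node name and collector host are made
> up, and the spec syntax is from the 0.9.x docs, so double-check it):
>
> #!/bin/sh
> # binlog-stream.sh: emit the binlog with one entry per line
> mysqlbinlog /var/lib/mysql/mysql-bin.000001 \
>   | perl -e 'while (<STDIN>) {s/\r\n|\r|\n/ /g;s/# at/\n# at/g;print;}'
>
> and then in the Flume config:
>
> binlog-node : execStream("/path/to/binlog-stream.sh") | agentDFOSink("collector-host", 35853);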
>
> --
> Felix
>
>
>
>
> On Tue, Aug 9, 2011 at 4:34 PM, Felix Giguere Villegas <
> felix.giguere@mate1inc.com> wrote:
>
>> Ah, so a source would make more sense than a decorator then. I see...
>>
>> We are definitely open to the idea of contributing back, and this is
>> indeed probably something that a lot of people could use...
>>
>> We are still evaluating what we will do, as we have a lot of stuff going
>> on at once, but if we do decide to develop a custom source, then I'll do as
>> you suggest and open a JIRA issue.
>>
>> Thanks for your time and info :)
>>
>> --
>> Felix
>>
>>
>>
>>
>> On Tue, Aug 9, 2011 at 4:06 PM, Eric Sammer <esammer@cloudera.com> wrote:
>>
>>> Felix:
>>>
>>> You definitely need to implement a custom source that knows how to
>>> read the bin logs and pack each transaction into an event, rather than
>>> just tailing them. This will give you a discrete event for each
>>> transaction that can be treated as a single unit and make downstream
>>> processing MUCH easier.
>>>
>>> Things to keep in mind:
>>> * Flume does NOT guarantee order so make sure each event has a
>>> timestamp or transaction ID that you can order by.
>>> * Flume does NOT guarantee that you won't get duplicates so make sure
>>> you have a globally unique transaction ID so you can deduplicate
>>> transactions.
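>>>
>>> Off the top of my head, a source skeleton against the 0.9.x plugin API
>>> would look something like this (readNextTransaction() and txnIdFor()
>>> are placeholders for your parsing logic, and I've left out the
>>> builder() hook you'd need to register it in the config language):
>>>
>>> import java.io.IOException;
>>>
>>> import com.cloudera.flume.core.Event;
>>> import com.cloudera.flume.core.EventImpl;
>>> import com.cloudera.flume.core.EventSource;
>>>
>>> public class BinlogSource extends EventSource.Base {
>>>
>>>   @Override
>>>   public void open() throws IOException {
>>>     // set up the binlog input here (file or mysqlbinlog pipe)
>>>   }
>>>
>>>   @Override
>>>   public Event next() throws IOException {
>>>     // block until one complete transaction has been read
>>>     String txn = readNextTransaction();
>>>     Event e = new EventImpl(txn.getBytes());
>>>     // attach a globally unique key (e.g. binlog file + offset) so
>>>     // downstream consumers can re-order and deduplicate
>>>     e.set("txnId", txnIdFor(txn).getBytes());
>>>     return e;
>>>   }
>>>
>>>   @Override
>>>   public void close() throws IOException {
>>>     // tear down the binlog input
>>>   }
>>>
>>>   private String readNextTransaction() throws IOException {
>>>     // placeholder: accumulate lines until a full transaction is seen
>>>     throw new UnsupportedOperationException("fill me in");
>>>   }
>>>
>>>   private String txnIdFor(String txn) {
>>>     // placeholder: derive a unique id from the "# at" offset comment
>>>     throw new UnsupportedOperationException("fill me in");
>>>   }
>>> }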
>>>
>>> This would be interesting functionality to get back into Flume. If you
>>> can / want to contribute it back in the form of a custom source, feel
>>> free to open a JIRA so others can help / watch progress.
>>>
>>> Thanks!
>>>
>>> On Tue, Aug 9, 2011 at 11:42 AM, Felix Giguere Villegas
>>> <felix.giguere@mate1inc.com> wrote:
>>> > Hi :) !
>>> >
>>> > I have a use case where I want to keep a historical record of all the
>>> > changes (insert/update/delete) happening on a MySQL DB.
>>> >
>>> > I am able to tail the bin logs and record them in HDFS, but they are not
>>> > easy to parse because one operation is split across many lines. There are
>>> > some comments that include the timestamp, the total time it took to
>>> > execute the query and other stuff. A lot of this extra info is not
>>> > relevant, but the timestamp is important for me, and I thought I might as
>>> > well keep the rest of the info as well, since the raw data gives me the
>>> > option of going back to look for these other fields if I determine later
>>> > on that I need them.
>>> >
>>> > Now, the fact that it's split over many lines makes it harder to use with
>>> > Map/Reduce.
>>> >
>>> > I have thought of using a custom M/R RecordReader but I still have the
>>> > problem that some of the lines related to one operation will be at the
>>> > end of one HDFS file and the rest will be at the beginning of the next
>>> > HDFS file, since I am opening and closing those files at an arbitrary
>>> > roll time.
>>> >
>>> > I think the easiest way would be to do some minimal ETL at the source,
>>> > and I think I could use a custom decorator for this. Basically, that
>>> > decorator would group together on a single line all the bin log lines
>>> > that relate to a single DB operation. The original lines would be
>>> > separated by semicolons or some other character in the final output.
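>>> >
>>> > To make it concrete, I was picturing something like this against the
>>> > 0.9.x plugin API (class and method names as I understand them from the
>>> > plugin docs; the buffering logic is just a sketch):
>>> >
>>> > import java.io.IOException;
>>> >
>>> > import com.cloudera.flume.core.Event;
>>> > import com.cloudera.flume.core.EventImpl;
>>> > import com.cloudera.flume.core.EventSink;
>>> > import com.cloudera.flume.core.EventSinkDecorator;
>>> >
>>> > public class BinlogGroupDecorator<S extends EventSink>
>>> >     extends EventSinkDecorator<S> {
>>> >   private final StringBuilder buf = new StringBuilder();
>>> >
>>> >   public BinlogGroupDecorator(S s) {
>>> >     super(s);
>>> >   }
>>> >
>>> >   @Override
>>> >   public void append(Event e) throws IOException {
>>> >     String line = new String(e.getBody());
>>> >     if (line.startsWith("# at") && buf.length() > 0) {
>>> >       // a new binlog entry begins: flush the buffered lines as one event
>>> >       super.append(new EventImpl(buf.toString().getBytes()));
>>> >       buf.setLength(0);
>>> >     }
>>> >     buf.append(line).append(';'); // join the original lines with semicolons
>>> >   }
>>> >
>>> >   // (a real version would also flush any remaining buffer in close())
>>> > }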
>>> >
>>> > I wanted to check with you guys to see if that approach made sense. If
>>> > you have better suggestions, then I'm all ears, of course. Also, if you
>>> > think there is an easier way than reading the bin logs to accomplish my
>>> > original goal, then I'd like to hear about it as well :)
>>> >
>>> > Thanks :) !
>>> >
>>> > --
>>> > Felix
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Eric Sammer
>>> twitter: esammer
>>> data: www.cloudera.com
>>>
>>
>>
>


-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com
