incubator-flume-user mailing list archives

From Felix Giguere Villegas <felix.gigu...@mate1inc.com>
Subject Re: Best way to extract mysql bin logs?
Date Fri, 12 Aug 2011 13:40:32 GMT
Yes, I'm planning to do that once we get something a little more concrete
working end to end.

I'll let you guys know for sure :)

--
Felix



On Thu, Aug 11, 2011 at 7:25 PM, Jonathan Hsieh <jon@cloudera.com> wrote:

> Good to hear.  I think it would be great if you wrote up a little
> blog post about your use case.  Sounds interesting!
>
> Jon.
>
>
> On Thu, Aug 11, 2011 at 2:00 PM, Felix Giguere Villegas <
> felix.giguere@mate1inc.com> wrote:
>
>> Quick update.
>>
>> Sometimes the simplest solutions are the best hehe... We decided to just
>> use a perl script to modify the data as it's being output by mysqlbinlog:
>>
>> perl -e 'while (<STDIN>) {s/\r\n|\r|\n/ /g;s/# at/\n# at/g;print;}'
>>
>> We write this modified binlog to a file and tail it using Flume. This way,
>> MySQL events are grouped one per line, and even if they reach their destination
>> in the wrong order, it's easy to sort them correctly downstream :)
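>>
>> Schematically, the one-liner turns a multi-line mysqlbinlog entry like this
>> (shape only, the exact fields vary):
>>
>> # at 4711
>> #110812 13:40:01 server id 1  end_log_pos 4802  Query  thread_id=12 ...
>> SET TIMESTAMP=1313156401/*!*/;
>> INSERT INTO users VALUES (42, 'foo')
>> /*!*/;
>>
>> into a single line per entry, each one starting at its "# at" marker:
>>
>> # at 4711 #110812 13:40:01 server id 1  end_log_pos 4802  Query ... INSERT INTO users VALUES (42, 'foo') /*!*/;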
>>
>> Eventually, we will probably wrap it all up in a script and use the execStream
>> Flume source. That way, we won't need to write an intermediate/temporary
>> file to disk, which will leave us more disk I/O throughput for DFO or E2E
>> reliability...
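>>
>> (Presumably something like the following in the node spec, where
>> binlog2lines.sh is a hypothetical wrapper around the mysqlbinlog-plus-perl
>> pipeline above, and the exact syntax should be double-checked against the
>> Flume docs -- agentE2ESink would be the E2E variant:)
>>
>> agent1 : execStream("binlog2lines.sh") | agentDFOSink("collector-host", 35853);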
>>
>> --
>> Felix
>>
>>
>>
>>
>> On Tue, Aug 9, 2011 at 4:34 PM, Felix Giguere Villegas <
>> felix.giguere@mate1inc.com> wrote:
>>
>>> Ah, so a source would make more sense than a decorator then. I see...
>>>
>>> We are definitely open to the idea of contributing back, and this is
>>> indeed probably something that a lot of people could use...
>>>
>>> We are still evaluating what we will do, as we have a lot of stuff going
>>> on at once, but if we do decide to develop a custom source, then I'll do as
>>> you suggest and open a JIRA issue.
>>>
>>> Thanks for your time and info :)
>>>
>>> --
>>> Felix
>>>
>>>
>>>
>>>
>>> On Tue, Aug 9, 2011 at 4:06 PM, Eric Sammer <esammer@cloudera.com> wrote:
>>>
>>>> Felix:
>>>>
>>>> You definitely need to implement a custom source that knows how to
>>>> read the bin logs and pack each transaction into an event, rather than
>>>> just tailing them. This will give you a discrete event for each
>>>> transaction that can be treated as a single unit and make downstream
>>>> processing MUCH easier.
>>>>
>>>> Things to keep in mind:
>>>> * Flume does NOT guarantee ordering, so make sure each event has a
>>>> timestamp or transaction ID that you can order by.
>>>> * Flume does NOT guarantee that you won't get duplicates, so make sure
>>>> each event also has a globally unique transaction ID that you can
>>>> deduplicate by (see the sketch below).
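>>>>
>>>> To make both points concrete, a rough, Flume-agnostic sketch -- every
>>>> name below is a placeholder rather than Flume API, and the
>>>> server:file:offset ID scheme is just one way to get a globally unique
>>>> key:
>>>>
>>>> import java.util.HashSet;
>>>> import java.util.Set;
>>>>
>>>> // The metadata each event should carry: a timestamp to order by and a
>>>> // globally unique transaction ID to deduplicate by.
>>>> class TxnEvent {
>>>>   final long timestampMs;  // taken from the binlog header comment
>>>>   final String txnId;      // e.g. "db01:binlog.000042:4711"
>>>>   final String body;       // the flattened transaction text
>>>>   TxnEvent(long ts, String id, String body) {
>>>>     this.timestampMs = ts; this.txnId = id; this.body = body;
>>>>   }
>>>> }
>>>>
>>>> // Downstream: keep only the first occurrence of each transaction ID,
>>>> // then sort the survivors by (timestampMs, txnId).
>>>> class Deduper {
>>>>   private final Set<String> seen = new HashSet<String>();
>>>>   boolean firstTime(TxnEvent e) { return seen.add(e.txnId); }
>>>> }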
>>>>
>>>> This would be interesting functionality to get back into Flume. If you
>>>> can / want to contribute it back in the form of a custom source, feel
>>>> free to open a JIRA so others can help / watch progress.
>>>>
>>>> Thanks!
>>>>
>>>> On Tue, Aug 9, 2011 at 11:42 AM, Felix Giguere Villegas
>>>> <felix.giguere@mate1inc.com> wrote:
>>>> > Hi :) !
>>>> >
>>>> > I have a use case where I want to keep a historical record of all the
>>>> > changes (insert/update/delete) happening on a MySQL DB.
>>>> >
>>>> > I am able to tail the bin logs and record them in HDFS, but they are not
>>>> > easy to parse because one operation is split on many lines. There are some
>>>> > comments that include the timestamp, the total time it took to execute the
>>>> > query and other stuff. A lot of this extra info is not relevant, but the
>>>> > timestamp is important for me, and I thought I might as well keep the rest
>>>> > of the info as well since the raw data gives me the option of going back to
>>>> > look for these other fields if I determine later on that I need them.
>>>> >
>>>> > Now, the fact that it's split over many lines makes it harder to use with
>>>> > Map/Reduce.
>>>> >
>>>> > I have thought of using a custom M/R RecordReader but I still have the
>>>> > problem that some of the lines related to one operation will be at the end
>>>> > of one HDFS file and the rest will be at the beginning of the next HDFS
>>>> > file, since I am opening and closing those files at an arbitrary roll time.
>>>> >
>>>> > I think the easiest way would be to do some minimal ETL at the source. I
>>>> > think I could use a custom decorator for this. Basically, that decorator
>>>> > would group together on a single line all the bin log lines that relate to a
>>>> > single DB operation. The original lines would be separated by semi-colons or
>>>> > some other character in the final output.
>>>> >
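>>>> > Roughly, the grouping I have in mind (an illustrative Java sketch, not
>>>> > real decorator code -- the names are placeholders):
>>>> >
>>>> > // Fold each multi-line binlog entry into one semicolon-joined line,
>>>> > // treating every "# at" marker as the start of a new entry.
>>>> > static void flatten(java.io.BufferedReader in, java.io.PrintWriter out)
>>>> >     throws java.io.IOException {
>>>> >   StringBuilder entry = null;
>>>> >   for (String line; (line = in.readLine()) != null; ) {
>>>> >     if (line.startsWith("# at ")) {
>>>> >       if (entry != null) out.println(entry);   // emit the finished entry
>>>> >       entry = new StringBuilder(line);
>>>> >     } else if (entry != null) {
>>>> >       entry.append(';').append(line);          // join continuation lines
>>>> >     }
>>>> >   }
>>>> >   if (entry != null) out.println(entry);       // flush the last entry
>>>> > }
>>>> >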
>>>> > I wanted to check with you guys to see if that approach made sense. If you
>>>> > have better suggestions, then I'm all ears, of course. Also, if you think
>>>> > there is an easier way than reading the bin logs to accomplish my original
>>>> > goal, then I'd like to hear about it as well :)
>>>> >
>>>> > Thanks :) !
>>>> >
>>>> > --
>>>> > Felix
>>>> >
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Eric Sammer
>>>> twitter: esammer
>>>> data: www.cloudera.com
>>>>
>>>
>>>
>>
>
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com
>
>
>
