Subject: Re: Best way to extract mysql bin logs?
From: Eric Sammer <esammer@cloudera.com>
To: flume-user@incubator.apache.org
Date: Tue, 9 Aug 2011 13:06:51 -0700

Felix:

You definitely need to implement a custom source that knows how to read
the bin logs and pack each transaction into an event, rather than just
tailing the log. This will give you a discrete event for each transaction
that can be treated as a single unit, and it makes downstream processing
MUCH easier. Things to keep in mind:

* Flume does NOT guarantee ordering, so make sure each event carries a
timestamp or transaction ID that you can order by.
* Flume does NOT guarantee that you won't get duplicates, so make sure
each event carries a globally unique transaction ID that you can
deduplicate on.
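To make that concrete, below is roughly the shape of such a source
against the plugin API (EventSource.Base, same pattern as the plugin
example in the user guide). It's an untested sketch typed from memory,
so double-check it against the plugin docs. BinLogReader and Transaction
are placeholders for whatever binlog parsing code you write (e.g.
something wrapping mysqlbinlog output) -- they are NOT Flume classes:

import java.io.IOException;

import com.cloudera.flume.core.Event;
import com.cloudera.flume.core.EventImpl;
import com.cloudera.flume.core.EventSource;

public class MysqlBinLogSource extends EventSource.Base {

  // Hypothetical parser interface: a stand-in for your own binlog
  // parsing code, not something Flume provides.
  public interface BinLogReader {
    Transaction nextTransaction() throws IOException;
    void close() throws IOException;
  }

  // Hypothetical parsed-transaction type, also yours to define.
  public interface Transaction {
    byte[] toBytes();   // the whole transaction, serialized as one body
    String globalId();  // globally unique transaction ID (for dedup)
    long timestamp();   // commit timestamp (for ordering)
  }

  private final String binLogPath;
  private BinLogReader reader;

  public MysqlBinLogSource(String binLogPath) {
    this.binLogPath = binLogPath;
  }

  @Override
  public void open() throws IOException {
    reader = openReader(binLogPath); // however you construct your parser
  }

  @Override
  public Event next() throws IOException {
    // Block until a complete transaction has been parsed out of the
    // log, then ship it as a single event.
    Transaction tx = reader.nextTransaction();
    Event e = new EventImpl(tx.toBytes());
    // Stamp the ordering / dedup metadata onto the event as attributes
    // so downstream consumers can sort and deduplicate:
    e.set("mysql.txid", tx.globalId().getBytes());
    e.set("mysql.timestamp", Long.toString(tx.timestamp()).getBytes());
    return e;
  }

  @Override
  public void close() throws IOException {
    reader.close();
  }

  private BinLogReader openReader(String path) throws IOException {
    throw new IOException("binlog parsing left as an exercise: " + path);
  }
}

You'd register it like any other plugin (a builder() hook plus a
flume.plugin.classes entry in flume-site.xml); the sources that ship
with Flume are good examples to crib from.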
This would be interesting functionality to get into Flume. If you can /
want to contribute it back in the form of a custom source, feel free to
open a JIRA so others can help / watch progress.

Thanks!

On Tue, Aug 9, 2011 at 11:42 AM, Felix Giguere Villegas wrote:

> Hi :) !
>
> I have a use case where I want to keep a historical record of all the
> changes (insert/update/delete) happening on a MySQL DB.
>
> I am able to tail the bin logs and record them in HDFS, but they are not
> easy to parse because one operation is split across many lines. There are
> some comments that include the timestamp, the total time it took to
> execute the query, and other stuff. A lot of this extra info is not
> relevant, but the timestamp is important to me, and I thought I might as
> well keep the rest of the info too, since the raw data gives me the
> option of going back to look for those other fields if I determine later
> on that I need them.
>
> Now, the fact that it's split across many lines makes it harder to use
> with Map/Reduce.
>
> I have thought of using a custom M/R RecordReader, but I still have the
> problem that some of the lines related to one operation will be at the
> end of one HDFS file and the rest will be at the beginning of the next,
> since I am opening and closing those files at an arbitrary roll time.
>
> I think the easiest way would be to do some minimal ETL at the source. I
> think I could use a custom decorator for this. Basically, that decorator
> would group together on a single line all the bin log lines that relate
> to a single DB operation. The original lines would be separated by
> semicolons or some other character in the final output.
>
> I wanted to check with you guys to see if that approach made sense. If
> you have better suggestions, then I'm all ears, of course. Also, if you
> think there is an easier way than reading the bin logs to accomplish my
> original goal, I'd like to hear about it as well :)
>
> Thanks :) !
>
> --
> Felix
>

--
Eric Sammer
twitter: esammer
data: www.cloudera.com