Ah, so a source would make more sense than a decorator then. I see...
We are definitely open to the idea of contributing back, and this is indeed probably something that a lot of people could use...
We are still evaluating what we will do, as we have a lot of stuff going on at once, but if we do decide to develop a custom source, then I'll do as you suggest and open a JIRA issue.
Thanks for your time and info :)
You definitely need to implement a custom source that knows how to
read the bin logs and pack each transaction into an event rather than
just tailing it. This will give you a discrete event for each
transaction that can be treated as a single unit and make downstream
processing MUCH easier.
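As a rough illustration of the grouping step (not tied to any particular Flume source API), here is a minimal sketch that collapses the text output of mysqlbinlog into one string per transaction, assuming transactions are bracketed by BEGIN/COMMIT lines; the custom source would feed each resulting string into a single event body:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Sketch: group mysqlbinlog text output into one record per transaction.
// Assumes transactions are bracketed by BEGIN ... COMMIT lines, as in the
// text dump produced by the mysqlbinlog tool; '#' comment lines (which
// carry the timestamp etc.) stay attached to their transaction.
public class BinlogTransactionGrouper {

    /** Collapses the lines of each transaction into a single string,
     *  joining the original lines with "; " so that one transaction
     *  can become one event body. */
    public static List<String> group(BufferedReader in) throws IOException {
        List<String> transactions = new ArrayList<>();
        StringBuilder current = null;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("BEGIN")) {
                current = new StringBuilder(line);
            } else if (line.startsWith("COMMIT")) {
                if (current != null) {
                    current.append("; ").append(line);
                    transactions.add(current.toString());
                    current = null;
                }
            } else if (current != null) {
                current.append("; ").append(line);
            }
            // lines outside a BEGIN/COMMIT pair (header noise) are dropped
        }
        return transactions;
    }

    public static void main(String[] args) throws IOException {
        String dump = "# at 4\nBEGIN\n# ts=1312907000\n"
                    + "INSERT INTO t VALUES (1)\nCOMMIT\n";
        List<String> txs = group(new BufferedReader(new StringReader(dump)));
        System.out.println(txs.get(0));
    }
}
```

The real source would of course read the binlog incrementally rather than from a string, but the buffering-until-COMMIT idea is the same.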
Things to keep in mind:
* Flume does NOT guarantee order, so make sure each event has a
timestamp or transaction ID that you can order by.
* Flume does NOT guarantee that you won't get duplicates, so make sure
you have a globally unique transaction ID that you can deduplicate on.
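For the downstream side, here is a minimal dedup-then-order sketch, assuming the hypothetical txnId and timestamp fields that the custom source would attach as event headers (the Evt record below is just a stand-in for a Flume event, not Flume's actual event class):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: downstream deduplication and ordering, assuming each event
// carries a globally unique transaction ID and a timestamp. Flume
// itself guarantees neither uniqueness nor order, so this must happen
// in the consumer.
public class EventDedupSort {

    // Minimal stand-in for a Flume event: a body plus the two headers
    // the custom source is assumed to set.
    record Evt(String body, long timestamp, String txnId) {}

    /** Drops events whose txnId was already seen, then orders the
     *  survivors by timestamp. */
    public static List<Evt> dedupAndSort(List<Evt> in) {
        Set<String> seen = new HashSet<>();
        List<Evt> out = new ArrayList<>();
        for (Evt e : in) {
            if (seen.add(e.txnId())) {  // add() returns false on duplicates
                out.add(e);
            }
        }
        out.sort(Comparator.comparingLong(Evt::timestamp));
        return out;
    }

    public static void main(String[] args) {
        List<Evt> events = List.of(
            new Evt("tx-b", 200, "b"),
            new Evt("tx-a", 100, "a"),
            new Evt("tx-b duplicate", 200, "b"));
        System.out.println(dedupAndSort(events));
    }
}
```

In practice the dedup set would need to be bounded (e.g. keyed by time window), but the idea is the same.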
This would be interesting functionality to get back into Flume. If you
can / want to contribute it back in the form of a custom source, feel
free to open a JIRA so others can help / watch progress.
On Tue, Aug 9, 2011 at 11:42 AM, Felix Giguere Villegas wrote:
> Hi :) !
> I have a use case where I want to keep a historical record of all the
> changes (insert/update/delete) happening on a MySQL DB.
> I am able to tail the bin logs and record them in HDFS, but they are not
> easy to parse because one operation is split on many lines. There are some
> comments that include the timestamp, the total time it took to execute the
> query and other stuff. A lot of this extra info is not relevant, but the
> timestamp is important for me, and I thought I might as well keep the rest
> of the info as well since the raw data gives me the option of going back to
> look for these other fields if I determine later on that I need them.
> Now, the fact that it's split over many lines makes it harder to use with
> I have thought of using a custom M/R RecordReader but I still have the
> problem that some of the lines related to one operation will be at the end
> of one HDFS file and the rest will be at the beginning of the next HDFS
> file, since I am opening and closing those files at an arbitrary roll time.
> I think the easiest way would be to do some minimal ETL at the source. I
> think I could use a custom decorator for this. Basically, that decorator
> would group together on a single line all the bin log lines that relate to a
> single DB operation. The original lines would be separated by semi-colons or
> some other character in the final output.
> I wanted to check with you guys to see if that approach made sense. If you
> have better suggestions, then I'm all ears, of course. Also, if you think
> there is an easier way than reading the bin logs to accomplish my original
> goal, then I'd like to hear about it as well :)
> Thanks :) !