flume-user mailing list archives

From Devin Suiter RDX <dsui...@rdx.com>
Subject Re: Handling malformed data when using custom AvroEventSerializer and HDFS Sink
Date Thu, 02 Jan 2014 15:40:06 GMT
Just throwing this out there, since I haven't had time to dig into the API
with a big fork, but can morphlines offer any assistance here?

Some kind of interceptor that would parse for malformed data, package
the offending data and send it somewhere (email it, log it), and then
project a valid "there was something wrong here" piece of data into the
field, then allow your channel to carry on? Or skip the projection piece
and just move along? I was just thinking that projecting known data into
a field that previously held malformed data would let you easily locate
those records later, while keeping your data shape consistent.
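
Even without morphlines, a plain Flume interceptor along those lines could
look roughly like this -- an untested sketch, with the class name and the
"malformed" header invented for illustration (Jackson is used for the parse
check). It tags bad JSON with a header instead of dropping it, so the event
keeps flowing but stays easy to find later:

import java.io.IOException;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import com.fasterxml.jackson.databind.ObjectMapper;

public class MalformedJsonTaggingInterceptor implements Interceptor {

  private final ObjectMapper mapper = new ObjectMapper();

  @Override
  public void initialize() {
  }

  @Override
  public Event intercept(Event event) {
    try {
      mapper.readTree(event.getBody());  // parse check only; result discarded
    } catch (IOException e) {
      // the "projection": a marker that later queries (or a channel
      // selector) can key on, while the original body is left intact
      event.getHeaders().put("malformed", "true");
    }
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    for (Event event : events) {
      intercept(event);
    }
    return events;
  }

  @Override
  public void close() {
  }

  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new MalformedJsonTaggingInterceptor();
    }

    @Override
    public void configure(Context context) {
    }
  }
}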

Kind of looking to Brock as a sounding board on whether this is an
appropriate solution, since morphlines takes some time to really
understand well...

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Thu, Jan 2, 2014 at 10:25 AM, Brock Noland <brock@cloudera.com> wrote:

>
> On Tue, Dec 31, 2013 at 8:34 PM, ed <edorsey@gmail.com> wrote:
>
>> Hello,
>>
>> We are using Flume v1.4 to load JSON formatted log data into HDFS as
>> Avro.  Our flume setup looks like this:
>>
>> NXLog ==> (FlumeHTTPSource -> HDFSSink w/ custom EventSerializer)
>>
>> Right now our custom EventSerializer (which extends
>> AbstractAvroEventSerializer) takes the JSON input from the HTTPSource and
>> converts it into an Avro record of the appropriate type for the incoming
>> log file.  This is working great, and we use the serializer to add some
>> additional "synthetic" fields to the Avro record that don't exist in the
>> original JSON log data.
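>>
>> For concreteness, a stripped-down sketch of the serializer (heavily
>> simplified; the schema and field names here are illustrative, not our
>> real ones, and the EventSerializer.Builder is omitted):
>>
>> import java.io.IOException;
>> import java.io.OutputStream;
>>
>> import org.apache.avro.Schema;
>> import org.apache.avro.generic.GenericData;
>> import org.apache.avro.generic.GenericRecord;
>> import org.apache.flume.Event;
>> import org.apache.flume.serialization.AbstractAvroEventSerializer;
>>
>> import com.fasterxml.jackson.databind.JsonNode;
>> import com.fasterxml.jackson.databind.ObjectMapper;
>>
>> public class JsonToAvroEventSerializer
>>     extends AbstractAvroEventSerializer<GenericRecord> {
>>
>>   private static final Schema SCHEMA = new Schema.Parser().parse(
>>       "{\"type\":\"record\",\"name\":\"LogEvent\",\"fields\":["
>>       + "{\"name\":\"host\",\"type\":\"string\"},"
>>       + "{\"name\":\"port\",\"type\":\"int\"},"
>>       + "{\"name\":\"receivedAt\",\"type\":\"long\"}]}");
>>
>>   private final ObjectMapper mapper = new ObjectMapper();
>>   private final OutputStream out;
>>
>>   public JsonToAvroEventSerializer(OutputStream out) {
>>     this.out = out;
>>   }
>>
>>   @Override
>>   protected Schema getSchema() {
>>     return SCHEMA;
>>   }
>>
>>   @Override
>>   protected OutputStream getOutputStream() {
>>     return out;
>>   }
>>
>>   @Override
>>   protected GenericRecord convert(Event event) {
>>     try {
>>       JsonNode json = mapper.readTree(event.getBody());
>>       JsonNode port = json.get("port");
>>       if (port == null || !port.isInt()) {
>>         // e.g. ASCII text where an integer should be
>>         throw new IOException("bad port field: " + port);
>>       }
>>       GenericRecord record = new GenericData.Record(SCHEMA);
>>       record.put("host", json.path("host").asText());
>>       record.put("port", port.intValue());
>>       // "synthetic" field that doesn't exist in the raw JSON
>>       record.put("receivedAt", System.currentTimeMillis());
>>       return record;
>>     } catch (IOException e) {
>>       // malformed input lands here -- which is exactly the question below
>>       return null;
>>     }
>>   }
>> }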
>>
>> My question concerns how to handle malformed JSON data (or really any
>> error inside the custom EventSerializer).  It's very likely that as we
>> parse the JSON there will be records where something is malformed (either
>> the JSON itself is invalid, or a field is of the wrong type, etc.).
>>
>> For example, a "port" field which should always be an Integer might for
>> some reason have some ASCII text in it.  I'd like to catch these errors in
>> the EventSerializer and then write out the bad JSON to a log file somewhere
>> that we can monitor.
>>
>
> Yeah, it would be nice to have a better story about this in Flume.
>
>
>>
>> What is the best way to do this?
>>
>
> Typically people will either log it to a file or send it through another
> "flow" to a different HDFS sink.
>
>
>
>> Right now, all the logic for catching bad JSON would be inside of the
>> "convert" function of the EventSerializer.  Should the convert function
>> itself throw an exception that will be gracefully handled upstream
>>
>
> The exception will be logged, but that is it.
>
>
>> or do I just return a "null" value if there was an error?  Would it be
>> appropriate to log errors directly to a database from inside the
>> EventSerializer convert method or would this be too slow?
>>
>
> That might be too slow to do directly. If I did that, I'd have a separate
> thread doing the database writes and an in-memory queue between the
> serializer and that thread.
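>
> Sketching that out (untested; all the names here are invented):
>
> import java.util.concurrent.BlockingQueue;
> import java.util.concurrent.LinkedBlockingQueue;
>
> public class AsyncErrorLogger {
>
>   // bounded, so a flood of bad data can't exhaust the heap
>   private final BlockingQueue<byte[]> badEvents =
>       new LinkedBlockingQueue<byte[]>(10000);
>
>   public AsyncErrorLogger() {
>     Thread drainer = new Thread(new Runnable() {
>       public void run() {
>         try {
>           while (true) {
>             writeToDatabase(badEvents.take());  // blocks until work arrives
>           }
>         } catch (InterruptedException e) {
>           Thread.currentThread().interrupt();   // allow clean shutdown
>         }
>       }
>     }, "bad-event-db-writer");
>     drainer.setDaemon(true);
>     drainer.start();
>   }
>
>   // called from convert(); offer() never blocks, so the serializer's hot
>   // path drops error records on overflow instead of stalling the sink
>   public void submit(byte[] body) {
>     badEvents.offer(body);
>   }
>
>   private void writeToDatabase(byte[] body) {
>     // your JDBC insert goes here -- omitted
>   }
> }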
>
>
>> What are the best practices for this type of error handling?
>>
>
> It looks to me like we'd need to change AbstractAvroEventSerializer to
> filter out nulls:
>
>
> https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/serialization/AbstractAvroEventSerializer.java#L106
>
> which we could easily do.  Since you don't want to wait for that, you
> could override the write method to do this.
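>
> The override would be something like this (a sketch; it assumes your
> convert() returns null for malformed input, and since the parent's
> DataFileWriter isn't exposed to subclasses, it pays the cost of calling
> convert() twice for good events):
>
> @Override
> public void write(Event event) throws IOException {
>   if (convert(event) == null) {
>     // log the bad event or hand it to an error queue here, then skip it
>     return;
>   }
>   super.write(event);  // converts again and appends the valid record
> }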
>
>
>>
>> Thank you for any assistance!
>>
>> Best Regards,
>>
>> Ed
>>
>
>
>
> --
> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>
