flume-user mailing list archives

From Jimmy <jimmyj...@gmail.com>
Subject Re: Handling malformed data when using custom AvroEventSerializer and HDFS Sink
Date Thu, 02 Jan 2014 16:10:01 GMT
We are doing something similar to what Brock mentioned: a simple interceptor
for JSON validation that updates a custom field in the event header. The Flume
HDFS sink then pushes the data to a good/bad target directory based on this
custom field, and a separate process watches the bad directory.
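
A validating interceptor along these lines might look like the sketch below. This is illustrative only: the class name, the header key ("valid"), and the choice of Jackson for parsing are all assumptions on my part, and it needs flume-ng-core and Jackson on the classpath.

```java
import java.io.IOException;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical interceptor: tags each event with a "valid" header instead of
// dropping it, so a channel selector can route good and bad events separately.
public class JsonValidatingInterceptor implements Interceptor {

    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public void initialize() {}

    @Override
    public Event intercept(Event event) {
        try {
            mapper.readTree(event.getBody());   // throws on malformed JSON
            event.getHeaders().put("valid", "true");
        } catch (IOException e) {
            event.getHeaders().put("valid", "false");
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() {}

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new JsonValidatingInterceptor();
        }

        @Override
        public void configure(Context context) {}
    }
}
```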

You could add notification to the Flume flow; we wanted to keep it very
simple.
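
The routing half of that setup can be done with a multiplexing channel selector. A sketch, where the agent, channel, and interceptor class names are all hypothetical:

```properties
# Sketch only: component names and paths are invented for illustration.
agent.sources = http
agent.channels = goodCh badCh
agent.sinks = goodSink badSink

agent.channels.goodCh.type = memory
agent.channels.badCh.type = memory

agent.sources.http.type = http
agent.sources.http.port = 8080
agent.sources.http.channels = goodCh badCh

# The interceptor validates JSON and sets a "valid" header on each event
agent.sources.http.interceptors = jsonCheck
agent.sources.http.interceptors.jsonCheck.type = com.example.JsonValidatingInterceptor$Builder

# Route on that header: valid events to one channel, everything else to the other
agent.sources.http.selector.type = multiplexing
agent.sources.http.selector.header = valid
agent.sources.http.selector.mapping.true = goodCh
agent.sources.http.selector.default = badCh

agent.sinks.goodSink.type = hdfs
agent.sinks.goodSink.channel = goodCh
agent.sinks.goodSink.hdfs.path = /flume/events/good

agent.sinks.badSink.type = hdfs
agent.sinks.badSink.channel = badCh
agent.sinks.badSink.hdfs.path = /flume/events/bad
```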

---------- Forwarded message ----------
From: Devin Suiter RDX <dsuiter@rdx.com>
Date: Thu, Jan 2, 2014 at 7:40 AM
Subject: Re: Handling malformed data when using custom AvroEventSerializer
and HDFS Sink
To: user@flume.apache.org


Just throwing this out there, since I haven't had time to dig into the API
with a big fork, but can morphlines offer any assistance here?

Some kind of interceptor that would parse for malformed data, package the
offending data and send it somewhere (email it, log it), and then project a
valid "there was something wrong here" piece of data into the field, then
allow your channel to carry on? Or skip the projection piece and just move
along? I was thinking that projecting known data into a field that previously
held malformed data would let you easily locate those records later, while
keeping your data shape consistent.

Kind of looking to Brock as a sounding board as to the appropriateness of this
as a potential solution, since morphlines takes some time to really understand
well...
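
For what it's worth, a morphline for this might be built from the tryRules / readJson / logWarn / dropRecord commands in kite-morphlines. The sketch below is a loose, untested illustration of the shape, not working config:

```
morphlines : [
  {
    id : validateJson
    importCommands : ["org.kitesdk.**"]
    commands : [
      { tryRules {
          rules : [
            # Rule 1: record parses as JSON -- pass it through untouched
            { commands : [ { readJson {} } ] }
            # Rule 2: fallback for malformed input -- log it and drop it
            { commands : [
                { logWarn { format : "malformed record: {}", args : ["@{}"] } }
                { dropRecord {} }
            ] }
          ]
      } }
    ]
  }
]
```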

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Thu, Jan 2, 2014 at 10:25 AM, Brock Noland <brock@cloudera.com> wrote:

>
> On Tue, Dec 31, 2013 at 8:34 PM, ed <edorsey@gmail.com> wrote:
>
>> Hello,
>>
>> We are using Flume v1.4 to load JSON formatted log data into HDFS as
>> Avro.  Our flume setup looks like this:
>>
>> NXLog ==> (FlumeHTTPSource -> HDFSSink w/ custom EventSerializer)
>>
>> Right now our custom EventSerializer (which extends
>> AbstractAvroEventSerializer) takes the JSON input from the HTTPSource and
>> converts it into an avro record of the appropriate type for the incoming
>> log file.  This is working great and we use the serializer to add some
>> additional "synthetic" fields to the avro record that don't exist in the
>> original JSON log data.
>>
>> My question concerns how to handle malformed JSON data (or really any
>> error inside of the custom EventSerializer).  It's very likely that as we
>> parse the JSON there will be records where something is malformed (either
>> the JSON itself, or a field is of the wrong type etc.).
>>
>> For example, a "port" field which should always be an Integer might for
>> some reason have some ASCII text in it.  I'd like to catch these errors in
>> the EventSerializer and then write out the bad JSON to a log file somewhere
>> that we can monitor.
>>
>
> Yeah it would be nice to have a better story about this in Flume.
>
>
>>
>> What is the best way to do this?
>>
>
> Typically people will either log it to a file or send it through another
> "flow" to a different HDFS sink.
>
>
>
>> Right now, all the logic for catching bad JSON would be inside of the
>> "convert" function of the EventSerializer.  Should the convert function
>> itself throw an exception that will be gracefully handled upstream
>>
>
> The exception will be logged, but that is it.
>
>
>> or do I just return a "null" value if there was an error?  Would it be
>> appropriate to log errors directly to a database from inside the
>> EventSerializer convert method or would this be too slow?
>>
>
> That might be too slow to do directly. If I did that I'd have a separate
> thread doing that and then an in-memory queue between the serializer and
> thread.
>
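
[That queue-and-thread pattern is plain java.util.concurrent; a minimal, self-contained stand-in (all names invented, with an in-memory list standing in for the slow store) might look like:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Stand-in for "the serializer hands bad records to a background writer via an
// in-memory queue". The sink list represents the slow store (log file or DB).
class BadRecordSpool {

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(10000);
    private final List<String> sink = new ArrayList<>();
    private final Thread writer;
    private volatile boolean running = true;

    BadRecordSpool() {
        writer = new Thread(() -> {
            // Drain until shutdown is requested AND the queue is empty
            while (running || !queue.isEmpty()) {
                try {
                    String record = queue.poll(100, TimeUnit.MILLISECONDS);
                    if (record != null) {
                        synchronized (sink) {
                            sink.add(record);   // real code: append to file/DB
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
        writer.start();
    }

    /** Called from the serializer on a bad record; returns false (drops) if full. */
    boolean offer(String badRecord) {
        return queue.offer(badRecord);
    }

    void shutdown() {
        running = false;
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    List<String> drained() {
        synchronized (sink) {
            return new ArrayList<>(sink);
        }
    }
}
```

offer() never blocks the Flume delivery thread, and the bounded queue means a misbehaving store drops bad-record copies rather than stalling the sink. -ed.]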
>
>> What are the best practices for this type of error handling?
>>
>
> It looks to me like we'd need to change AbstractAvroEventSerializer to
> filter out nulls:
>
>
> https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/serialization/AbstractAvroEventSerializer.java#L106
>
> which we could easily do. Since you don't want to wait for that, you could
> override the write method to do this.
>
>
>>
>> Thank you for any assistance!
>>
>> Best Regards,
>>
>> Ed
>>
>
>
>
> --
> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>
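
To make that last suggestion concrete, overriding write() in a subclass might look like the sketch below, assuming your serializer's convert() returns null on malformed input. The class names and logger are illustrative, and calling convert() here means it runs twice per event, so cache the result if conversion is expensive.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical subclass of your existing AbstractAvroEventSerializer subclass.
public class NullSkippingSerializer extends MyAvroEventSerializer {

    private static final Logger LOG =
        LoggerFactory.getLogger(NullSkippingSerializer.class);

    @Override
    public void write(Event event) throws IOException {
        if (convert(event) == null) {
            // Bad record: log it somewhere we can monitor, don't append it
            LOG.warn("Dropping malformed event: {}",
                new String(event.getBody(), StandardCharsets.UTF_8));
            return;
        }
        super.write(event);     // only null-free events reach the Avro writer
    }
}
```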
