flume-user mailing list archives

From Daniel Rodriguez <df.rodriguez...@gmail.com>
Subject Re: Directory with avro files to HDFS
Date Fri, 07 Feb 2014 15:19:50 GMT
Hi Ed,

Thanks for your response. I was afraid that the solution would be to write my own serializer;
I'm not the most expert Java programmer :P

But I think that is the only solution, after reading more in the docs:

This deserializer is able to read an Avro container file, and it generates one event per Avro
record in the file. Each event is annotated with a header that indicates the schema used.
The body of the event is the binary Avro record data, not including the schema or the rest
of the container file elements.
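
For reference, the setting that controls how the schema is attached lives on the spooling
directory source; a minimal sketch, using the property names from the Flume user guide and
the src1 name from my config below:

a1.sources.src1.deserializer = avro
a1.sources.src1.deserializer.schemaType = LITERAL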

So I tested using deserializer.schemaType = LITERAL, and I can see a JSON header with the schema,
and in the body I can see the binary data of the values. So I think it should be "easy"
to write a serializer based on an example I found: https://github.com/brockn/avro-flume-hive-example/blob/master/src/main/java/com/cloudera/flume/serialization/FlumeEventStringBodyAvroEventSerializer.java
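
As a starting point, here is a rough, untested sketch of what I think such a serializer
could look like, modeled on that example. MyRecord, the field mapping, and
com.yourpackagename are placeholders, not my real schema:

import java.io.OutputStream;

import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.serialization.AbstractAvroEventSerializer;
import org.apache.flume.serialization.EventSerializer;

public class CustomAvroEventSerializer
    extends AbstractAvroEventSerializer<CustomAvroEventSerializer.MyRecord> {

  // Schema derived via reflection from the placeholder record class below.
  private static final Schema SCHEMA = ReflectData.get().getSchema(MyRecord.class);

  private final OutputStream out;

  private CustomAvroEventSerializer(OutputStream out) {
    this.out = out;
  }

  @Override
  protected OutputStream getOutputStream() {
    return out;
  }

  @Override
  protected Schema getSchema() {
    return SCHEMA;
  }

  @Override
  protected MyRecord convert(Event event) {
    // Placeholder mapping: a real implementation would parse the event body
    // (the binary Avro record) into the fields of the target schema.
    MyRecord record = new MyRecord();
    record.body = new String(event.getBody());
    return record;
  }

  // Placeholder record type; replace with a class matching the real schema.
  public static class MyRecord {
    public String body;
  }

  // The HDFS sink instantiates the serializer through this builder.
  public static class Builder implements EventSerializer.Builder {
    @Override
    public EventSerializer build(Context context, OutputStream out) {
      CustomAvroEventSerializer serializer = new CustomAvroEventSerializer(out);
      serializer.configure(context);
      return serializer;
    }
  }
}

It would then be wired up on the sink as Ed suggests below:

a1.sinks.sink1.serializer = com.yourpackagename.CustomAvroEventSerializer$Builder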

I was hoping that a general Avro serializer already existed, since there is a corresponding
deserializer, which I am using in the SpoolDir source.
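
Also, apart from checking in Hue, I believe the avro-tools jar can tell whether the HDFS
output is a valid container file; something like this (the jar version and file name are
just examples):

hadoop fs -get {hdfs dir}/FlumeData.1234567890.avro .
java -jar avro-tools-1.7.5.jar getschema FlumeData.1234567890.avro
java -jar avro-tools-1.7.5.jar tojson FlumeData.1234567890.avro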

I will post if I come up with a solution.

Thanks

On Feb 6, 2014, at 9:10 PM, ed <edorsey@gmail.com> wrote:

> Hi Daniel,
> 
> I think you will need to write a custom event serializer for the HDFSSink that extends
> AbstractAvroEventSerializer to write out your data using your specific Avro schema. Then
> in your agent configuration add it like this:
> 
> a1.sinks.sink1.serializer = com.yourpackagename.CustomAvroEventSerializer$Builder
> 
> As a quick test you can use the default avro serializer (https://flume.apache.org/FlumeUserGuide.html#avro-event-serializer)
> like so:
> 
> a1.sinks.sink1.serializer = avro_event
> 
> I think this will end up just wrapping your avro data in Flume's default schema, but at
> least you can see if valid avro files are getting written to HDFS. Hope that gets you a
> little closer.
> 
> Best,
> 
> Ed
> 
> 
> On Fri, Feb 7, 2014 at 11:51 AM, Daniel Rodriguez <df.rodriguez143@gmail.com> wrote:
> Hi all,
> 
> I have users writing Avro files on different servers, and I want to use Flume to move all
> those files into HDFS, so I can later use Hive or Pig to query/analyse the data.
> 
> On the client I installed Flume and have a SpoolDir source and Avro sink like this:
> 
> 
> a1.sources = src1
> a1.sinks = sink1
> a1.channels = c1
> 
> a1.channels.c1.type = memory
> 
> a1.sources.src1.type = spooldir
> a1.sources.src1.channels = c1
> a1.sources.src1.spoolDir = {directory}
> a1.sources.src1.fileHeader = true
> a1.sources.src1.deserializer = avro
> 
> a1.sinks.sink1.type = avro
> a1.sinks.sink1.channel = c1
> a1.sinks.sink1.hostname = {IP}
> a1.sinks.sink1.port = 41414
> 
> On the Hadoop cluster I have this Avro source and HDFS sink:
> 
> 
> a1.sources = avro1
> a1.sinks = sink1
> a1.channels = c1
> 
> a1.channels.c1.type = memory
> 
> a1.sources.avro1.type = avro
> a1.sources.avro1.channels = c1
> a1.sources.avro1.bind = 0.0.0.0
> a1.sources.avro1.port = 41414
> 
> a1.sinks.sink1.type = hdfs
> a1.sinks.sink1.channel = c1
> a1.sinks.sink1.hdfs.path = {hdfs dir}
> a1.sinks.sink1.hdfs.fileSuffix = .avro
> a1.sinks.sink1.hdfs.rollSize = 67108864
> a1.sinks.sink1.hdfs.fileType = DataStream
> 
> The problem is that the files on HDFS are not valid Avro files! I am using the Hue UI
> to check whether a file is a valid Avro file or not. If I upload an Avro file that I
> generate on my PC to the cluster, I can see its contents perfectly and can even create a
> Hive table and query it, but the files I send via Flume are not valid Avro files.
> 
> I tried the Flume avro client that is included with Flume, but that didn't work because
> it sends a Flume event per line, breaking the Avro files; I fixed that by using the
> spooldir source with deserializer = avro. So I think the problem is in the HDFS sink
> when it is writing the files.
> 
> With hdfs.fileType = DataStream it writes the values from the Avro fields, not the whole
> Avro file, losing all the schema information. If I use hdfs.fileType = SequenceFile, the
> files are not valid for some reason.
> 
> I appreciate any help.
> 
> Thanks,
> 
> Daniel
> 
> 

