flume-user mailing list archives

From Daniel Rodriguez <df.rodriguez...@gmail.com>
Subject Directory with avro files to HDFS
Date Fri, 07 Feb 2014 02:51:33 GMT
Hi all,

I have users writing Avro files on different servers, and I want to use Flume
to move all those files into HDFS so that I can later use Hive or Pig to
query/analyse the data.

On the client I installed Flume and configured a spooldir source and an Avro
sink like this:

a1.sources = src1
a1.sinks = sink1
a1.channels = c1

a1.channels.c1.type = memory

a1.sources.src1.type = spooldir
a1.sources.src1.channels = c1
a1.sources.src1.spoolDir = {directory}
a1.sources.src1.fileHeader = true
a1.sources.src1.deserializer = avro

a1.sinks.sink1.type = avro
a1.sinks.sink1.channel = c1
a1.sinks.sink1.hostname = {IP}
a1.sinks.sink1.port = 41414

On the Hadoop cluster I have this Avro source and HDFS sink:

a1.sources = avro1
a1.sinks = sink1
a1.channels = c1

a1.channels.c1.type = memory

a1.sources.avro1.type = avro
a1.sources.avro1.channels = c1
a1.sources.avro1.bind =
a1.sources.avro1.port = 41414

a1.sinks.sink1.type = hdfs
a1.sinks.sink1.channel = c1
a1.sinks.sink1.hdfs.path = {hdfs dir}
a1.sinks.sink1.hdfs.fileSuffix = .avro
a1.sinks.sink1.hdfs.rollSize = 67108864
a1.sinks.sink1.hdfs.fileType = DataStream

The problem is that the files on HDFS are not valid Avro files! I am using
the Hue UI to check whether a file is a valid Avro file or not. If I upload
an Avro file that I generate on my PC to the cluster, I can see its contents
perfectly, and even create a Hive table and query it, but the files I send
via Flume are not valid Avro files.
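
For a quick check outside Hue, here is a minimal sketch (not part of my Flume
setup; the function name and the file path are hypothetical). It relies on
the fact that an Avro object container file begins with the four bytes 'O',
'b', 'j', 0x01. It only inspects that header, so a file with a good header
but damaged data blocks would still pass.

```python
# Avro object container files start with this 4-byte magic marker.
AVRO_MAGIC = b"Obj\x01"


def looks_like_avro_container(path):
    """Return True if the file at `path` starts with the Avro magic bytes.

    This is only a header check; it does not validate the schema or blocks.
    """
    with open(path, "rb") as f:
        return f.read(len(AVRO_MAGIC)) == AVRO_MAGIC


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python check_avro.py local_copy.avro
    for name in sys.argv[1:]:
        status = "avro header" if looks_like_avro_container(name) else "no avro header"
        print(f"{name}: {status}")
```

The file has to be local, so a file written by the HDFS sink would first be
copied out of the cluster, for example with hdfs dfs -get.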

I tried the avro-client that is included in Flume, but it didn't work
because it sends one Flume event per line, which breaks the Avro files, so I
fixed that by using the spooldir source with deserializer = avro. So I think
the problem is in the HDFS sink when it writes the files.

With hdfs.fileType = DataStream it writes the values from the Avro fields,
not the whole Avro file, losing all the schema information. If I use
hdfs.fileType = SequenceFile the files are not valid for some reason.

I would appreciate any help.


