flume-user mailing list archives

From DSuiter RDX <dsui...@rdx.com>
Subject Problem aggregating syslogTCP > avro > HDFS
Date Mon, 07 Oct 2013 16:00:14 GMT

Hi, this may be a problem with our understanding, or my configuration.

I am trying to take data from rsyslog via remote forwarding over TCP into a
syslogTCP source, pass it to an Avro sink, connect the Avro sink to an
Avro source, and then write it out through an HDFS sink.

Everything is connected and the data is flowing from the remote source into
HDFS in an avro container, so that is not the problem.

The problem is that the HDFS sink is closing files when they are very small,
only KBs in size, even though I have the hdfs.rollInterval and
hdfs.rollCount properties set to 0. I set the hdfs.rollSize property to
3072, intending 3 MB. I expected it to aggregate the events into larger
files before closing them. Is this happening because the escape sequences
in hdfs.path force writes into new directories and so create new files
prematurely?
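
For reference, the Flume User Guide documents hdfs.rollSize as a file size
in bytes. The sketch below only works through that unit arithmetic for the
value used here; the helper name is hypothetical, and reading "3MB" as
3 MiB (3 x 1024 x 1024 bytes) is an assumption.

```python
# Sketch only: unit arithmetic for hdfs.rollSize, which the Flume User Guide
# documents as a size in bytes. The helper name below is hypothetical.

BYTES_PER_KIB = 1024
BYTES_PER_MIB = 1024 * 1024


def mib_to_bytes(mib: int) -> int:
    """Convert a size in mebibytes to bytes."""
    return mib * BYTES_PER_MIB


configured_roll_size = 3072            # value taken from the agent config
intended_roll_size = mib_to_bytes(3)   # assumption: "3MB" means 3 MiB

print(configured_roll_size / BYTES_PER_KIB)  # configured value expressed in KiB
print(intended_roll_size)                    # byte count for a 3 MiB roll size
```

Read as bytes, 3072 is 3 KiB, which would be consistent with files that
close at only a few KB; a 3 MiB threshold would be 3145728.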

Here are my agent configs:

syslogTCP Source > Avro Sink (first tier, pretty sure everything is ok here
but maybe not)

####RT Listener Agent####
rtlv1.sources=srclv1
rtlv1.sinks=snklv1
rtlv1.channels=chnlv1

#sources
rtlv1.sources.srclv1.type=syslogtcp
rtlv1.sources.srclv1.host=192.168.1.2
rtlv1.sources.srclv1.port=5140
rtlv1.sources.srclv1.channels=chnlv1

#channels
rtlv1.channels.chnlv1.type=memory
rtlv1.channels.chnlv1.capacity=1500
rtlv1.channels.chnlv1.transactionCapacity=1500

#sinks
rtlv1.sinks.snklv1.type=avro
rtlv1.sinks.snklv1.hostname=192.168.1.2
rtlv1.sinks.snklv1.port=5141
rtlv1.sinks.snklv1.batch-size=1500
rtlv1.sinks.snklv1.channel=chnlv1

Avro Source > HDFS (second tier)

####RT Aggregate Writer Agent####
rtlv2.sources=srclv2
rtlv2.sinks=snklv2
rtlv2.channels=chnlv2

#sources
rtlv2.sources.srclv2.type=avro
rtlv2.sources.srclv2.bind=192.168.1.2
rtlv2.sources.srclv2.port=5141
rtlv2.sources.srclv2.channels=chnlv2

#channels
rtlv2.channels.chnlv2.type=memory
rtlv2.channels.chnlv2.capacity=1500
rtlv2.channels.chnlv2.transactionCapacity=1500

#sinks
rtlv2.sinks.snklv2.type=hdfs
rtlv2.sinks.snklv2.channel=chnlv2
rtlv2.sinks.snklv2.hdfs.path=/user/flume/avro/%y-%m-%d/%H%M
rtlv2.sinks.snklv2.hdfs.fileSuffix=.avro
rtlv2.sinks.snklv2.serializer=avro_event
rtlv2.sinks.snklv2.hdfs.fileType=DataStream
rtlv2.sinks.snklv2.hdfs.rollInterval=0
rtlv2.sinks.snklv2.hdfs.rollSize=3072
rtlv2.sinks.snklv2.hdfs.batchSize=1500
rtlv2.sinks.snklv2.hdfs.rollCount=0
rtlv2.sinks.snklv2.hdfs.round=true
rtlv2.sinks.snklv2.hdfs.roundValue=10
rtlv2.sinks.snklv2.hdfs.roundUnit=minute
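
To make the directory question concrete, here is a rough sketch of how the
round, roundValue and roundUnit settings above interact with the
%y-%m-%d/%H%M escape sequences in hdfs.path. It follows the documented
behaviour (the event timestamp is rounded down to a multiple of roundValue)
and is not Flume's actual code; the function name is hypothetical.

```python
# Sketch only: approximates the documented rounding of time-based escape
# sequences in hdfs.path. Not Flume's implementation; names are hypothetical.
from datetime import datetime

PATH_PATTERN = "/user/flume/avro/%y-%m-%d/%H%M"  # hdfs.path from the config


def rounded_bucket_path(event_time: datetime, round_minutes: int = 10) -> str:
    """Round the timestamp down to a multiple of round_minutes, then format."""
    floored = event_time.replace(
        minute=event_time.minute - event_time.minute % round_minutes,
        second=0,
        microsecond=0,
    )
    return floored.strftime(PATH_PATTERN)


# Events nine minutes apart share a directory; one second later they do not.
print(rounded_bucket_path(datetime(2013, 10, 7, 16, 0, 14)))
print(rounded_bucket_path(datetime(2013, 10, 7, 16, 9, 59)))
print(rounded_bucket_path(datetime(2013, 10, 7, 16, 10, 0)))
```

Under these assumptions, events whose timestamps fall in different
ten-minute windows resolve to different directories and therefore to
different files, whatever the roll settings are.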

Thanks!
*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
