flume-dev mailing list archives

From "Alberto Sarubbi (JIRA)" <j...@apache.org>
Subject [jira] [Created] (FLUME-2967) Corrupted gzip files generated when writing to S3
Date Fri, 05 Aug 2016 15:00:23 GMT
Alberto Sarubbi created FLUME-2967:
--------------------------------------

             Summary: Corrupted gzip files generated when writing to S3
                 Key: FLUME-2967
                 URL: https://issues.apache.org/jira/browse/FLUME-2967
             Project: Flume
          Issue Type: Question
          Components: Sinks+Sources
    Affects Versions: v1.6.0
         Environment: Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)

Amazon Linux AMI release 2016.03
4.1.17-22.30.amzn1.x86_64
            Reporter: Alberto Sarubbi


A Flume agent configured with the following parameters writes corrupt gzip files to AWS S3.

h4. Configuration
{noformat}
#### SINKS ####
#sink to write to S3
a1.sinks.khdfs.type = hdfs
a1.sinks.khdfs.hdfs.path = s3n://AKIAJNPLYD4CT4MCXNDA:MTlCKjdW3CiQ8PrKKDXwLIQaZVgLYM9OzmwTSJ1t@logs.tigo.com/useractivity/%Y/%m/%d/p6-v2/
a1.sinks.khdfs.hdfs.fileType = CompressedStream
a1.sinks.khdfs.hdfs.codeC = gzip
a1.sinks.khdfs.hdfs.filePrefix = useractivity
a1.sinks.khdfs.hdfs.fileSuffix = .json.gz
a1.sinks.khdfs.hdfs.writeFormat = Writable
a1.sinks.khdfs.hdfs.rollCount = 100
a1.sinks.khdfs.hdfs.rollSize = 0
a1.sinks.khdfs.hdfs.callTimeout = 120000
a1.sinks.khdfs.hdfs.batchSize = 1000
a1.sinks.khdfs.hdfs.threadsPoolSize = 40
a1.sinks.khdfs.hdfs.rollTimerPoolSize = 1
a1.sinks.khdfs.channel = chdfs
{noformat}
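To inspect what this sink actually produces, one of the rolled objects can be pulled down for local inspection, for example with the boto3 sketch below. This is only an illustration: the bucket name comes from hdfs.path above, but the object key is an example and depends on the escaped date path and the roll timestamp.

{code:python}
import boto3

# Sketch only: download one rolled object for local inspection.
# Bucket taken from hdfs.path; the key below is just an example.
# Credentials are expected in the environment, not embedded in a URL.
s3 = boto3.client("s3")
s3.download_file(
    "logs.tigo.com",
    "useractivity/2016/08/05/p6-v2/useractivity.1470407170478.json.gz",
    "useractivity.1470407170478.json.gz",
)
{code}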

The input is a simple JSON structure:
{code:javascript}
{
  "origin": "Mi Tigo App sv",
  "date": "2016-08-05T14:26:10.859Z",
  "country": "SV",
  "action": "MI-TIGO-APP Header Enrichment",
  "msisdn": "76821107",
  "ip": "181.189.178.89",
  "useragent": "Mi Tigo  samsung zerofltedv SM-G920I 5.1.1 22 V: 31 (1.503.0.73)",
  "data": {
    "variables": "{\"!msisdn\":\"76821107\"}"
  },
  "event_id": "mta_login"
}
{code}
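For a local reference (just a sketch, not part of the setup): a file of 100 events shaped like the sample above can be generated and gzipped with plain Python, and its size and readability compared against the objects the sink writes to S3.

{code:python}
import gzip
import json

# Sketch only: build a reference file of 100 events shaped like the sample
# above, to compare size and readability with what lands on S3.
event = {
    "origin": "Mi Tigo App sv",
    "date": "2016-08-05T14:26:10.859Z",
    "country": "SV",
    "action": "MI-TIGO-APP Header Enrichment",
    "msisdn": "76821107",
    "ip": "181.189.178.89",
    "useragent": "Mi Tigo  samsung zerofltedv SM-G920I 5.1.1 22 V: 31 (1.503.0.73)",
    "data": {"variables": "{\"!msisdn\":\"76821107\"}"},
    "event_id": "mta_login",
}

with gzip.open("reference.json.gz", "wt") as out:
    for _ in range(100):
        out.write(json.dumps(event) + "\n")
{code}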

I use the HDFS sink in combination with the following libraries in the plugins.d/hdfs/libext folder:

{noformat}
  hdfs group: 'com.amazonaws', name: 'aws-java-sdk-s3', version: '1.10.72'
  hdfs group: 'org.apache.hadoop', name: 'hadoop-common', version: '2.5.2'
  hdfs group: 'org.apache.hadoop', name: 'hadoop-hdfs', version: '2.5.2'
  hdfs group: 'org.apache.hadoop', name: 'hadoop-annotations', version: '2.5.2'
  hdfs group: 'org.apache.hadoop', name: 'hadoop-auth', version: '2.5.2'
  hdfs group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-core', version: '2.5.2'
  hdfs group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-common', version: '2.5.2'
  hdfs group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-jobclient', version: '2.5.2'
  hdfs group: 'commons-configuration', name: 'commons-configuration', version: '1.10'
  hdfs group: 'net.java.dev.jets3t', name: 'jets3t', version: '0.9.4'
  hdfs group: 'org.apache.httpcomponents', name: 'httpclient', version: '4.5.2'
  hdfs group: 'org.apache.httpcomponents', name: 'httpcore', version: '4.4.5'
{noformat}

I expect a gzip-compressed file containing 100 events to land on S3, but the generated file is damaged:
* the compressed file is larger than the content it holds
* most tools fail to decompress the file, reporting that it is damaged
* gzip -d does decompress it, though not without complaining about trailing garbage:
{noformat}
gzip -d useractivity.1470407170478.json.gz 
gzip: useractivity.1470407170478.json.gz: decompression OK, trailing garbage ignored
{noformat}

* last but not least, the file produced by the forced decompression contains only one or two lines, where 100 are expected (the sketch after this list tries to recover whatever is actually in the file).
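
Because gzip reports trailing garbage, it seemed worth checking whether the rolled object actually holds several gzip members (or partial members) back to back. The following Python sketch is purely diagnostic and assumes a local copy of the object (e.g. downloaded as in the snippet above): it scans for gzip magic bytes and tries to decompress each candidate member.

{code:python}
import zlib

# Local copy of the rolled object named in the gzip output above.
PATH = "useractivity.1470407170478.json.gz"

data = open(PATH, "rb").read()

# Offsets that look like the start of a gzip member (magic bytes 1f 8b 08).
starts = [i for i in range(len(data) - 2) if data[i:i + 3] == b"\x1f\x8b\x08"]
print("candidate gzip members at offsets:", starts)

total_lines = 0
for off in starts:
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect a gzip wrapper
    try:
        payload = d.decompress(data[off:])
    except zlib.error as err:
        print("offset %d: not decompressable (%s)" % (off, err))
        continue
    lines = [ln for ln in payload.splitlines() if ln.strip()]
    total_lines += len(lines)
    print("offset %d: %d bytes -> %d line(s)" % (off, len(payload), len(lines)))

print("recoverable lines in total:", total_lines)
{code}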

h4. What we tried (to no avail):
* both Writable and Text file types
* all options for controlling file rolling: time, event count, size
* all combinations of recipes for writing to S3, including more than one set of libraries
* both schemes (s3n, s3a)
* not compressing; this generates the expected JSON files just fine (a post-compression workaround is sketched at the end of this report)
* vanilla flume libraries
* heavily replacing the Flume libraries with newer or different versions (just in case)
* reading all available documentation

h4. What we haven't tried:
* installing Hadoop and referencing its libraries on the classpath (we want to avoid this, since we are not running Hadoop on the Flume nodes)
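
Since uncompressed output is correct, one interim idea we have not deployed (only a sketch, assuming the plain .json files can be pulled down or landed locally first) is to gzip them in a separate step once the sink is done with them:

{code:python}
import gzip
import shutil

# Sketch only: compress a plain .json file produced by an uncompressed sink
# in a separate step. File names are hypothetical.
SRC = "useractivity.1470407170478.json"
DST = SRC + ".gz"

with open(SRC, "rb") as plain, gzip.open(DST, "wb") as packed:
    shutil.copyfileobj(plain, packed)
{code}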





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
