drill-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5871) Large files fail to write to s3 datastore using hdfs s3a.
Date Fri, 13 Oct 2017 18:50:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16204023#comment-16204023 ]

Steve Loughran commented on DRILL-5871:

Upgrade to Hadoop 2.8.2 and this will go away. Interesting that you are the only other person
to have reported it, though. I can see it and you can see it, but nobody else does, even though
we have scale test cases which push up gigabytes of data.

Otherwise, the stack trace implies you are on 2.7; move up to 2.8.1, then set {{fs.s3a.multipart.size}}
to, say, 4G and see if that is enough to suspend splitting for now. The {{fs.s3a.multipart.threshold}}
parameter is only used by the AWS transfer manager; the block output stream only uses
{{fs.s3a.multipart.size}}.
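
To see why a larger part size would sidestep the split, here is a rough sketch of the part-count arithmetic in Python. The helper name part_count is made up for illustration, and it assumes the stream simply cuts its output into fixed-size blocks of fs.s3a.multipart.size bytes; the real S3A code may differ in detail. The file size is also an assumption standing in for "just over 1GB".

```python
def part_count(file_bytes: int, part_bytes: int) -> int:
    """Number of fixed-size blocks needed to hold file_bytes (hypothetical helper)."""
    if part_bytes <= 0:
        raise ValueError("part size must be positive")
    # Ceiling division in integer arithmetic; an empty file still counts as one block.
    return max(1, -(-file_bytes // part_bytes))


GIB = 1024 ** 3
file_size = GIB + 1  # assumption: stands in for a file "just over 1GB"

# Reporter's current setting: fs.s3a.multipart.size = 134217728
parts_now = part_count(file_size, 134217728)

# Suggested setting: fs.s3a.multipart.size = 4G
parts_4g = part_count(file_size, 4 * GIB)

print(parts_now, parts_4g)
```

Under these assumptions the current setting cuts the file into 9 blocks, so the multipart completion call in the stack trace is reached, while a 4G part size leaves it as a single block.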

bq. The s3a driver does not appear to have an option to disable multi-part uploads altogether.

No, as it's the only way to upload a file > 5GB. You can set the minimum size for a block,
but multipart upload is needed above a certain point: there is a limit to the amount of data
which can be sent in a single HTTP PUT request.

> Large files fail to write to s3 datastore using hdfs s3a.
> ---------------------------------------------------------
>                 Key: DRILL-5871
>                 URL: https://issues.apache.org/jira/browse/DRILL-5871
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Other, Storage - Text & CSV
>    Affects Versions: 1.11.0
>         Environment: CentOS 7.4, Oracle Java SE 1.8.0_131-b11, x86_64, vmware. Zookeeper cluster, two drillbits, 3 zookeepers.
>            Reporter: Steve Jacobs
>              Labels: csv, hdfs, s3
> When storing CSV files to an S3a storage driver using a CTAS, if the files are large enough to trigger the multi-part upload functionality, the CTAS fails with the following stack trace (we can write smaller CSVs and parquet files with no problem):
> Error: SYSTEM ERROR: UnsupportedOperationException
> Fragment 0:0
> [Error Id: dbb018ea-29eb-4e1a-bf97-4c2c9cfbdf3c on den-certdrill-1.ci.neoninternal.org:31010]
>   (java.lang.UnsupportedOperationException) null
>     java.util.Collections$UnmodifiableList.sort():1331
>     java.util.Collections.sort():175
>     com.amazonaws.services.s3.model.transform.RequestXmlFactory.convertToXmlByteArray():42
>     com.amazonaws.services.s3.AmazonS3Client.completeMultipartUpload():2513
>     org.apache.hadoop.fs.s3a.S3AFastOutputStream$MultiPartUpload.complete():384
>     org.apache.hadoop.fs.s3a.S3AFastOutputStream.close():253
>     org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close():72
>     org.apache.hadoop.fs.FSDataOutputStream.close():106
>     java.io.PrintStream.close():360
>     org.apache.drill.exec.store.text.DrillTextRecordWriter.cleanup():170
>     org.apache.drill.exec.physical.impl.WriterRecordBatch.closeWriter():184
>     org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():128
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>     org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():133
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>     org.apache.drill.exec.physical.impl.BaseRootExec.next():105
>     org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
>     org.apache.drill.exec.physical.impl.BaseRootExec.next():95
>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():234
>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():227
>     java.security.AccessController.doPrivileged():-2
>     javax.security.auth.Subject.doAs():422
>     org.apache.hadoop.security.UserGroupInformation.doAs():1657
>     org.apache.drill.exec.work.fragment.FragmentExecutor.run():227
>     org.apache.drill.common.SelfCleaningRunnable.run():38
>     java.util.concurrent.ThreadPoolExecutor.runWorker():1142
>     java.util.concurrent.ThreadPoolExecutor$Worker.run():617
>     java.lang.Thread.run():748 (state=,code=0)
> This looks suspiciously like:
> https://issues.apache.org/jira/browse/HADOOP-14204
> So the fix may be as 'simple' as just syncing the upstream version when Hadoop 2.8.2 releases later this month, although I am ignorant of the implications of upgrading hadoop-hdfs to this version.
> We are able to store smaller files just fine.
> Things I've tried:
> Setting fs.s3a.multipart.threshold to a ridiculously large value like 10T (these files are just over 1GB): does not work.
> Setting fs.s3a.fast.upload to false: also does not change the behavior.
> The s3a driver does not appear to have an option to disable multi-part uploads altogether.

> For completeness' sake, here are my current S3a options for the driver:
>     "fs.s3a.endpoint": "******",
>     "fs.s3a.access.key": "*",
>     "fs.s3a.secret.key": "*",
>     "fs.s3a.connection.maximum": "200",
>     "fs.s3a.paging.maximum": "1000",
>     "fs.s3a.fast.upload": "true",
>     "fs.s3a.multipart.purge": "true",
>     "fs.s3a.fast.upload.buffer": "bytebuffer",
>     "fs.s3a.fast.upload.active.blocks": "8",
>     "fs.s3a.buffer.dir": "/opt/apache-airflow/buffer",
>     "fs.s3a.multipart.size": "134217728",
>     "fs.s3a.multipart.threshold": "671088640",
>     "fs.s3a.experimental.input.fadvise": "sequential",
>     "fs.s3a.acl.default": "PublicRead",
>     "fs.s3a.multiobjectdelete.enable": "true"
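
For reference, the two byte-valued options in the list above are easier to compare in binary units. A small Python sketch follows; the helper name human_size is made up for illustration and is not part of Hadoop or Drill.

```python
def human_size(n_bytes: int) -> str:
    """Render a byte count using binary units (hypothetical helper)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(n_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit we know about.
        if value < 1024 or unit == units[-1]:
            return f"{value:g} {unit}"
        value /= 1024


# Values copied from the option list above.
settings = {
    "fs.s3a.multipart.size": 134217728,
    "fs.s3a.multipart.threshold": 671088640,
}
for key, raw in settings.items():
    print(key, "=", human_size(raw))
```

The listed part size works out to 128 MiB and the threshold to 640 MiB, so a file just over 1GB is larger than both.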

This message was sent by Atlassian JIRA
