Subject: Fwd: Unable to create external table on large S3 bucket using S3A
From: Rik Heijdens
To: user@hive.apache.org
Date: Fri, 22 Jan 2016 13:38:03 -0500

Hi,

I am trying to create an external table on an S3 bucket, but I'm receiving the following error in the process:

hive> CREATE EXTERNAL TABLE ping_prod
    > PARTITIONED BY(day string)
    > ROW FORMAT SERDE
    > 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    > STORED AS INPUTFORMAT
    > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    > OUTPUTFORMAT
    > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    > LOCATION 's3a://path-to-bucket.com/data/'
    > TBLPROPERTIES (
    > 'avro.schema.url'='s3a://path-to-bucket.com/avro/schema.avsc');
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to sanitize XML document destined for handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK)

See https://gist.github.com/RikHeijdens/582371b5e6d24abc7471 for a complete stack trace.

This S3 bucket is very large (at least 800 TB) and contains about 400 directories of Avro-serialized data. Each directory contains one day's worth of data.

I suspected the failure might be due to the size of the bucket and the number of files in it, so I tried to create an external table on a subset (1 day) of the data. That worked fine and didn't cause any problems.

Is this a known issue, and why is it happening? I think it's an out-of-memory error. If that's the case, why would Hive need so much memory to create an external table? Also, are there any workarounds for this problem?

I'm running HDP-2.3.4.0-3485 and I am using the following Hive version:

[root@docker-ambari tmp]# hive --version
WARNING: Use "yarn jar" to launch YARN applications.
Hive 1.2.1.2.3.4.0-3485
Subversion git://c66-slave-20176e25-2/grid/0/jenkins/workspace/HDP-build-centos6/bigtop/build/hive/rpm/BUILD/hive-1.2.1.2.3.4.0 -r efb067075854961dfa41165d5802a62ae334a2db
Compiled by jenkins on Wed Dec 16 04:01:39 UTC 2015
From source with checksum 4ecc763ed826fd070121da702cbd17e9

Thanks,

Rik