Date: Wed, 2 Sep 2015 11:18:32 -0700
Subject: Re: ORC NPE while writing stats
From: "Owen O'Malley"
To: "user@hive.apache.org"
Cc: "dev@hive.apache.org"

I don't see how it would get there. That implies that minimum was null, but the count was non-zero.

The ColumnStatisticsImpl$StringStatisticsImpl.serialize looks like:

    @Override
    OrcProto.ColumnStatistics.Builder serialize() {
      OrcProto.ColumnStatistics.Builder result = super.serialize();
      OrcProto.StringStatistics.Builder str =
          OrcProto.StringStatistics.newBuilder();
      if (getNumberOfValues() != 0) {
        str.setMinimum(getMinimum());
        str.setMaximum(getMaximum());
        str.setSum(sum);
      }
      result.setStringStatistics(str);
      return result;
    }

and thus shouldn't call down to setMinimum unless it had at least some non-null values in the column. Do you have multiple threads working? There isn't anything that should be introducing non-determinism, so for the same input it would fail at the same point.

.. Owen

On Tue, Sep 1, 2015 at 10:51 PM, David Capwell wrote:

> We are writing ORC files in our application for Hive to consume.
> Given enough time, we have noticed that writing causes an NPE when
> working with a string column's stats. Not sure what's causing it on
> our side yet, since replaying the same data is just fine; it seems more
> like this just happens over time (different data sources will hit this
> around the same time in the same JVM).
>
> Here is the code in question, and below is the exception:
>
> final Writer writer = OrcFile.createWriter(path,
>     OrcFile.writerOptions(conf).inspector(oi));
> try {
>   for (Data row : rows) {
>     List struct = Orc.struct(row, inspector);
>     writer.addRow(struct);
>   }
> } finally {
>   writer.close();
> }
>
>
> Here is the exception:
>
> java.lang.NullPointerException: null
>   at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$Builder.setMinimum(OrcProto.java:1803) ~[hive-exec-0.14.0.jar:0.14.0]
>   at org.apache.hadoop.hive.ql.io.orc.ColumnStatisticsImpl$StringStatisticsImpl.serialize(ColumnStatisticsImpl.java:411) ~[hive-exec-0.14.0.jar:0.14.0]
>   at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.createRowIndexEntry(WriterImpl.java:1255) ~[hive-exec-0.14.0.jar:0.14.0]
>   at org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775) ~[hive-exec-0.14.0.jar:0.14.0]
>   at org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775) ~[hive-exec-0.14.0.jar:0.14.0]
>   at org.apache.hadoop.hive.ql.io.orc.WriterImpl.createRowIndexEntry(WriterImpl.java:1978) ~[hive-exec-0.14.0.jar:0.14.0]
>   at org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:1985) ~[hive-exec-0.14.0.jar:0.14.0]
>   at org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:322) ~[hive-exec-0.14.0.jar:0.14.0]
>   at org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:168) ~[hive-exec-0.14.0.jar:0.14.0]
>   at org.apache.hadoop.hive.ql.io.orc.MemoryManager.addedRow(MemoryManager.java:157) ~[hive-exec-0.14.0.jar:0.14.0]
>   at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2276) ~[hive-exec-0.14.0.jar:
>
>
> Versions:
>
> Hadoop: apache 2.2.0
> Hive Apache: 0.14.0
> Java 1.7
>
>
> Thanks for your time reading this email.
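
To make the multi-threading question concrete: the guard in the serialize() quoted in the reply relies on the value count and the minimum always being updated together. The sketch below is only an illustration of that invariant, not Hive code. StringStats, bumpCount, mergeMinimum and the "min=" output format are all hypothetical names invented for the example, and Objects.requireNonNull stands in for the generated protobuf setter rejecting null. It replays, in a single thread, one interleaving that two unsynchronized threads could produce; it does not show that this is what happens in the reporter's application.

```java
import java.util.Objects;

class StatsInvariantSketch {

    /** Hypothetical stand-in for per-column string statistics (not Hive's class). */
    static final class StringStats {
        private long count;      // non-null values seen so far
        private String minimum;  // stays null until the first value is merged

        // An update is split into two steps so another call can be placed between them.
        void bumpCount() {
            count++;
        }

        void mergeMinimum(String value) {
            if (minimum == null || value.compareTo(minimum) < 0) {
                minimum = value;
            }
        }

        /** Same shape of guard as the quoted serialize(): read minimum only if count != 0. */
        String serialize() {
            if (count == 0) {
                return "empty";
            }
            // Assumption: stands in for a setter that throws on a null argument.
            return "min=" + Objects.requireNonNull(minimum, "minimum");
        }
    }

    /** Single-threaded use: count and minimum always move together. */
    static String sequential() {
        StringStats stats = new StringStats();
        stats.bumpCount();
        stats.mergeMinimum("b");
        stats.bumpCount();
        stats.mergeMinimum("a");
        return stats.serialize();
    }

    /**
     * Deterministic replay of a possible race: "thread A" has bumped the count
     * but not yet merged the value when "thread B" serializes the statistics.
     */
    static boolean interleavedFails() {
        StringStats stats = new StringStats();
        stats.bumpCount();          // thread A, first half of an update
        try {
            stats.serialize();      // thread B flushes before A's second half
            return false;
        } catch (NullPointerException e) {
            return true;            // count != 0 but minimum is still null
        }
    }

    public static void main(String[] args) {
        System.out.println("sequential:  " + sequential());
        System.out.println("interleaved fails with NPE: " + interleavedFails());
    }
}
```

If something like this is what is happening, the usual remedy would be to make sure each writer is only ever driven from one thread at a time, or to put a lock around every call that touches it.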