hive-user mailing list archives

From Prasanth Jayachandran <pjayachand...@hortonworks.com>
Subject Re: ORC NPE while writing stats
Date Thu, 03 Sep 2015 02:11:27 GMT
The memory manager is made thread-local in HIVE-10191:
https://issues.apache.org/jira/browse/HIVE-10191

Can you try the patch from HIVE-10191 and see if that helps?

On Sep 2, 2015, at 8:58 PM, David Capwell <dcapwell@gmail.com> wrote:


I'll try that out and see if it goes away (I have not seen this in the past 24 hours, with no code change).

Doing this now means that I can't share the memory, so I will probably go with a thread local and
allocate a fixed size to the pool per thread (50% heap / 50 threads). It will most likely be
a while before I can report back (unless it fails fast in testing).
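
For illustration only, the per-thread budget described above (50% of the heap divided among 50 threads) works out as in the following sketch. The class and method names are hypothetical and are not part of Hive or ORC; the two constants are the figures quoted in the message, and the code only does the arithmetic rather than configuring any real memory pool.

```java
public class PerThreadPoolBudget {
    // Figures quoted in the message: half of the heap, split across 50 writer threads.
    static final double HEAP_FRACTION = 0.5;
    static final int WRITER_THREADS = 50;

    /** Bytes that each thread could assign to its own pool (hypothetical helper). */
    static long perThreadBytes(long maxHeapBytes) {
        // Take the heap share first, then divide it evenly (integer division).
        return (long) (maxHeapBytes * HEAP_FRACTION) / WRITER_THREADS;
    }

    public static void main(String[] args) {
        // Runtime.maxMemory() reports the maximum heap the JVM will attempt to use.
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("per-thread pool bytes: " + perThreadBytes(maxHeap));
    }
}
```

With a 4 GiB heap this gives roughly 41 MiB per thread, so each writer would flush its stripes against a much smaller budget than a shared pool would allow.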

On Sep 2, 2015 2:11 PM, "Owen O'Malley" <omalley@apache.org> wrote:
(Dropping dev)

Well, that explains the non-determinism, because the MemoryManager will be shared across threads
and thus the stripes will get flushed at effectively random times.

Can you try giving each writer a unique MemoryManager? You'll need to put a class into the
org.apache.hadoop.hive.ql.io.orc package to get access to the necessary class (MemoryManager)
and method (OrcFile.WriterOptions.memory). We may be missing a synchronization on the MemoryManager
somewhere and thus be getting a race condition.
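
To make the suspected mechanism concrete, here is a toy model, not the Hive code. ToyWriter and ToyManager are hypothetical stand-ins for the writer and the MemoryManager, and the sketch assumes the manager calls back into every registered writer on each added row (the call chain in the stack trace is addRow -> addedRow -> notifyWriters -> checkMemory). It needs Java 8 or later.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

public class SharedManagerModel {
    /** Toy stand-in for a memory manager that notifies every registered writer. */
    static class ToyManager {
        final List<ToyWriter> writers = new CopyOnWriteArrayList<>();
        void register(ToyWriter w) { writers.add(w); }
        void addedRow() { for (ToyWriter w : writers) w.checkMemory(); }
    }

    /** Toy stand-in for a writer; records the name of every thread that touches it. */
    static class ToyWriter {
        final Set<String> touchedBy = ConcurrentHashMap.newKeySet();
        final ToyManager manager;
        ToyWriter(ToyManager manager) { this.manager = manager; manager.register(this); }
        void addRow() {
            touchedBy.add(Thread.currentThread().getName());
            manager.addedRow(); // may call checkMemory() on other threads' writers
        }
        void checkMemory() { touchedBy.add(Thread.currentThread().getName()); }
    }

    /** Runs two threads with one writer each; returns the most threads seen by any writer. */
    static int maxThreadsPerWriter(boolean shareManager) {
        ToyManager shared = new ToyManager();
        ToyWriter[] ws = new ToyWriter[2];
        Thread[] ts = new Thread[2];
        for (int i = 0; i < 2; i++) {
            ws[i] = new ToyWriter(shareManager ? shared : new ToyManager());
            ts[i] = new Thread(ws[i]::addRow, "writer-" + i);
        }
        for (Thread t : ts) t.start();
        try {
            for (Thread t : ts) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return Math.max(ws[0].touchedBy.size(), ws[1].touchedBy.size());
    }

    public static void main(String[] args) {
        System.out.println("shared manager:     " + maxThreadsPerWriter(true));
        System.out.println("manager per writer: " + maxThreadsPerWriter(false));
    }
}
```

In this model, a shared manager lets both threads reach each writer even though the application code never passes a writer between threads. Giving each writer its own manager keeps every writer confined to a single thread.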

Thanks,
   Owen

On Wed, Sep 2, 2015 at 12:57 PM, David Capwell <dcapwell@gmail.com> wrote:

We have multiple threads writing, but each thread works on one file, so each ORC writer is only
touched by one thread (never across threads).

On Sep 2, 2015 11:18 AM, "Owen O'Malley" <omalley@apache.org> wrote:
I don't see how it would get there. That implies that minimum was null, but the count was
non-zero.

The ColumnStatisticsImpl$StringStatisticsImpl.serialize looks like:


@Override
OrcProto.ColumnStatistics.Builder serialize() {
  OrcProto.ColumnStatistics.Builder result = super.serialize();
  OrcProto.StringStatistics.Builder str =
    OrcProto.StringStatistics.newBuilder();
  if (getNumberOfValues() != 0) {
    str.setMinimum(getMinimum());
    str.setMaximum(getMaximum());
    str.setSum(sum);
  }
  result.setStringStatistics(str);
  return result;
}


and thus shouldn't call down to setMinimum unless it had at least some non-null values in
the column.

Do you have multiple threads working? There isn't anything that should be introducing non-determinism
so for the same input it would fail at the same point.

.. Owen



On Tue, Sep 1, 2015 at 10:51 PM, David Capwell <dcapwell@gmail.com> wrote:
We are writing ORC files in our application for Hive to consume.
Given enough time, we have noticed that writing causes an NPE when
working with a string column's stats. Not sure what's causing it on
our side yet, since replaying the same data is just fine; it seems more
like this just happens over time (different data sources will hit this
around the same time in the same JVM).

Here is the code in question, and below is the exception:

final Writer writer = OrcFile.createWriter(path,
    OrcFile.writerOptions(conf).inspector(oi));
try {
  for (Data row : rows) {
    List<Object> struct = Orc.struct(row, inspector);
    writer.addRow(struct);
  }
} finally {
  writer.close();
}


Here is the exception:

java.lang.NullPointerException: null
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$Builder.setMinimum(OrcProto.java:1803) ~[hive-exec-0.14.0.jar:0.14.0]
        at org.apache.hadoop.hive.ql.io.orc.ColumnStatisticsImpl$StringStatisticsImpl.serialize(ColumnStatisticsImpl.java:411) ~[hive-exec-0.14.0.jar:0.14.0]
        at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.createRowIndexEntry(WriterImpl.java:1255) ~[hive-exec-0.14.0.jar:0.14.0]
        at org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775) ~[hive-exec-0.14.0.jar:0.14.0]
        at org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775) ~[hive-exec-0.14.0.jar:0.14.0]
        at org.apache.hadoop.hive.ql.io.orc.WriterImpl.createRowIndexEntry(WriterImpl.java:1978) ~[hive-exec-0.14.0.jar:0.14.0]
        at org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:1985) ~[hive-exec-0.14.0.jar:0.14.0]
        at org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:322) ~[hive-exec-0.14.0.jar:0.14.0]
        at org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:168) ~[hive-exec-0.14.0.jar:0.14.0]
        at org.apache.hadoop.hive.ql.io.orc.MemoryManager.addedRow(MemoryManager.java:157) ~[hive-exec-0.14.0.jar:0.14.0]
        at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2276) ~[hive-exec-0.14.0.jar:


Versions:

Hadoop: Apache 2.2.0
Hive: Apache 0.14.0
Java: 1.7


Thanks for your time reading this email.



