hive-user mailing list archives

From David Capwell <dcapw...@gmail.com>
Subject Re: ORC NPE while writing stats
Date Thu, 03 Sep 2015 04:47:58 GMT
Walking through the MemoryManager code, I have a few questions:

# statements

Every time you create a writer on a given thread (assuming the
thread-local version), you just register the stripe size with the
MemoryManager. The scale is then just the pool size (%heap) divided by
(#writers * stripe size), assuming equal stripe sizes.

Periodically, ORC checks whether the estimated amount of buffered data
exceeds stripe * scale; if so, it flushes the stripe right away. When
the flush happens, it checks how close it is to the end of a block and
scales the next stripe based on that.
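A minimal sketch of the accounting described above; the class and field names here are illustrative guesses, not the actual Hive internals:

```java
// Hedged sketch of the MemoryManager pool accounting. Names
// (poolSize, totalAllocation, scale) are illustrative only.
public class MemoryManagerSketch {
    private final long poolSize;      // e.g. heapMax * orc.memory.pool
    private long totalAllocation = 0; // sum of registered stripe sizes
    private double scale = 1.0;

    public MemoryManagerSketch(long poolSizeBytes) {
        this.poolSize = poolSizeBytes;
    }

    // Opening a writer registers its stripe size; once the pool is
    // oversubscribed, every writer's effective stripe shrinks by scale.
    public void addWriter(long stripeSize) {
        totalAllocation += stripeSize;
        scale = (totalAllocation <= poolSize)
                ? 1.0
                : (double) poolSize / totalAllocation;
    }

    // A writer flushes early once its estimated buffered size exceeds
    // its scaled stripe size.
    public boolean shouldFlush(long estimatedSize, long stripeSize) {
        return estimatedSize > stripeSize * scale;
    }
}
```

With one 64-unit writer in a 100-unit pool the scale stays 1.0; adding a second 64-unit writer oversubscribes the pool and drops the scale to 100/128, so each writer flushes around 50 units instead of 64.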

# question assuming statements are correct

So, in my case I only have one writer per thread at any point in time.
If the MM is partitioned per thread, do I really care about the % set
for the pool size? Since ORC appears to flush stripes early, wouldn't
it make sense to figure out how many concurrent writers I have and how
much memory I want to allocate, then set the stripe size from that?

So for 50 threads and a 64 MB stripe size, 3,200 MB would be required?
Then, as long as I make sure the rest of my application leaves enough
room for ORC, I can just leave the pool value at its default and let
the stripe size drive memory use...
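The arithmetic above can be sanity-checked with a trivial sketch (the helper name is mine, not an ORC API):

```java
// Back-of-envelope sizing: with one writer per thread and equal
// stripe sizes, peak stripe-buffer memory is roughly
// threads * stripeSize (ignoring compression buffers and other
// per-writer overhead).
public class OrcSizingSketch {
    public static long requiredBytes(int threads, long stripeSizeBytes) {
        return threads * stripeSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 50 threads * 64 MB stripes = 3,200 MB of stripe buffers
        System.out.println(requiredBytes(50, 64 * mb) / mb + " MB");
    }
}
```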

So, if that's right, the MM doesn't really do anything for me, and
there's no issue with sharding it per thread and not configuring it?


Thanks for your time reading this email!

On Wed, Sep 2, 2015 at 8:57 PM, David Capwell <dcapwell@gmail.com> wrote:
> So, I very quickly looked at the JIRA and had the following question:
> if there is a pool per thread rather than a global one, then assuming 50%
> heap will cause the writers to OOM with multiple threads, which is
> different from the older (0.14) ORC, correct?
>
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcConf.java#L83
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java#L94
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcFile.java#L226
>
> So with orc.memory.pool=0.5, this value only seems to make sense when
> single-threaded; if you are writing with multiple threads, then I
> assume the value should be (0.5 / #threads), so with 50 threads,
> 0.01 should be the value?
>
> If this is true, I can't find any documentation about it; all the docs
> make it sound global.
>
> On Wed, Sep 2, 2015 at 7:34 PM, David Capwell <dcapwell@gmail.com> wrote:
>> Thanks for the jira, will see if that works for us.
>>
>> On Sep 2, 2015 7:11 PM, "Prasanth Jayachandran"
>> <pjayachandran@hortonworks.com> wrote:
>>>
>>> Memory manager is made thread local
>>> https://issues.apache.org/jira/browse/HIVE-10191
>>>
>>> Can you try the patch from HIVE-10191 and see if that helps?
>>>
>>> On Sep 2, 2015, at 8:58 PM, David Capwell <dcapwell@gmail.com> wrote:
>>>
>>> I'll try that out and see if it goes away (not seen this in the past 24
>>> hours, no code change).
>>>
>>> Doing this now means that I can't share the memory, so I will probably go
>>> with a thread local and allocate fixed sizes to the pool per thread (50%
>>> heap / 50 threads).  It will most likely be a while before I can report
>>> back (unless it fails fast in testing)
>>>
>>> On Sep 2, 2015 2:11 PM, "Owen O'Malley" <omalley@apache.org> wrote:
>>>>
>>>> (Dropping dev)
>>>>
>>>> Well, that explains the non-determinism, because the MemoryManager will
>>>> be shared across threads and thus the stripes will get flushed at
>>>> effectively random times.
>>>>
>>>> Can you try giving each writer a unique MemoryManager? You'll need to put
>>>> a class into the org.apache.hadoop.hive.ql.io.orc package to get access to
>>>> the necessary class (MemoryManager) and method
>>>> (OrcFile.WriterOptions.memory). We may be missing a synchronization on the
>>>> MemoryManager somewhere and thus be getting a race condition.
>>>>
>>>> Thanks,
>>>>    Owen
>>>>
>>>> On Wed, Sep 2, 2015 at 12:57 PM, David Capwell <dcapwell@gmail.com>
>>>> wrote:
>>>>>
>>>>> We have multiple threads writing, but each thread works on one file,
>>>>> so the orc writer is only touched by one thread (never cross threads)
>>>>>
>>>>> On Sep 2, 2015 11:18 AM, "Owen O'Malley" <omalley@apache.org> wrote:
>>>>>>
>>>>>> I don't see how it would get there. That implies that the minimum was
>>>>>> null, but the count was non-zero.
>>>>>>
>>>>>> The ColumnStatisticsImpl$StringStatisticsImpl.serialize looks like:
>>>>>>
>>>>>> @Override
>>>>>> OrcProto.ColumnStatistics.Builder serialize() {
>>>>>>   OrcProto.ColumnStatistics.Builder result = super.serialize();
>>>>>>   OrcProto.StringStatistics.Builder str =
>>>>>>     OrcProto.StringStatistics.newBuilder();
>>>>>>   if (getNumberOfValues() != 0) {
>>>>>>     str.setMinimum(getMinimum());
>>>>>>     str.setMaximum(getMaximum());
>>>>>>     str.setSum(sum);
>>>>>>   }
>>>>>>   result.setStringStatistics(str);
>>>>>>   return result;
>>>>>> }
>>>>>>
>>>>>> and thus shouldn't call down to setMinimum unless it had at least
>>>>>> some non-null values in the column.
>>>>>>
>>>>>> Do you have multiple threads working? There isn't anything that should
>>>>>> be introducing non-determinism, so for the same input it would fail at
>>>>>> the same point.
>>>>>>
>>>>>> .. Owen
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 1, 2015 at 10:51 PM, David Capwell <dcapwell@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> We are writing ORC files in our application for hive to consume.
>>>>>>> Given enough time, we have noticed that writing causes a NPE when
>>>>>>> working with a string column's stats.  Not sure what's causing it on
>>>>>>> our side yet since replaying the same data is just fine; it seems more
>>>>>>> like this just happens over time (different data sources will hit this
>>>>>>> around the same time in the same JVM).
>>>>>>>
>>>>>>> Here is the code in question, and below is the exception:
>>>>>>>
>>>>>>> final Writer writer = OrcFile.createWriter(path,
>>>>>>>     OrcFile.writerOptions(conf).inspector(oi));
>>>>>>> try {
>>>>>>>   for (Data row : rows) {
>>>>>>>     List<Object> struct = Orc.struct(row, inspector);
>>>>>>>     writer.addRow(struct);
>>>>>>>   }
>>>>>>> } finally {
>>>>>>>   writer.close();
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> Here is the exception:
>>>>>>>
>>>>>>> java.lang.NullPointerException: null
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$Builder.setMinimum(OrcProto.java:1803)
>>>>>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.io.orc.ColumnStatisticsImpl$StringStatisticsImpl.serialize(ColumnStatisticsImpl.java:411)
>>>>>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.createRowIndexEntry(WriterImpl.java:1255)
>>>>>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775)
>>>>>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775)
>>>>>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.createRowIndexEntry(WriterImpl.java:1978)
>>>>>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:1985)
>>>>>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:322)
>>>>>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:168)
>>>>>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.io.orc.MemoryManager.addedRow(MemoryManager.java:157)
>>>>>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>>>>>>         at
>>>>>>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2276)
>>>>>>> ~[hive-exec-0.14.0.jar:
>>>>>>>
>>>>>>>
>>>>>>> Versions:
>>>>>>>
>>>>>>> Hadoop: Apache 2.2.0
>>>>>>> Hive: Apache 0.14.0
>>>>>>> Java: 1.7
>>>>>>>
>>>>>>>
>>>>>>> Thanks for your time reading this email.
>>>>>>
>>>>>>
>>>>
>>>
>>
