kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: where is kudu's dump core located?
Date Wed, 06 Apr 2016 07:18:08 GMT
I also put up a patch which should fix the issue here:
http://gerrit.cloudera.org:8080/#/c/2725/
If you're able to rebuild from source, give it a try. It should apply
cleanly on top of 0.7.1.

If not, let me know and I can send you a binary to test out.

-Todd

On Tue, Apr 5, 2016 at 11:21 PM, Todd Lipcon <todd@cloudera.com> wrote:

> BTW, I filed https://issues.apache.org/jira/browse/KUDU-1396 for this
> bug. Thanks for helping us track it down!
>
> On Tue, Apr 5, 2016 at 11:05 PM, Todd Lipcon <todd@cloudera.com> wrote:
>
>> Hi Darren,
>>
>> Thanks again for the core. I got a chance to look at it, and it looks to
>> me like a 58KB value is causing the issue here. In particular, what seems
>> to have happened is that there is an UPDATE delta which is 58KB, and we
>> have a bug in our handling of index blocks when a single record is larger
>> than 32KB. The bug causes an infinite recursion which blows out the stack
>> and produces the crash you saw (if you print out the backtrace all the way
>> to stack frame #81872 you can see the original call to AppendDelta which
>> starts the recursion).
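
To make the failure mode concrete, here is a minimal toy sketch of how a single entry larger than the block size limit can recurse without end. It is not Kudu's actual code: ToyIndexBuilder, kMaxLevels, and the keep_single_entry_blocks guard are hypothetical names invented for illustration, and the guard is only one possible way to stop the recursion, not necessarily what the patch does. The depth cap exists only so the demo terminates instead of overflowing the stack.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cap so this demo stops instead of exhausting the stack.
constexpr std::size_t kMaxLevels = 1000;

// Toy model of a multi-level index builder. Each level holds one in-progress
// block, represented only by the sizes of the entries it contains.
class ToyIndexBuilder {
 public:
  ToyIndexBuilder(std::size_t block_size_limit, bool keep_single_entry_blocks)
      : block_size_limit_(block_size_limit),
        keep_single_entry_blocks_(keep_single_entry_blocks) {
    assert(block_size_limit > 0);
  }

  // Appends an entry of key_size bytes at the given level. Returns false if
  // the depth cap was hit, i.e. where unbounded recursion would continue.
  bool Append(std::size_t key_size, std::size_t level = 0) {
    if (level >= kMaxLevels) return false;
    if (levels_.size() <= level) levels_.resize(level + 1);
    levels_[level].push_back(key_size);

    std::size_t est_size = 0;
    for (std::size_t s : levels_[level]) est_size += s;
    if (est_size <= block_size_limit_) return true;  // block not full yet

    // Assumed guard: never flush a block that holds only one entry.
    if (keep_single_entry_blocks_ && levels_[level].size() == 1) return true;

    // Flush: the parent level receives an entry keyed by this block's first
    // key. If that key alone exceeds the limit, the parent is "full" at once
    // and flushes too, and so on upward.
    const std::size_t first_key_size = levels_[level].front();
    levels_[level].clear();
    return Append(first_key_size, level + 1);
  }

  std::size_t num_levels() const { return levels_.size(); }

 private:
  std::size_t block_size_limit_;
  bool keep_single_entry_blocks_;
  std::vector<std::vector<std::size_t>> levels_;
};
```

With a 32KB limit and no guard, appending one 58KB entry climbs through every level until the cap is reached. With the guard enabled, the same append stays at a single level.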
>>
>> Amusingly, there is this debug-level assertion in the code:
>>
>>   size_t est_size = idx_block->EstimateEncodedSize();
>>   if (est_size > options_->index_block_size) {
>>     DCHECK(idx_block->Count() > 1)
>>       << "Index block full with only one entry - this would create "
>>       << "an infinite loop";
>>     // This index block is full, flush it.
>>     BlockPointer index_block_ptr;
>>     RETURN_NOT_OK(FinishBlockAndPropagate(level));
>>   }
>>
>> which I wrote way back in October 2012 about 3 weeks into Kudu's initial
>> development. Unfortunately it looks like we never went back to actually
>> address the problem, and in release builds, it causes a crash (rather than
>> an assertion failure in debug builds).
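
As an illustration of why a debug-only check gives no protection in release builds, here is a minimal sketch that contrasts it with a check that runs in every build type. It assumes the debug check is compiled out when NDEBUG is defined, as the standard assert macro is. ToyStatus, DebugChecksCompiledIn, and ValidateFlush are hypothetical names for this sketch only; they are not Kudu APIs and not the actual fix.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch; Kudu's real Status class is richer.
struct ToyStatus {
  bool ok;
  std::string message;
};

// Reports whether assert-style debug checks are compiled into this build.
// Release builds typically define NDEBUG, which removes them entirely.
constexpr bool DebugChecksCompiledIn() {
#ifdef NDEBUG
  return false;
#else
  return true;
#endif
}

// Always-on validation: the single-entry case is reported to the caller in
// every build type rather than being caught only by a debug assertion.
ToyStatus ValidateFlush(std::size_t entry_count, std::size_t est_size,
                        std::size_t block_size_limit) {
  // Debug-only precondition: this line vanishes when NDEBUG is defined.
  assert(block_size_limit > 0);

  if (est_size <= block_size_limit) {
    return {true, "block not full"};
  }
  if (entry_count > 1) {
    return {true, "block full, safe to flush"};
  }
  return {false, "index block full with only one entry"};
}
```

A caller could test the returned ok flag before flushing, so an oversized single entry becomes a reportable error rather than a recursion that only debug builds would notice.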
>>
>> I believe given this information we can easily reproduce and fix the
>> issue. Unfortunately it's probably too late for the 0.8.0 release, which is
>> already being voted upon. Do you think you would be able to build from
>> source? If not, we can probably provide you with a patched binary off of
>> trunk at some point if you want to help us verify the fix rather than wait
>> a couple months until the next release.
>>
>> -Todd
>>
>>
>>
>> On Tue, Apr 5, 2016 at 6:33 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>
>>> On Tue, Apr 5, 2016 at 6:27 PM, Darren Hoo <darren.hoo@gmail.com> wrote:
>>>
>>>> Thanks Todd,
>>>>
>>>> Let me try to give a few more details here.
>>>>
>>>> When I first created the table and loaded about 100k records, the Kudu
>>>> tablet server started to crash, and very often.
>>>>
>>>> I suspected that the data file might be corrupted, so I dumped the table
>>>> as a Parquet file, dropped the table, recreated it, and imported the
>>>> Parquet file again.
>>>>
>>>> But after I did that, the tablet server still crashed often until I
>>>> increased the memory limit to 16GB. Since then it has crashed less
>>>> often, about once in several days.
>>>>
>>>> There's one big STRING column in my table, but its values should not be
>>>> bigger than 4k in size, as the Kudu documentation recommends.
>>>>
>>>
>>> OK, that's definitely an interesting part of the story. Although we
>>> think that 4k strings should be OK, the testing in this kind of workload
>>> has not been as extensive.
>>>
>>> If you are able to share the Parquet file and "create table" command for
>>> the dataset off-list, that would be great. I'll keep it only within our
>>> datacenter and delete it when done debugging.
>>>
>>>
>>>>
>>>> I will try to create a minimal dataset to reproduce the issue, but I am
>>>> not sure I can create one.
>>>>
>>>
>>> Thanks, that would be great if the larger dataset can't be shared.
>>>
>>>
>>>>
>>>> Here's the compressed core dump:
>>>>
>>>> http://188.166.175.200/core.90197.bz2
>>>>
>>>> The exact Kudu version is: 0.7.1-1.kudu0.7.1.p0.36 (installed from
>>>> parcel)
>>>>
>>>>
>>> OK, thank you. I'm downloading it now and will take a look tonight or
>>> tomorrow.
>>>
>>> -Todd
>>>
>>>
>>>> On Wed, Apr 6, 2016 at 8:59 AM, Todd Lipcon <todd@cloudera.com> wrote:
>>>>
>>>>> Hi Darren,
>>>>>
>>>>> This is interesting. I haven't seen a crash that looks like this, and
>>>>> I'm not sure why it would cause data to disappear either.
>>>>>
>>>>> By any chance do you have some workload that can reproduce the issue?
>>>>> e.g. a particular data set that you are loading that seems to be causing
>>>>> problems?
>>>>>
>>>>> Maybe you can gzip the core file and send it to me off-list if it
>>>>> isn't too large?
>>>>>
>>>>>
>>>>>
>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera
