kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: where is kudu's dump core located?
Date Wed, 06 Apr 2016 06:21:57 GMT
BTW, I filed https://issues.apache.org/jira/browse/KUDU-1396 for this bug.
Thanks for helping us track it down!

On Tue, Apr 5, 2016 at 11:05 PM, Todd Lipcon <todd@cloudera.com> wrote:

> Hi Darren,
>
> Thanks again for the core. I got a chance to look at it, and it looks to
> me like you have a 58KB value which is causing the issue here. In
> particular, what seems to have happened is that there is an UPDATE delta
> which is 58KB, and we have a bug in our handling of index blocks when a
> single record is larger than 32KB. The bug causes an infinite recursion
> which blows out the stack and produces the crash you saw (if you print
> the backtrace all the way out to stack frame #81872 you can see the
> original call to AppendDelta which starts the recursion).
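>
> To make that concrete, here's a minimal self-contained sketch (hypothetical
> names, not Kudu's actual code) of the flush-and-retry pattern: once a single
> entry exceeds the block-size limit, flushing never makes the next attempt
> fit, so the append recurses without bound:
>
>   #include <cstddef>
>   #include <iostream>
>   #include <string>
>   #include <vector>
>
>   const size_t kIndexBlockSize = 32 * 1024;  // the 32KB limit
>
>   struct IndexBlock {
>     std::vector<std::string> entries;
>     size_t EstimateEncodedSize() const {
>       size_t sz = 0;
>       for (const std::string& e : entries) sz += e.size();
>       return sz;
>     }
>   };
>
>   void Append(IndexBlock* block, const std::string& entry, int depth) {
>     block->entries.push_back(entry);
>     if (block->EstimateEncodedSize() <= kIndexBlockSize) return;
>     // "Full" block: flush it and retry the entry in a fresh block. With a
>     // single 58KB entry the fresh block is immediately full again, so the
>     // recursion never terminates and eventually blows out the stack.
>     block->entries.clear();  // stand-in for FinishBlockAndPropagate()
>     if (depth > 5) {  // cap so this sketch exits; the real path has no cap
>       std::cout << "gave up after " << depth << " flushes\n";
>       return;
>     }
>     Append(block, entry, depth + 1);
>   }
>
>   int main() {
>     IndexBlock block;
>     Append(&block, std::string(58 * 1024, 'x'), 0);  // one 58KB delta
>   }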
>
> Amusingly, there is this debug-level assertion in the code:
>
>   size_t est_size = idx_block->EstimateEncodedSize();
>   if (est_size > options_->index_block_size) {
>     DCHECK(idx_block->Count() > 1)
>       << "Index block full with only one entry - this would create "
>       << "an infinite loop";
>     // This index block is full, flush it.
>     BlockPointer index_block_ptr;
>     RETURN_NOT_OK(FinishBlockAndPropagate(level));
>   }
>
> which I wrote way back in October 2012, about 3 weeks into Kudu's initial
> development. Unfortunately, it looks like we never went back to actually
> address the problem, so in release builds it causes a crash (rather than
> an assertion failure as in debug builds).
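>
> To illustrate the debug/release difference: like assert(), a DCHECK-style
> macro is compiled out when NDEBUG is defined, so a release binary skips the
> check entirely and falls through into the runaway recursion. A standalone
> sketch (my own stand-in macro, not glog's actual definition):
>
>   #include <cstdio>
>   #include <cstdlib>
>
>   #ifdef NDEBUG
>   #define DCHECK_SKETCH(cond) ((void)0)  // release: the check vanishes
>   #else
>   #define DCHECK_SKETCH(cond)                              \
>     do {                                                   \
>       if (!(cond)) {                                       \
>         std::fprintf(stderr, "Check failed: %s\n", #cond); \
>         std::abort();                                      \
>       }                                                    \
>     } while (0)
>   #endif
>
>   int main() {
>     int entry_count = 1;  // an index block holding one oversized entry
>     DCHECK_SKETCH(entry_count > 1);  // debug build: aborts right here
>     // Release build: the check above compiles to nothing, so execution
>     // continues into the flush path and recurses until the stack blows.
>     std::puts("check compiled out; recursion would proceed");
>   }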
>
> I believe given this information we can easily reproduce and fix the
> issue. Unfortunately it's probably too late for the 0.8.0 release, which is
> already being voted upon. Do you think you would be able to build from
> source? If not, we can probably provide you with a patched binary off of
> trunk at some point, if you want to help us verify the fix rather than
> wait a couple of months until the next release.
>
> -Todd
>
>
>
> On Tue, Apr 5, 2016 at 6:33 PM, Todd Lipcon <todd@cloudera.com> wrote:
>
>> On Tue, Apr 5, 2016 at 6:27 PM, Darren Hoo <darren.hoo@gmail.com> wrote:
>>
>>> Thanks Todd,
>>>
>>> let me try giving a little more detail here.
>>>
>>> When I first created the table and loaded about 100k records, the Kudu
>>> tablet server started to crash very often.
>>>
>>> So I suspected that maybe the data file was corrupted, and I dumped the
>>> table as a Parquet file, dropped the table, recreated the table, and
>>> imported the Parquet file again.
>>>
>>> But after I did that, the tablet server still crashed often, until I
>>> increased the memory limit to 16GB; after that it crashed less often,
>>> about once every several days.
>>>
>>> There's one big STRING column in my table, but the column should not be
>>> bigger than 4KB in size, as the Kudu documentation recommends.
>>>
>>
>> OK, that's definitely an interesting part of the story. Although we think
>> that 4KB strings should be OK, our testing of this kind of workload has
>> not been as extensive.
>>
>> If you are able to share the Parquet file and "create table" command for
>> the dataset off-list, that would be great. I'll keep it only within our
>> datacenter and delete it when done debugging.
>>
>>
>>>
>>> I will try to create a minimal dataset to reproduce the issue, but I am
>>> not sure I can create one.
>>>
>>
>> Thanks, that would be great if the larger dataset can't be shared.
>>
>>
>>>
>>> here's the core dump, compressed:
>>>
>>> http://188.166.175.200/core.90197.bz2
>>>
>>> the exact Kudu version is 0.7.1-1.kudu0.7.1.p0.36 (installed from a
>>> parcel)
>>>
>>>
>> OK, thank you. I'm downloading it now and will take a look tonight or
>> tomorrow.
>>
>> -Todd
>>
>>
>>> On Wed, Apr 6, 2016 at 8:59 AM, Todd Lipcon <todd@cloudera.com> wrote:
>>>
>>>> Hi Darren,
>>>>
>>>> This is interesting. I haven't seen a crash that looks like this, and
>>>> I'm not sure why it would cause data to disappear either.
>>>>
>>>> By any chance do you have some workload that can reproduce the issue?
>>>> e.g. a particular data set that you are loading that seems to be causing
>>>> problems?
>>>>
>>>> Maybe you can gzip the core file and send it to me off-list if it isn't
>>>> too large?
>>>>
>>>>
>>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera
