kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: where is kudu's dump core located?
Date Wed, 06 Apr 2016 23:34:18 GMT
On Wed, Apr 6, 2016 at 12:53 AM, Darren Hoo <darren.hoo@gmail.com> wrote:

> Todd,
>
> Thanks a lot for such a quick response and fix!
>
> I have some trouble with setting up the build environment for now and
> don't have time to look through the documentation.
>

No problem. If you do get some time and want to get a build environment
going, feel free to ping the list again and we can help if you run into any
issues. Or, you can stop by our slack channel for more real-time
troubleshooting. (I seem to recall you are in China, and there are several
developers in your time zone who can probably help).


>
> so can you send me the binary or where can I download it?  I'd very much
> appreciate that.
>
>
Sure, here is a tserver binary you can try:
http://cloudera-kudu-beta.s3.amazonaws.com/2016-04-06-for-darren-hoo/kudu-tserver.gz

I built this for el6 from the 0.7.1 tag plus a cherry-pick of the bug fix for
your issue. You should be able to drop it in as a replacement on top of your
existing kudu-tserver binary. I'd recommend moving the old one aside to
'kudu-tserver.orig', just in case you run into any issues with it. Of course,
you'll need to do this on every tserver node in your cluster.
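
For reference, the swap on each node is just a matter of moving the old
binary aside and dropping the new one in its place. Roughly (the install path
below is an assumption and will vary by how the package/parcel laid things
out; stop and restart the tablet server with whatever mechanism you normally
use, e.g. Cloudera Manager):

  # paths are illustrative only -- adjust to your install
  gunzip kudu-tserver.gz
  sudo mv /usr/lib/kudu/sbin/kudu-tserver /usr/lib/kudu/sbin/kudu-tserver.orig
  sudo cp kudu-tserver /usr/lib/kudu/sbin/kudu-tserver
  sudo chmod 755 /usr/lib/kudu/sbin/kudu-tserver
  # then restart the tablet server on this node, and repeat on each tserver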

-Todd


>
> On Wed, Apr 6, 2016 at 3:18 PM, Todd Lipcon <todd@cloudera.com> wrote:
>
>> I also put up a patch which should fix the issue here:
>> http://gerrit.cloudera.org:8080/#/c/2725/
>> If you're able to rebuild from source, give it a try. It should apply
>> cleanly on top of 0.7.1.
>>
>> If not, let me know and I can send you a binary to test out.
>>
>> -Todd
>>
>> On Tue, Apr 5, 2016 at 11:21 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>
>>> BTW, I filed https://issues.apache.org/jira/browse/KUDU-1396 for this
>>> bug. Thanks for helping us track it down!
>>>
>>> On Tue, Apr 5, 2016 at 11:05 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>>
>>>> Hi Darren,
>>>>
>>>> Thanks again for the core. I got a chance to look at it, and it looks
>>>> to me like you have a value that is 58KB in size, which is causing the
>>>> issue here. In particular, what seems to have happened is that there is
>>>> an UPDATE delta which is 58KB, and we have a bug in our handling of index
>>>> blocks when a single record is larger than 32KB. The bug causes infinite
>>>> recursion which blows out the stack and crashes with the scenario you saw
>>>> (if you print out the backtrace all the way to stack frame #81872 you can
>>>> see the original call to AppendDelta which starts the recursion).
>>>>
>>>> Amusingly, there is this debug-level assertion in the code:
>>>>
>>>>   size_t est_size = idx_block->EstimateEncodedSize();
>>>>   if (est_size > options_->index_block_size) {
>>>>     DCHECK(idx_block->Count() > 1)
>>>>       << "Index block full with only one entry - this would create "
>>>>       << "an infinite loop";
>>>>     // This index block is full, flush it.
>>>>     BlockPointer index_block_ptr;
>>>>     RETURN_NOT_OK(FinishBlockAndPropagate(level));
>>>>   }
>>>>
>>>> which I wrote way back in October 2012, about three weeks into Kudu's
>>>> initial development. Unfortunately it looks like we never went back to
>>>> actually address the problem, so in release builds it causes a crash
>>>> (rather than an assertion failure, as it would in debug builds).
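>>>>
>>>> To make the failure mode concrete, here is a minimal standalone sketch
>>>> (not the actual Kudu code -- the 32KB limit, names, and cutoff below are
>>>> assumptions for illustration) of how a single oversized entry, guarded
>>>> only by a release-mode no-op DCHECK, turns into unbounded recursion:
>>>>
>>>>   #include <cstddef>
>>>>   #include <cstdio>
>>>>   #include <cstdlib>
>>>>
>>>>   // Stand-in for glog's DCHECK: fatal in debug builds, compiled out
>>>>   // when NDEBUG is defined (i.e. in release builds).
>>>>   #ifndef NDEBUG
>>>>   #define DCHECK(cond) do { if (!(cond)) std::abort(); } while (0)
>>>>   #else
>>>>   #define DCHECK(cond) do { } while (0)
>>>>   #endif
>>>>
>>>>   constexpr size_t kIndexBlockSize = 32 * 1024;  // assumed size limit
>>>>
>>>>   // Toy model with one entry per index block: if that single entry is
>>>>   // already over the limit, flushing the block and propagating its
>>>>   // pointer to the parent just recreates the same oversized situation
>>>>   // one level up, so the writer keeps recursing.
>>>>   void AppendToIndexLevel(size_t entry_size, int level) {
>>>>     if (entry_size > kIndexBlockSize) {
>>>>       DCHECK(false && "index block full with only one entry");
>>>>       std::printf("flushing index block at level %d\n", level);
>>>>       if (level >= 5) return;  // artificial cutoff so the sketch stops;
>>>>                                // the real code had none, hence the ~80k
>>>>                                // stack frames in the core
>>>>       AppendToIndexLevel(entry_size, level + 1);
>>>>     }
>>>>   }
>>>>
>>>>   int main() { AppendToIndexLevel(58 * 1024, 0); }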
>>>>
>>>> I believe that, given this information, we can easily reproduce and fix
>>>> the issue. Unfortunately it's probably too late for the 0.8.0 release,
>>>> which is already being voted upon. Do you think you would be able to
>>>> build from source? If not, we can probably provide you with a patched
>>>> binary off of trunk at some point, if you want to help us verify the fix
>>>> rather than wait a couple of months until the next release.
>>>>
>>>> -Todd
>>>>
>>>>
>>>>
>>>> On Tue, Apr 5, 2016 at 6:33 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>>>
>>>>> On Tue, Apr 5, 2016 at 6:27 PM, Darren Hoo <darren.hoo@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Todd,
>>>>>>
>>>>>> let me try giving a little more details here.
>>>>>>
>>>>>> When I first created the table and loaded about 100k records, the
>>>>>> Kudu tablet server started to crash very often.
>>>>>>
>>>>>> So I suspected that maybe the data file was corrupted, so I dumped
>>>>>> the table as a Parquet file, dropped the table, recreated it, and
>>>>>> imported the Parquet file again.
>>>>>>
>>>>>> But after I did that, the tablet server still crashed often until I
>>>>>> increased the memory limit to 16GB; after that, the tablet server
>>>>>> crashed less often, about once every several days.
>>>>>>
>>>>>> There's one big STRING column in my table, but the column should not
>>>>>> be bigger than 4KB in size, as the Kudu documentation recommends.
>>>>>>
>>>>>
>>>>> OK, that's definitely an interesting part of the story. Although we
>>>>> think that 4KB strings should be OK, the testing of this kind of
>>>>> workload has not been as extensive.
>>>>>
>>>>> If you are able to share the Parquet file and "create table" command
>>>>> for the dataset off-list, that would be great. I'll keep it only within
>>>>> our datacenter and delete it when done debugging.
>>>>>
>>>>>
>>>>>>
>>>>>> I will try to create a minimal dataset to reproduce the issue, but I
>>>>>> am not sure I can create one.
>>>>>>
>>>>>
>>>>> Thanks, that would be great if the larger dataset can't be shared.
>>>>>
>>>>>
>>>>>>
>>>>>> here's the core dump compressed,
>>>>>>
>>>>>> http://188.166.175.200/core.90197.bz2
>>>>>>
>>>>>> the exact Kudu version is: 0.7.1-1.kudu0.7.1.p0.36 (installed from
>>>>>> parcel)
>>>>>>
>>>>>>
>>>>> OK, thank you. I'm downloading it now and will take a look tonight or
>>>>> tomorrow.
>>>>>
>>>>> -Todd
>>>>>
>>>>>
>>>>>> On Wed, Apr 6, 2016 at 8:59 AM, Todd Lipcon <todd@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Darren,
>>>>>>>
>>>>>>> This is interesting. I haven't seen a crash that looks like this,
>>>>>>> and I'm not sure why it would cause data to disappear either.
>>>>>>>
>>>>>>> By any chance do you have some workload that can reproduce the
>>>>>>> issue? e.g. a particular data set that you are loading that seems to
>>>>>>> be causing problems?
>>>>>>>
>>>>>>> Maybe you can gzip the core file and send it to me off-list if it
>>>>>>> isn't too large?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Todd Lipcon
>>>>> Software Engineer, Cloudera
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>>>
>>>
>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
