kudu-user mailing list archives

From Darren Hoo <darren....@gmail.com>
Subject Re: where is kudu's dump core located?
Date Fri, 08 Apr 2016 01:53:06 GMT
I am already using your binary now, so far so good.
I'll report back if there are any further crashes.

And yes, I am in Beijing, China.

Thanks again for your kind help!

On Thu, Apr 7, 2016 at 7:34 AM, Todd Lipcon <todd@cloudera.com> wrote:

> On Wed, Apr 6, 2016 at 12:53 AM, Darren Hoo <darren.hoo@gmail.com> wrote:
>
>> Todd,
>>
>> Thanks a lot for such a quick response and fix!
>>
>> I am having some trouble setting up the build environment for now and
>> don't have time to look through the documentation.
>>
>
> No problem. If you do get some time and want to get a build environment
> going, feel free to ping the list again and we can help if you run into any
> issues. Or, you can stop by our slack channel for more real-time
> troubleshooting. (I seem to recall you are in China, and there are several
> developers in your time zone who can probably help).
>
>
>>
>> So could you send me the binary, or tell me where I can download it? I'd
>> very much appreciate that.
>>
>>
> Sure, here is a tserver binary you can try:
>
> http://cloudera-kudu-beta.s3.amazonaws.com/2016-04-06-for-darren-hoo/kudu-tserver.gz
>
> I built this for el6 from the 0.7.1 tag plus cherry-picking the bug fix
> for your issue. You should be able to drop this in as a replacement on top
> of your existing kudu-tserver binary. I'd recommend just moving aside the
> old one to 'kudu-tserver.orig' just in case you run into any issues with
> it. Of course you'll need to do this on every tserver node in your cluster.
>
> -Todd
>
>
>>
>> On Wed, Apr 6, 2016 at 3:18 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>
>>> I also put up a patch which should fix the issue here:
>>> http://gerrit.cloudera.org:8080/#/c/2725/
>>> If you're able to rebuild from source, give it a try. It should apply
>>> cleanly on top of 0.7.1.
>>>
>>> If not, let me know and I can send you a binary to test out.
>>>
>>> -Todd
>>>
>>> On Tue, Apr 5, 2016 at 11:21 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>>
>>>> BTW, I filed https://issues.apache.org/jira/browse/KUDU-1396 for this
>>>> bug. Thanks for helping us track it down!
>>>>
>>>> On Tue, Apr 5, 2016 at 11:05 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>>>
>>>>> Hi Darren,
>>>>>
>>>>> Thanks again for the core. I got a chance to look at it, and it looks
>>>>> to me like you have a value which is 58KB large which is causing the
>>>>> issue here. In particular, what seems to have happened is that there is
>>>>> an UPDATE delta which is 58KB, and we have a bug in our handling of index
>>>>> blocks when a single record is larger than 32KB. The bug causes an
>>>>> infinite recursion which blows out the stack and crashes with the
>>>>> scenario you saw (if you print out the backtrace all the way to stack
>>>>> frame #81872 you can see the original call to AppendDelta which starts
>>>>> the recursion).
>>>>>
>>>>> Amusingly, there is this debug-level assertion in the code:
>>>>>
>>>>>   size_t est_size = idx_block->EstimateEncodedSize();
>>>>>   if (est_size > options_->index_block_size) {
>>>>>     DCHECK(idx_block->Count() > 1)
>>>>>       << "Index block full with only one entry - this would create "
>>>>>       << "an infinite loop";
>>>>>     // This index block is full, flush it.
>>>>>     BlockPointer index_block_ptr;
>>>>>     RETURN_NOT_OK(FinishBlockAndPropagate(level));
>>>>>   }
>>>>>
>>>>> which I wrote way back in October 2012 about 3 weeks into Kudu's
>>>>> initial development. Unfortunately it looks like we never went back to
>>>>> actually address the problem, and in release builds, it causes a crash
>>>>> (rather than an assertion failure in debug builds).
>>>>>
>>>>> I believe given this information we can easily reproduce and fix the
>>>>> issue. Unfortunately it's probably too late for the 0.8.0 release, which
>>>>> is already being voted upon. Do you think you would be able to build from
>>>>> source? If not, we can probably provide you with a patched binary off of
>>>>> trunk at some point if you want to help us verify the fix rather than
>>>>> wait a couple months until the next release.
>>>>>
>>>>> -Todd
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Apr 5, 2016 at 6:33 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>>>>
>>>>>> On Tue, Apr 5, 2016 at 6:27 PM, Darren Hoo <darren.hoo@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Todd,
>>>>>>>
>>>>>>> Let me try giving a few more details here.
>>>>>>>
>>>>>>> When I first created the table and loaded about 100k records, the Kudu
>>>>>>> tablet server started to crash, and very often.
>>>>>>>
>>>>>>> So I suspected that the data file might be corrupted. I dumped the
>>>>>>> table as a Parquet file, dropped the table, recreated it, and imported
>>>>>>> the Parquet file again.
>>>>>>>
>>>>>>> But after I did that, the tablet server still crashed often until I
>>>>>>> increased the memory limit to 16GB. Since then the tablet server
>>>>>>> crashes less often, about once in several days.
>>>>>>>
>>>>>>> There's one big STRING column in my table, but the values in that
>>>>>>> column should not be bigger than 4k in size, as the Kudu documentation
>>>>>>> recommends.
>>>>>>>
>>>>>>
>>>>>> OK, that's definitely an interesting part of the story. Although we
>>>>>> think that 4k strings should be OK, the testing in this kind of workload
>>>>>> has not been as extensive.
>>>>>>
>>>>>> If you are able to share the Parquet file and "create table" command
>>>>>> for the dataset off-list, that would be great. I'll keep it only within
>>>>>> our datacenter and delete it when done debugging.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> I will try to create a minimal dataset to reproduce the issue, but I
>>>>>>> am not sure I can create one.
>>>>>>>
>>>>>>
>>>>>> Thanks, that would be great if the larger dataset can't be shared.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Here's the compressed core dump:
>>>>>>>
>>>>>>> http://188.166.175.200/core.90197.bz2
>>>>>>>
>>>>>>> The exact Kudu version is 0.7.1-1.kudu0.7.1.p0.36 (installed from
>>>>>>> parcel).
>>>>>>>
>>>>>>>
>>>>>> OK, thank you. I'm downloading it now and will take a look tonight or
>>>>>> tomorrow.
>>>>>>
>>>>>> -Todd
>>>>>>
>>>>>>
>>>>>>> On Wed, Apr 6, 2016 at 8:59 AM, Todd Lipcon <todd@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Darren,
>>>>>>>>
>>>>>>>> This is interesting. I haven't seen a crash that looks like this,
>>>>>>>> and not sure why it would cause data to disappear either.
>>>>>>>>
>>>>>>>> By any chance do you have some workload that can reproduce the
>>>>>>>> issue? e.g. a particular data set that you are loading that seems to
>>>>>>>> be causing problems?
>>>>>>>>
>>>>>>>> Maybe you can gzip the core file and send it to me off-list if it
>>>>>>>> isn't too large?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Todd Lipcon
>>>>>> Software Engineer, Cloudera
>>>>>>
>
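The failure mode described in the quoted analysis (a single index entry larger than the index block size, so that every flush pushes the same oversized key one level up) can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyIndexBuilder`, `RecursionRunsAway`, the byte-count size estimate, and the depth cap are all hypothetical names and simplifications, not Kudu's code. The guard shown is one plausible way to break the cycle and is not claimed to be what the patch in the thread does.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model of a multi-level index builder. Every name here is hypothetical;
// this is not Kudu's actual code.
class ToyIndexBuilder {
 public:
  ToyIndexBuilder(std::size_t block_size, bool guard, std::size_t max_depth)
      : block_size_(block_size), guard_(guard), max_depth_(max_depth),
        hit_cap_(false) {}

  // Adds a key to the pending block at `level`, flushing it if it is full.
  void Append(const std::string& key, std::size_t level) {
    if (level >= max_depth_) {  // demo-only cap standing in for stack overflow
      hit_cap_ = true;
      return;
    }
    if (level >= levels_.size()) levels_.resize(level + 1);
    levels_[level].push_back(key);
    if (EstimateSize(levels_[level]) <= block_size_) return;
    // Assumed guard: a block holding a single oversized entry is left open,
    // because flushing it would push the same oversized key one level up.
    if (guard_ && levels_[level].size() <= 1) return;
    const std::string first_key = levels_[level].front();
    levels_[level].clear();        // "flush" the full block
    Append(first_key, level + 1);  // propagate its first key upward
  }

  bool hit_cap() const { return hit_cap_; }
  std::size_t depth() const { return levels_.size(); }

 private:
  // Simplified size estimate: sum of key lengths in the pending block.
  static std::size_t EstimateSize(const std::vector<std::string>& block) {
    std::size_t total = 0;
    for (const std::string& k : block) total += k.size();
    return total;
  }

  std::size_t block_size_;
  bool guard_;
  std::size_t max_depth_;
  bool hit_cap_;
  std::vector<std::vector<std::string>> levels_;
};

// Appends one 58KB key to a builder with 32KB blocks (the sizes mentioned in
// the thread) and reports whether the recursion reached the demo depth cap.
bool RecursionRunsAway(bool guard) {
  ToyIndexBuilder builder(32 * 1024, guard, 100);
  builder.Append(std::string(58 * 1024, 'x'), 0);
  return builder.hit_cap();
}
```

In this model, `RecursionRunsAway(false)` recurses until the cap is reached, which mirrors the unbounded recursion described in the thread, while `RecursionRunsAway(true)` stops at level 0 because the single oversized entry is never flushed on its own.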
