kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: About data file size and on-disk size
Date Mon, 12 Dec 2016 06:26:11 GMT
Just a follow-up note here: if you did end up cherry-picking that change,
you should also be sure to cherry-pick
faa587c639aa9e5dcf3fac04259f46ba1921140a to avoid a potential data loss bug.

On Wed, Nov 30, 2016 at 9:00 AM, Adar Dembo <adar@cloudera.com> wrote:

> If you're comfortable rebuilding Kudu from source, you can apply
> https://gerrit.cloudera.org/#/c/5254, rebuild the tserver, and restart
> it. Once the tserver is done restarting, it should trim the empty space off
> of the ends of all of your container data files.
>
> Otherwise, you'll have to wait until the next Kudu release.
>
> On Tue, Nov 29, 2016 at 5:48 PM, 阿香 <1654407779@qq.com> wrote:
>
>>
>> Hi Todd,
>>
>> Thanks.
>> From the results, I think you have correctly identified the bug.
>> By the way, can I get back the wasted disk space?
>>
>>
>> # du -sm 542d51e55d524034a5274600c31abd11.data
>> 29 542d51e55d524034a5274600c31abd11.data
>>
>> # filefrag -v -b 542d51e55d524034a5274600c31abd11.data
>>
>> filefrag: -b needs a blocksize option, assuming 1024-byte blocks.
>> Filesystem type is: ef53
>> File size of 542d51e55d524034a5274600c31abd11.data is 10767867904 (10515496 blocks of 1024 bytes)
>>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>>    0: 10486144..10497543:  278086588.. 278097987:  11400:             unwritten
>>    1: 10497544..10514191:  278691588.. 278708235:  16648:  278097988: unwritten
>>    2: 10514192..10514199:  279581160.. 279581167:      8:  278708236: unwritten
>>    3: 10514200..10514203:  280291284.. 280291287:      4:  279581168: unwritten
>>    4: 10514204..10514227:  280652252.. 280652275:     24:  280291288: unwritten
>>    5: 10514228..10515259:  281289216.. 281290247:   1032:  280652276: unwritten
>>    6: 10515260..10515263:  282068816.. 282068819:      4:  281290248: unwritten
>>    7: 10515264..10515495:  283429184.. 283429415:    232:  282068820: unwritten,eof
>> 542d51e55d524034a5274600c31abd11.data: 8 extents found
>>
>> # echo $[11400 + 16648 + 1032 + 232]
>> 29312
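
The sum above leaves out the four smallest extents. Adding all eight lengths
gives 29352 blocks of 1024 bytes, roughly 28.7 MiB, which agrees with the
29 MB that du reports. A minimal sketch for totalling the extents
automatically, assuming the colon-separated `filefrag -v` layout shown above
(the function name sum_extent_blocks is made up for illustration):

```shell
# Sum the "length" column of `filefrag -v` output read from stdin.
# Extent rows start with an index followed by a colon; with ':' as the
# field separator, the length is the fourth field.
sum_extent_blocks() {
  awk -F: '/^ *[0-9]+:/ { total += $4 } END { print total + 0 }'
}

# Usage (not run here): filefrag -v -b abcdef.data | sum_extent_blocks
```

The result is in the block size that filefrag reports, so multiply by 1024
here to get bytes.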
>>
>> # ls -l 542d51e55d524034a5274600c31abd11.data
>> -rw-r--r-- 1 kudu kudu 10767867904 Oct 26 06:51 542d51e55d524034a5274600c31abd11.data
>>
>> # ls -lh 542d51e55d524034a5274600c31abd11.data
>> -rw-r--r-- 1 kudu kudu 11G Oct 26 06:51 542d51e55d524034a5274600c31abd11.data
>>
>> BR
>> -GU
>>
>> ------------------ Original Message ------------------
>> *From:* "Todd Lipcon";<todd@cloudera.com>;
>> *Sent:* Tuesday, November 29, 2016, 4:15 AM
>> *To:* "user"<user@kudu.apache.org>;
>> *Subject:* Re: About data file size and on-disk size
>>
>> Hi Xiang,
>>
>> Adar and I did some investigation and came up with a likely cause:
>> https://issues.apache.org/jira/browse/KUDU-1764
>>
>> Can you please try the following on one of your .data files? (preferably
>> one which has a modification time a few weeks old?)
>>
>> $ du -sm abcdef.data
>> $ filefrag -v -b abcdef.data
>> $ ls -l abcdef.data
>>
>> We can use this to confirm whether you are hitting the same bug we just
>> discovered.
>>
>> Thanks
>> -Todd
>>
>> On Thu, Nov 24, 2016 at 6:57 AM, 阿香 <1654407779@qq.com> wrote:
>>
>>>
>>> > If the workload doesn't involve normal (merging) compactions, then
>>> > UNDOs won't be GCed at all. So, if you have a relatively static set of
>>> > keys, and are just updating them without causing many new inserts, this
>>> > could be the problem.
>>>
>>> The keys are not static; new keys are being added all the time.
>>> The key of the table is a UUID string with hash partitioning (16 buckets).
>>> Currently there are about 1,000,000,000 rows in this cluster.
>>>
>>> Will these big data files increase the latency of upsert
>>> operations?
>>>
>>> I saw the following metrics in the Kudu web UI.
>>>
>>>             {
>>>                 "name": "write_op_duration_client_propagated_consistency",
>>>                 "total_count": 8568729,
>>>                 "min": 116,
>>>                 "mean": 2499.56,
>>>                 "percentile_75": 2176,
>>>                 "percentile_95": 7680,
>>>                 "percentile_99": 29568,
>>>                 "percentile_99_9": 78336,
>>>                 "percentile_99_99": 123904,
>>>                 "max": 1562967,
>>>                 "total_sum": 21418050385
>>>             }
>>>
>>>
>>>
>>>
>>> ------------------ Original Message ------------------
>>> *From:* "Todd Lipcon";<todd@cloudera.com>;
>>> *Sent:* Thursday, November 24, 2016, 11:55 AM
>>> *To:* "user"<user@kudu.apache.org>;
>>> *Subject:* Re: About data file size and on-disk size
>>>
>>> On Wed, Nov 23, 2016 at 2:30 PM, Adar Dembo <adar@cloudera.com> wrote:
>>>
>>>> The difference between du with --apparent-size and without suggests
>>>> that hole punching is working properly. Quick back of the envelope
>>>> math shows that with 8133 containers, each container is just over 10G
>>>> of "apparent size", which means nearly all of the containers were full
>>>> at one point or another. That makes sense; it means that Kudu is
>>>> generally writing to a small number of containers at any given time,
>>>> but is filling them up over time.
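
Both back-of-the-envelope figures quoted in this thread can be rechecked with
shell integer arithmetic. This is only a sketch: it assumes that du's "81T"
and the 32 MB preallocation are binary units (TiB and MiB), and the variable
names are chosen here for illustration.

```shell
containers=8133

# 81 TiB of apparent size spread evenly over all containers, in GiB each.
per_container_gib=$(( 81 * 1024 / containers ))

# Worst case if every container holds one 32 MiB preallocation, in GiB.
prealloc_total_gib=$(( containers * 32 / 1024 ))

echo "apparent size per container: ~${per_container_gib} GiB"
echo "preallocation worst case: ~${prealloc_total_gib} GiB"
```

Integer division truncates, so the two values come out as 10 and 254, in line
with "just over 10G" per container and "about 254G" in total.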
>>>>
>>>> I took a look at the tablet disk estimation code and found that it
>>>> excludes the size of all of the UNDO data blocks. I think this is
>>>> because the size estimation is also used to drive decisions regarding
>>>> delta compaction, but with an UPSERT-only workload like yours, we'd
>>>> expect to see many UNDO data blocks over time as updated (and now
>>>> historical) data is further and further compacted. I filed
>>>> https://issues.apache.org/jira/browse/KUDU-1755 to track these issues.
>>>> However, if this were the case, I'd expect the "tablet history GC"
>>>> feature (new in Kudu 1.0) to remove old data that was mutated in an
>>>> UPSERT. The default value for --tablet_history_max_age_sec (which
>>>> controls how old the data must be before it is removed) is 15 minutes;
>>>> have you changed the value of this flag? If not, could you look at
>>>> your tserver log for the presence of major delta compactions? Look for
>>>> references to MajorDeltaCompactionOp. If there aren't any, that means
>>>> Kudu isn't getting opportunities to age out old data.
>>>>
>>>
>>> Worth noting that major delta compaction doesn't actually remove old
>>> UNDOs. There are still some open JIRAs about scheduling tasks to age-off
>>> UNDOs, but as it stands today, they only get collected during a normal
>>> compaction.
>>>
>>> If the workload doesn't involve normal (merging) compactions, then UNDOs
>>> won't be GCed at all. So, if you have a relatively static set of keys, and
>>> are just updating them without causing many new inserts, this could be the
>>> problem.
>>>
>>>
>>>>
>>>> It's also possible that simply not accounting for the composite index
>>>> and bloom blocks (see KUDU-1755) is the reason. Take a look at
>>>> https://issues.apache.org/jira/browse/KUDU-624?focusedCommentId=15165054&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15165054
>>>> and run the same two commands to compare the total on-disk size of all
>>>> the .data files to the number of bytes that the tserver is aware of.
>>>> If the two numbers are close, it's a sign that, at the very least,
>>>> Kudu is aware of and actively managing all that disk space (i.e.
>>>> there's no "orphaned" data).
>>>>
>>>
>>> -Todd
>>>
>>>
>>>>
>>>>
>>>>
>>>> On Wed, Nov 23, 2016 at 12:39 AM, 阿香 <1654407779@qq.com> wrote:
>>>> > Hi,
>>>> >
>>>> >> Can you tell us a little bit more about your table, as well as any
>>>> >> deleted tables you once had? How many columns did they have?
>>>> >
>>>> > I have not deleted any tables.
>>>> > There is only one table, with 12 columns (string and int), in the Kudu
>>>> > cluster. This cluster has three tablet servers.
>>>> >
>>>> > I use the upsert operation to insert and update rows.
>>>> >
>>>> >> what version of Kudu are you using?
>>>> >
>>>> > kudu -version
>>>> > kudu 1.0.0
>>>> > revision 6f6e49ca98c3e3be7d81f88ab8a0f9173959b191
>>>> > build type RELEASE
>>>> > built by jenkins at 16 Sep 2016 00:23:10 PST on
>>>> > impala-ec2-pkg-centos-7-0dc0.vpc.cloudera.com
>>>> > build id 2016-09-16_00-03-04
>>>> >
>>>> >> It's conceivable that there's a pathological case wherein each of
>>>> >> the 8133 data files is used, one at a time, to store data blocks,
>>>> >> which would cause each to allocate 32 MB of disk space (totaling
>>>> >> about 254G).
>>>> >
>>>> > Can the number of data files be decreased? The SSD disk is almost out
>>>> > of space now.
>>>> >
>>>> >> Can you try running du with --apparent-size and compare the results?
>>>> >
>>>> > # du -sh /data/kudu/tserver/data/
>>>> > 213G /data/kudu/tserver/data/
>>>> > # du -sh --apparent-size  /data/kudu/tserver/data/
>>>> > 81T /data/kudu/tserver/data/
>>>> >
>>>> >> What filesystem is being used for /data/kudu/tserver/data?
>>>> >
>>>> > # file -s /dev/vdb1
>>>> > /dev/vdb1: Linux rev 1.0 ext4 filesystem data, UUID=9f95ba79-f387-42be-a43f-d1421c83e2e5 (needs journal recovery) (extents) (64bit) (large files) (huge files)
>>>> >
>>>> >
>>>> > Thanks.
>>>> >
>>>> >
>>>> > ------------------ Original Message ------------------
>>>> > From: "Adar Dembo";<adar@cloudera.com>;
>>>> > Sent: Wednesday, November 23, 2016, 9:35 AM
>>>> > To: "user"<user@kudu.apache.org>;
>>>> > Subject: Re: About data file size and on-disk size
>>>> >
>>>> > Also, if you haven't explicitly disabled it, each .data file is going
>>>> > to preallocate 32 MB of data when used. It's conceivable that there's
>>>> > a pathological case wherein each of the 8133 data files is used, one
>>>> > at a time, to store data blocks, which would cause each to allocate
>>>> > 32 MB of disk space (totaling about 254G).
>>>> >
>>>> > Can you tell us a little bit more about your table, as well as any
>>>> > deleted tables you once had? How many columns did they have? Also,
>>>> > what version of Kudu are you using?
>>>> >
>>>> > On Tue, Nov 22, 2016 at 11:39 AM, Adar Dembo <adar@cloudera.com> wrote:
>>>> >> The files in /data/kudu/tserver/data are supposed to be sparse; that
>>>> >> is, when Kudu decides to delete data, it'll punch a hole in one of
>>>> >> those files, allowing the filesystem to reclaim the space in that
>>>> >> hole. Yet, 'du' should reflect that because it measures real space
>>>> >> usage. Can you try running du with --apparent-size and compare the
>>>> >> results? If they're the same or similar, it suggests that the hole
>>>> >> punching behavior isn't working properly. What distribution are you
>>>> >> using? What filesystem is being used for /data/kudu/tserver/data?
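
The distinction drawn above between apparent size and real space usage can
also be checked for a single file. A minimal sketch, assuming GNU coreutils
stat (the function name sparse_report is hypothetical): it prints the
apparent size and the allocated size, both in bytes.

```shell
# Print "<apparent bytes> <allocated bytes>" for one file.
# Assumes GNU stat: %s is the apparent size, %b the number of allocated
# blocks, and %B the size in bytes of each block counted by %b.
sparse_report() {
  apparent=$(stat -c '%s' "$1")
  allocated=$(( $(stat -c '%b' "$1") * $(stat -c '%B' "$1") ))
  echo "$apparent $allocated"
}

# Usage (not run here): sparse_report abcdef.data
```

For a sparse file with working hole punching, the second number should be
much smaller than the first; if the two are close, the holes are not being
reclaimed.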
>>>> >>
>>>> >> You should also check if maybe Kudu has failed to delete the data
>>>> >> belonging to deleted tables. Has this tserver hosted any tablets
>>>> >> belonging to tables that have since been deleted? Does the tserver
>>>> >> log describe any errors when trying to delete the data belonging to
>>>> >> those tablets?
>>>> >>
>>>> >> On Tue, Nov 22, 2016 at 7:19 AM, 阿香 <1654407779@qq.com> wrote:
>>>> >>> Hi,
>>>> >>>
>>>> >>> I have a table with 16 buckets over 3 physical machines. Each tablet
>>>> >>> has only one replica.
>>>> >>>
>>>> >>> The Tablets web UI shows that each tablet has an on-disk size of
>>>> >>> around 4.5G.
>>>> >>>
>>>> >>> On one machine there are 8 tablets in total, so the on-disk size
>>>> >>> should be about 4.5*8 = 36G.
>>>> >>>
>>>> >>> However, on the same machine, the disk space actually used is about
>>>> >>> 211G.
>>>> >>>
>>>> >>> # du -sh /data/kudu/tserver/data/
>>>> >>> 210G /data/kudu/tserver/data/
>>>> >>>
>>>> >>> # find /data/kudu/tserver/data/ -name "*.data" | wc -l
>>>> >>> 8133
>>>> >>>
>>>> >>> What's the difference between the data file size and the on-disk size?
>>>> >>>
>>>> >>> Can the files in /data/kudu/tserver/data/ be compacted or purged, or
>>>> >>> can some of them be deleted?
>>>> >>>
>>>> >>>
>>>> >>> Thanks very much.
>>>> >>>
>>>> >>>
>>>> >>> BR
>>>> >>>
>>>> >>> Brooks
>>>> >>>
>>>> >>>
>>>> >>>
>>>>
>>>
>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
