kudu-user mailing list archives

From "阿香" <1654407...@qq.com>
Subject Re: About data file size and on-disk size
Date Mon, 12 Dec 2016 09:41:13 GMT
Todd, 


Thanks. 
I have not yet tried to reclaim the empty space from the containers.
I will try it later this month.


By the way, when will Kudu's next release come out? Will the 1.2 release in
mid-January include this fix?


Thanks.
BR
-GU




------------------ Original Message ------------------
From: "Todd Lipcon" <todd@cloudera.com>
Sent: Monday, December 12, 2016, 2:26 PM
To: "user" <user@kudu.apache.org>

Subject: Re: About data file size and on-disk size



Just a follow-up note here: if you did end up cherry-picking that change, you should also
be sure to cherry-pick faa587c639aa9e5dcf3fac04259f46ba1921140a to avoid a potential data
loss bug.

On Wed, Nov 30, 2016 at 9:00 AM, Adar Dembo <adar@cloudera.com> wrote:
If you're comfortable rebuilding Kudu from source, you can apply https://gerrit.cloudera.org/#/c/5254,
rebuild the tserver, and restart it. Once the tserver is done restarting, it should trim the
empty space off of the ends of all of your container data files.

Otherwise, you'll have to wait until the next Kudu release.
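
For reference, a hedged sketch of what applying that looks like; the patchset ref below follows gerrit's refs/changes/<NN>/<change>/<patchset> convention, but the patchset number is a guess, so check the review page for the exact ref:

$ git fetch https://gerrit.cloudera.org/kudu refs/changes/54/5254/1
$ git cherry-pick FETCH_HEAD
$ # per Todd's follow-up above, also pick the data-loss fix
$ git cherry-pick faa587c639aa9e5dcf3fac04259f46ba1921140a

After rebuilding and restarting the tserver, the apparent size of trimmed containers should shrink back toward their real disk usage:

$ du -m --apparent-size /data/kudu/tserver/data/*.data | sort -n | tail -3
$ du -m /data/kudu/tserver/data/*.data | sort -n | tail -3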


On Tue, Nov 29, 2016 at 5:48 PM, 阿香 <1654407779@qq.com> wrote:


Hi Todd, 


Thanks. 
From the results, I think you've successfully identified the bug.
By the way, can I reclaim the wasted disk space?





# du -sm 542d51e55d524034a5274600c31abd11.data
29	542d51e55d524034a5274600c31abd11.data


# filefrag -v -b 542d51e55d524034a5274600c31abd11.data


filefrag: -b needs a blocksize option, assuming 1024-byte blocks.
Filesystem type is: ef53
File size of 542d51e55d524034a5274600c31abd11.data is 10767867904 (10515496 blocks of 1024
bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0: 10486144..10497543:  278086588.. 278097987:  11400:             unwritten
   1: 10497544..10514191:  278691588.. 278708235:  16648:  278097988: unwritten
   2: 10514192..10514199:  279581160.. 279581167:      8:  278708236: unwritten
   3: 10514200..10514203:  280291284.. 280291287:      4:  279581168: unwritten
   4: 10514204..10514227:  280652252.. 280652275:     24:  280291288: unwritten
   5: 10514228..10515259:  281289216.. 281290247:   1032:  280652276: unwritten
   6: 10515260..10515263:  282068816.. 282068819:      4:  281290248: unwritten
   7: 10515264..10515495:  283429184.. 283429415:    232:  282068820: unwritten,eof
542d51e55d524034a5274600c31abd11.data: 8 extents found


# echo $[11400 + 16648 + 1032 + 232]
29312
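
Those extent lengths are in 1024-byte blocks, so the allocated extents account for roughly 29 MB, which matches the du output above; the remaining ~10 GB of the file's apparent size is holes. GNU stat shows the same mismatch directly (a sketch):

$ stat -c 'allocated: %b blocks of %B bytes, apparent: %s bytes' 542d51e55d524034a5274600c31abd11.data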


# ls -l 542d51e55d524034a5274600c31abd11.data
-rw-r--r-- 1 kudu kudu 10767867904 Oct 26 06:51 542d51e55d524034a5274600c31abd11.data


# ls -lh 542d51e55d524034a5274600c31abd11.data
-rw-r--r-- 1 kudu kudu 11G Oct 26 06:51 542d51e55d524034a5274600c31abd11.data



BR
-GU


------------------ Original Message ------------------
From: "Todd Lipcon" <todd@cloudera.com>
Sent: Tuesday, November 29, 2016, 4:15 AM
To: "user" <user@kudu.apache.org>

Subject: Re: About data file size and on-disk size





Hi Xiang,

Adar and I did some investigation and came up with a likely cause: https://issues.apache.org/jira/browse/KUDU-1764


Can you please try the following on one of your .data files? (Preferably one
whose modification time is a few weeks old.)


$ du -sm abcdef.data
$ filefrag -v -b abcdef.data
$ ls -l abcdef.data


We can use this to confirm whether you are hitting the same bug we just discovered.


Thanks
-Todd


On Thu, Nov 24, 2016 at 6:57 AM, 阿香 <1654407779@qq.com> wrote:


> If the workload doesn't involve normal (merging) compactions, then UNDOs won't be GCed
at all. So, if you have a relatively static set of keys, and are just updating them without
causing many new inserts, this could be the problem.


The keys are not relatively static; new keys are being inserted all the time.
The table's key is a UUID string with hash partitioning (16 buckets).
Currently there are about 1,000,000,000 rows in this cluster.


Will these big data files increase the latency of upsert operations?


I saw metrics like the following in the Kudu web UI:

{
    "name": "write_op_duration_client_propagated_consistency",
    "total_count": 8568729,
    "min": 116,
    "mean": 2499.56,
    "percentile_75": 2176,
    "percentile_95": 7680,
    "percentile_99": 29568,
    "percentile_99_9": 78336,
    "percentile_99_99": 123904,
    "max": 1562967,
    "total_sum": 21418050385
}
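
For reference, a hedged sketch of pulling that histogram straight from the tserver metrics endpoint (the host is an assumption; 8050 is the default tserver web UI port):

$ curl -s 'http://<tserver-host>:8050/metrics' | \
    jq '.[].metrics[]? | select(.name == "write_op_duration_client_propagated_consistency")'

If these durations are in microseconds, as Kudu histogram metrics typically are, that is a mean write of about 2.5 ms with a 99th percentile around 30 ms.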






------------------ Original Message ------------------
From: "Todd Lipcon" <todd@cloudera.com>
Sent: Thursday, November 24, 2016, 11:55 AM
To: "user" <user@kudu.apache.org>

Subject: Re: About data file size and on-disk size



On Wed, Nov 23, 2016 at 2:30 PM, Adar Dembo <adar@cloudera.com> wrote:
The difference between du with --apparent-size and without suggests
 that hole punching is working properly. Quick back of the envelope
 math shows that with 8133 containers, each container is just over 10G
 of "apparent size", which means nearly all of the containers were full
 at one point or another. That makes sense; it means that Kudu is
 generally writing to a small number of containers at any given time,
 but is filling them up over time.
 
 I took a look at the tablet disk estimation code and found that it
 excludes the size of all of the UNDO data blocks. I think this is
 because the size estimation is also used to drive decisions regarding
 delta compaction, but with an UPSERT-only workload like yours, we'd
 expect to see many UNDO data blocks over time as updated (and now
 historical) data is further and further compacted. I filed
 https://issues.apache.org/jira/browse/KUDU-1755 to track these issues.
 However, if this were the case, I'd expect the "tablet history GC"
 feature (new in Kudu 1.0) to remove old data that was mutated in an
 UPSERT. The default value for --tablet_history_max_age_sec (which
 controls how old the data must be before it is removed) is 15 minutes;
 have you changed the value of this flag? If not, could you look at
 your tserver log for the presence of major delta compactions? Look for
 references to MajorDeltaCompactionOp. If there aren't any, that means
 Kudu isn't getting opportunities to age out old data.
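
A hedged way to check for those ops and for the flag's current value (the log path, port, and /varz layout are assumptions; adjust for your install):

$ grep -c MajorDeltaCompactionOp /var/log/kudu/kudu-tserver.*INFO*
$ curl -s 'http://<tserver-host>:8050/varz' | grep tablet_history_max_age_sec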


Worth noting that major delta compaction doesn't actually remove old UNDOs. There are still
some open JIRAs about scheduling tasks to age off UNDOs, but as it stands today, they only
get collected during a normal compaction.


If the workload doesn't involve normal (merging) compactions, then UNDOs won't be GCed at
all. So, if you have a relatively static set of keys, and are just updating them without causing
many new inserts, this could be the problem.
 
 
 It's also possible that simply not accounting for the composite index
 and bloom blocks (see KUDU-1755) is the reason. Take a look at
 https://issues.apache.org/jira/browse/KUDU-624?focusedCommentId=15165054&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15165054
 and run the same two commands to compare the total on-disk size of all
 the .data files to the number of bytes that the tserver is aware of.
 If the two numbers are close, it's a sign that, at the very least,
 Kudu is aware of and actively managing all that disk space (i.e.
 there's no "orphaned" data).
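
A rough sketch of that comparison (the metric name and port are assumptions based on the linked comment; adjust the data path for your layout):

$ find /data/kudu/tserver/data -name '*.data' -print0 | xargs -0 du -cm | tail -1
$ curl -s 'http://<tserver-host>:8050/metrics' | grep -A1 bytes_under_management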


-Todd
 
 
 
 
 On Wed, Nov 23, 2016 at 12:39 AM, 阿香 <1654407779@qq.com> wrote:
 > Hi,
 >
 >> Can you tell us a little bit more about your table, as well as any deleted
 >> tables you once had? How many columns did they have?
 >
 > I have not deleted any tables before.
 > There is only one table, with 12 columns (string and int), in the Kudu cluster.
 > This cluster has three tablet servers.
 >
 > I use the upsert operation to insert and update rows.
 >
 >> what version of Kudu are you using?
 >
 > kudu -version
 > kudu 1.0.0
 > revision 6f6e49ca98c3e3be7d81f88ab8a0f9173959b191
 > build type RELEASE
 > built by jenkins at 16 Sep 2016 00:23:10 PST on
 > impala-ec2-pkg-centos-7-0dc0.vpc.cloudera.com
 > build id 2016-09-16_00-03-04
 >
 >> It's conceivable that there's a pathological case wherein each of the 8133
 >> data files is used, one at a time, to store data blocks, which would cause
 >> each to allocate 32 MB of disk space (totaling about 254G).
 >
 > Can the number of data files be decreased? The SSD disk is almost out of
 > space now.
 >
 >> Can you try running du with --apparent-size and compare the results?
 >
 > # du -sh /data/kudu/tserver/data/
 > 213G /data/kudu/tserver/data/
 > # du -sh --apparent-size  /data/kudu/tserver/data/
 > 81T /data/kudu/tserver/data/
 >
 >> What filesystem is being used for /data/kudu/tserver/data?
 >
 > # file -s /dev/vdb1
 > /dev/vdb1: Linux rev 1.0 ext4 filesystem data,
 > UUID=9f95ba79-f387-42be-a43f-d1421c83e2e5 (needs journal recovery) (extents)
 > (64bit) (large files) (huge files)
 >
 >
 > Thanks.
 >
 >
 > ------------------ Original Message ------------------
 > From: "Adar Dembo" <adar@cloudera.com>
 > Sent: Wednesday, November 23, 2016, 9:35 AM
 > To: "user" <user@kudu.apache.org>
 > Subject: Re: About data file size and on-disk size
 >
 > Also, if you haven't explicitly disabled it, each .data file is going
 > to preallocate 32 MB of data when used. It's conceivable that there's
 > a pathological case wherein each of the 8133 data files is used, one
 > at a time, to store data blocks, which would cause each to allocate 32
 > MB of disk space (totaling about 254G).
 >
 > Can you tell us a little bit more about your table, as well as any
 > deleted tables you once had? How many columns did they have? Also,
 > what version of Kudu are you using?
 >
 > On Tue, Nov 22, 2016 at 11:39 AM, Adar Dembo <adar@cloudera.com> wrote:
 >> The files in /data/kudu/tserver/data are supposed to be sparse; that
 >> is, when Kudu decides to delete data, it'll punch a hole in one of
 >> those files, allowing the filesystem to reclaim the space in that
 >> hole. Yet, 'du' should reflect that because it measures real space
 >> usage. Can you try running du with --apparent-size and compare the
 >> results? If they're the same or similar, it suggests that the hole
 >> punching behavior isn't working properly. What distribution are you
 >> using? What filesystem is being used for /data/kudu/tserver/data?
 >>
 >> You should also check if maybe Kudu has failed to delete the data
 >> belonging to deleted tables. Has this tserver hosted any tablets
 >> belonging to tables that have since been deleted? Does the tserver log
 >> describe any errors when trying to delete the data belonging to those
 >> tablets?
 >>
 >> On Tue, Nov 22, 2016 at 7:19 AM, 阿香 <1654407779@qq.com> wrote:
 >>> Hi,
 >>>
 >>>
 >>> I have a table with 16 buckets over 3 physical machines. Each tablet has
 >>> only one replica.
 >>>
 >>>
 >>> The Tablets web UI shows that each tablet has around 4.5G on-disk size.
 >>>
 >>> On one machine, there are 8 tablets in total, so the on-disk size should be
 >>> about 4.5*8 = 36G.
 >>>
 >>> However, on the same machine, the disk space actually used is about 211G.
 >>>
 >>>
 >>> # du -sh /data/kudu/tserver/data/
 >>>
 >>> 210G /data/kudu/tserver/data/
 >>>
 >>>
 >>> # find /data/kudu/tserver/data/ -name "*.data" | wc -l
 >>>
 >>> 8133
 >>>
 >>>
 >>>
 >>> What's the difference between data file size and on-disk size?
 >>>
 >>> Can files in /data/kudu/tserver/data/ be compacted, purged, or can some of
 >>> them be deleted?
 >>>
 >>>
 >>> Thanks very much.
 >>>
 >>>
 >>> BR
 >>>
 >>> Brooks
 >>>
 >>>
 >>>
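
Since hole punching comes up throughout this thread, here is a tiny demonstration of the mechanism on ext4 (the filename is arbitrary; requires util-linux's fallocate):

$ fallocate -l 64M sparse-demo.bin            # preallocate 64 MB
$ fallocate -p -o 0 -l 32M sparse-demo.bin    # punch a 32 MB hole
$ du -m --apparent-size sparse-demo.bin       # still reports 64
$ du -m sparse-demo.bin                       # real usage drops to ~32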

--
Todd Lipcon
Software Engineer, Cloudera