hudi-commits mailing list archives

From "Vinoth Chandar (Jira)" <j...@apache.org>
Subject [jira] [Updated] (HUDI-335) Improvements to DiskBasedMap
Date Wed, 08 Jan 2020 05:16:00 GMT

     [ https://issues.apache.org/jira/browse/HUDI-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-335:
--------------------------------
    Fix Version/s:     (was: 0.5.1)
                   0.5.2

> Improvements to DiskBasedMap
> ----------------------------
>
>                 Key: HUDI-335
>                 URL: https://issues.apache.org/jira/browse/HUDI-335
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Common Core
>            Reporter: Balajee Nagasubramaniam
>            Priority: Major
>              Labels: Hoodie, pull-request-available
>             Fix For: 0.5.2
>
>         Attachments: Screen Shot 2019-11-11 at 1.22.44 PM.png, Screen Shot 2019-11-13 at 2.56.53 PM.png
>
>   Original Estimate: 504h
>          Time Spent: 20m
>  Remaining Estimate: 503h 40m
>
> DiskBasedMap is used by ExternalSpillableMap to write each (K, V) pair to a file while
> keeping only (K, fileMetadata) in memory, to reduce the footprint of the record on disk.
> This change improves the performance of record get/read operations from disk by using
> a BufferedInputStream to cache the data.
> Results from the POC are promising. Before the write performance improvement, spilling/writing
> 1 million records (record size ~350 bytes) to the file took about 104 seconds.
> After the improvement, the same operation completes in under 5 seconds.
> Similarly, before the read performance improvement, reading 1 million records (size ~350
> bytes) from the spill file took about 23 seconds. After the improvement, the same operation
> completes in under 4 seconds.
> {{Without read/write performance improvements (times in ms):
> RecordsHandled:   10000  totalTestTime:   3145  writeTime:   1176  readTime:   255
> RecordsHandled:   50000  totalTestTime:   5775  writeTime:   4187  readTime:  1175
> RecordsHandled:  100000  totalTestTime:  10570  writeTime:   7718  readTime:  2203
> RecordsHandled:  500000  totalTestTime:  59723  writeTime:  45618  readTime: 11093
> RecordsHandled: 1000000  totalTestTime: 120022  writeTime:  87918  readTime: 22355
> RecordsHandled: 2000000  totalTestTime: 258627  writeTime: 187185  readTime: 56431}}
> {{With write improvement (times in ms):
> RecordsHandled:   10000  totalTestTime:   2013  writeTime:    700  readTime:   503
> RecordsHandled:   50000  totalTestTime:   2525  writeTime:    390  readTime:  1247
> RecordsHandled:  100000  totalTestTime:   3583  writeTime:    464  readTime:  2352
> RecordsHandled:  500000  totalTestTime:  22934  writeTime:   3731  readTime: 15778
> RecordsHandled: 1000000  totalTestTime:  42415  writeTime:   4816  readTime: 30332
> RecordsHandled: 2000000  totalTestTime:  74158  writeTime:  10192  readTime: 53195}}
> {{With read improvements (times in ms):
> RecordsHandled:   10000  totalTestTime:   2473  writeTime:   1562  readTime:    87
> RecordsHandled:   50000  totalTestTime:   6169  writeTime:   5151  readTime:   438
> RecordsHandled:  100000  totalTestTime:   9967  writeTime:   8636  readTime:   252
> RecordsHandled:  500000  totalTestTime:  50889  writeTime:  46766  readTime:  1014
> RecordsHandled: 1000000  totalTestTime: 114482  writeTime: 104353  readTime:  3776
> RecordsHandled: 2000000  totalTestTime: 239251  writeTime: 219041  readTime:  8127}}
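
For readers unfamiliar with the pattern described in the issue, here is a minimal sketch of a spill file that writes values to disk and keeps only (key, location) pairs in memory. It is not Hudi's DiskBasedMap code: the class SpillFileSketch, the ValueLocation type standing in for "fileMetadata", and the use of a BufferedOutputStream on the write path are all assumptions made for illustration. Only the use of a BufferedInputStream for reads comes from the issue text.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the actual Hudi implementation.
class SpillFileSketch implements AutoCloseable {

    // Stand-in for the per-key "fileMetadata": where the value lives in the file.
    private static final class ValueLocation {
        final long offset;
        final int length;

        ValueLocation(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    private final File spillFile;
    private final DataOutputStream out;
    // Only keys and locations stay in memory; values live in the spill file.
    private final Map<String, ValueLocation> index = new HashMap<>();
    private long writePosition = 0;

    SpillFileSketch(File spillFile) throws IOException {
        this.spillFile = spillFile;
        // Assumption: buffer writes so each small record does not hit the disk separately.
        this.out = new DataOutputStream(
            new BufferedOutputStream(new FileOutputStream(spillFile)));
    }

    // Appends the value to the file and remembers its offset and length.
    void put(String key, String value) throws IOException {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        out.write(bytes);
        index.put(key, new ValueLocation(writePosition, bytes.length));
        writePosition += bytes.length;
    }

    // Reads the value back through a BufferedInputStream; returns null if absent.
    String get(String key) throws IOException {
        ValueLocation loc = index.get(key);
        if (loc == null) {
            return null;
        }
        out.flush(); // make sure buffered writes have reached the file
        try (FileInputStream fis = new FileInputStream(spillFile)) {
            // Position the underlying stream before any buffered read happens.
            fis.getChannel().position(loc.offset);
            DataInputStream in = new DataInputStream(new BufferedInputStream(fis));
            byte[] bytes = new byte[loc.length];
            in.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    @Override
    public void close() throws IOException {
        out.close();
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("spill", ".data");
        f.deleteOnExit();
        try (SpillFileSketch map = new SpillFileSketch(f)) {
            map.put("k1", "first record");
            map.put("k2", "second record");
            System.out.println(map.get("k2"));
        }
    }
}
```

This sketch opens a new stream on every get, which keeps it short but gives up most of the benefit of buffering. A real implementation would presumably reuse a buffered reader across lookups.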



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
