hudi-commits mailing list archives

From "Vinoth Chandar (Jira)" <j...@apache.org>
Subject [jira] [Updated] (HUDI-432) Benchmark HFile for scan vs seek
Date Mon, 09 Mar 2020 07:28:00 GMT

     [ https://issues.apache.org/jira/browse/HUDI-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-432:
--------------------------------
    Fix Version/s:     (was: 0.5.2)
                   0.6.0

> Benchmark HFile for scan vs seek
> --------------------------------
>
>                 Key: HUDI-432
>                 URL: https://issues.apache.org/jira/browse/HUDI-432
>             Project: Apache Hudi (incubating)
>          Issue Type: New Feature
>          Components: Storage Management
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>             Fix For: 0.6.0
>
>         Attachments: HFile benchmark.xlsx, HFile benchmark_withS3.xlsx, Screen Shot 2020-01-03 at 6.44.25 PM.png, Screen Shot 2020-03-09 at 12.22.54 AM.png
>
>
> We want to benchmark HFile scan vs. seek because we intend to use HFile for record indexing. HFile will be used inline in the Hudi log for indexing purposes.
> As part of the benchmark, we want to find the point at which scan outperforms seek.
> This is our experiment setup:
> keysToRead = number of keys to look up (varies across experiment runs: 100k, 200k, 500k, 1M)
> N = number of iterations
>  
> {code:java}
> // Setup: 1M entries were written to a single HFile as key-value pairs.
> // The keys were also stored in a separate file (key_file).
> keyList = read all keys from key_file
> for N iterations
> {
>     shuffle keyList
>     trim the list to keysToRead
>     start timer
>     run HFile read benchmark (scan or seek)
>     end timer
> }
> report the average of all captured timings
> {code}
>  
>  
> Result:
> Scan outperforms seek somewhere between 350k and 400k lookups out of 1M entries with optimized configs.
>   !Screen Shot 2020-01-03 at 6.44.25 PM.png!
> Results can be found here: [^HFile benchmark.xlsx]
> Source for benchmarking can be found here: 
> [https://github.com/nsivabalan/hudi/commit/94bef5ded3d70308e52b98e06b41e2cb999b5301]
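
For readers without access to the linked commit, the experiment loop in the description can be sketched roughly as follows. This is only an illustration under stated assumptions, not the benchmark source: a sorted in-memory TreeMap stands in for the HFile, the class ScanVsSeekSketch and all of its method names are hypothetical, and the entry count is reduced from 1M so that it runs quickly. Timings from this sketch say nothing about HFile itself and will not reproduce the crossover reported in the issue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-in for the HFile benchmark: a sorted in-memory map plays the HFile.
class ScanVsSeekSketch {

    // Builds a sorted store of n key-value pairs (the issue used 1M entries in one HFile).
    public static TreeMap<String, String> buildStore(int n) {
        TreeMap<String, String> store = new TreeMap<>();
        for (int i = 0; i < n; i++) {
            store.put(String.format("key-%07d", i), "value-" + i);
        }
        return store;
    }

    // Seek strategy: one point lookup per requested key. Returns the number of hits.
    public static int seekLookup(TreeMap<String, String> store, List<String> keys) {
        int found = 0;
        for (String key : keys) {
            if (store.get(key) != null) {
                found++;
            }
        }
        return found;
    }

    // Scan strategy: walk every entry in key order and count the requested ones.
    public static int scanLookup(TreeMap<String, String> store, List<String> keys) {
        Set<String> wanted = new HashSet<>(keys);
        int found = 0;
        for (Map.Entry<String, String> entry : store.entrySet()) {
            if (wanted.contains(entry.getKey())) {
                found++;
            }
        }
        return found;
    }

    // Runs the loop from the description: shuffle, trim to keysToRead, time one read pass,
    // then return the average time per iteration in milliseconds.
    public static double averageMillis(TreeMap<String, String> store, List<String> allKeys,
                                       int keysToRead, int iterations, boolean scan, long seed) {
        List<String> keyList = new ArrayList<>(allKeys);
        Random random = new Random(seed);
        long totalNanos = 0L;
        for (int i = 0; i < iterations; i++) {
            Collections.shuffle(keyList, random);
            List<String> sample = keyList.subList(0, keysToRead);
            long start = System.nanoTime();
            int found = scan ? scanLookup(store, sample) : seekLookup(store, sample);
            totalNanos += System.nanoTime() - start;
            if (found != keysToRead) {
                throw new IllegalStateException("expected " + keysToRead + " hits, got " + found);
            }
        }
        return totalNanos / 1e6 / iterations;
    }

    public static void main(String[] args) {
        int entries = 100_000; // assumption: smaller than the 1M entries used in the issue
        TreeMap<String, String> store = buildStore(entries);
        List<String> allKeys = new ArrayList<>(store.keySet());
        for (int keysToRead : new int[] {10_000, 50_000, 100_000}) {
            double seekMs = averageMillis(store, allKeys, keysToRead, 5, false, 42L);
            double scanMs = averageMillis(store, allKeys, keysToRead, 5, true, 42L);
            System.out.printf("keysToRead=%d seek=%.2f ms scan=%.2f ms%n", keysToRead, seekMs, scanMs);
        }
    }
}
```

The shape is the point here: seek cost grows with keysToRead, while scan pays a roughly fixed cost for walking all entries, so the two curves can cross at some lookup count. Where they cross for a real HFile depends on block size, caching, and storage, which this sketch does not model.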



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
