hudi-commits mailing list archives

From "sivabalan narayanan (Jira)" <j...@apache.org>
Subject [jira] [Updated] (HUDI-432) Benchmark HFile for scan vs seek
Date Sat, 04 Jan 2020 02:50:00 GMT

     [ https://issues.apache.org/jira/browse/HUDI-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-432:
-------------------------------------
    Description: 
We want to benchmark HFile scan against seek, since we intend to use HFile for record indexing.
The HFile will be stored inline in the Hudi log for indexing purposes.

As part of this benchmark, we want to find the point at which scan outperforms seek.

This is our experimental setup:

keysToRead = number of keys to look up; it varies across experiment runs (e.g. 100k, 200k,
500k, 1M).

N = number of iterations

 
{code:java}
// Setup: 1M entries were written to a single HFile as key-value pairs.
// The keys were also stored in a separate file (key_file).
keyList = read all keys from key_file
for N iterations
{
    shuffle keyList
    take the first keysToRead keys from keyList
    start timer
    run HFile read benchmark (scan or seek)
    end timer
}
report the average of all captured timings
{code}
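
The loop above can be sketched as runnable Java. This is only an illustration of the harness structure, not the benchmark code from the linked commit: a `TreeMap` stands in for the sorted HFile, and the class and method names (`ScanVsSeekSketch`, `buildStore`, `scan`, `seek`, `averageNanos`) are hypothetical. Timings from an in-memory map say nothing about real HFile scan or seek performance.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical harness: a TreeMap is an assumed stand-in for the sorted HFile.
class ScanVsSeekSketch {

    // Build a sorted store with zero-padded keys so that key order is stable.
    static TreeMap<String, String> buildStore(int entries) {
        TreeMap<String, String> store = new TreeMap<>();
        for (int i = 0; i < entries; i++) {
            store.put(String.format("key-%07d", i), "value-" + i);
        }
        return store;
    }

    // "Scan": walk every entry in key order and count the requested ones.
    static int scan(TreeMap<String, String> store, List<String> keys) {
        Set<String> wanted = new HashSet<>(keys);
        int found = 0;
        for (Map.Entry<String, String> e : store.entrySet()) {
            if (wanted.contains(e.getKey())) {
                found++;
            }
        }
        return found;
    }

    // "Seek": look up each requested key individually.
    static int seek(TreeMap<String, String> store, List<String> keys) {
        int found = 0;
        for (String k : keys) {
            if (store.get(k) != null) {
                found++;
            }
        }
        return found;
    }

    // Average wall-clock nanoseconds over n iterations: shuffle, take the
    // first keysToRead keys, time one read pass, then average the timings.
    static double averageNanos(TreeMap<String, String> store, List<String> allKeys,
                               int keysToRead, int n, boolean useScan, Random rnd) {
        List<String> keyList = new ArrayList<>(allKeys);
        long total = 0;
        for (int i = 0; i < n; i++) {
            Collections.shuffle(keyList, rnd);
            List<String> sample = keyList.subList(0, keysToRead);
            long start = System.nanoTime();
            int found = useScan ? scan(store, sample) : seek(store, sample);
            total += System.nanoTime() - start;
            if (found != keysToRead) {
                throw new IllegalStateException("some keys were not found");
            }
        }
        return (double) total / n;
    }

    public static void main(String[] args) {
        TreeMap<String, String> store = buildStore(100_000);
        List<String> keys = new ArrayList<>(store.keySet());
        Random rnd = new Random(42);
        for (int keysToRead : new int[] {10_000, 50_000, 100_000}) {
            double scanNs = averageNanos(store, keys, keysToRead, 5, true, rnd);
            double seekNs = averageNanos(store, keys, keysToRead, 5, false, rnd);
            System.out.printf("keysToRead=%d scan=%.0f ns seek=%.0f ns%n",
                    keysToRead, scanNs, seekNs);
        }
    }
}
```

Running `main` prints one line per `keysToRead` value; the store size and iteration count are kept small here for speed and are not the 1M entries used in the experiment.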
 

 

Result:

With optimized configs, scan starts to outperform seek at roughly 350k to 400k lookups out of
1M entries.

  !Screen Shot 2020-01-03 at 6.44.25 PM.png!

Results can be found here: [^HFile benchmark.xlsx]

Source for benchmarking can be found here: 

[https://github.com/nsivabalan/hudi/commit/94bef5ded3d70308e52b98e06b41e2cb999b5301]

> Benchmark HFile for scan vs seek
> --------------------------------
>
>                 Key: HUDI-432
>                 URL: https://issues.apache.org/jira/browse/HUDI-432
>             Project: Apache Hudi (incubating)
>          Issue Type: New Feature
>          Components: Storage Management
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>             Fix For: 0.5.2
>
>         Attachments: HFile benchmark.xlsx, Screen Shot 2020-01-03 at 6.44.25 PM.png
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
