hudi-commits mailing list archives

From "Vinoth Chandar (Jira)" <j...@apache.org>
Subject [jira] [Commented] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching
Date Thu, 19 Mar 2020 21:14:00 GMT

    [ https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062940#comment-17062940 ]

Vinoth Chandar commented on HUDI-686:
-------------------------------------

Timing the individual stages:

Roughly, here is how it looks. I am not sure how much further we can optimize this,
since most of the time is spent inside parquet reading.

The metadata read cost, i.e. reading the footers, dominates the first stage:
{code:java}
System.err.format("LazyRangeBloomChecker: %d, %d, %d, %d, %d, %d \n",
    totalCount, totalMatches, totalTimeNs,
    totalMetadataReadTimeNs, totalRangeCheckTimeNs, totalBloomCheckTimeNs);

LazyRangeBloomChecker: 18632, 5068, 481673381, 439685698, 5872344, 26426499 
LazyRangeBloomChecker: 29312, 0, 397373925, 361189515, 12336753, 3205152 
LazyRangeBloomChecker: 36422, 0, 395838972, 364965143, 6870027, 3088563 
LazyRangeBloomChecker: 32698, 21252, 502987672, 374374961, 15190330, 94478078 
LazyRangeBloomChecker: 36633, 0, 420441840, 388992165, 7971801, 5222196 
LazyRangeBloomChecker: 35919, 35919, 547982738, 382770288, 17042127, 130529090 
LazyRangeBloomChecker: 26448, 26448, 673972735, 497887634, 12918131, 150188682 
LazyRangeBloomChecker: 29827, 25338, 739789660, 568953445, 14633164, 140977007 
LazyRangeBloomChecker: 40694, 40694, 611867636, 364297491, 20609305, 206717514 
LazyRangeBloomChecker: 41515, 41515, 754657982, 379440879, 18670251, 337857948 
LazyRangeBloomChecker: 46672, 46672, 761187684, 364060859, 18887398, 359483525 
LazyRangeBloomChecker: 26931, 2360, 296764733, 275044606, 3439711, 11417543 
LazyRangeBloomChecker: 41863, 20714, 831527864, 656157121, 13784027, 143710665 
LazyRangeBloomChecker: 36429, 0, 181597122, 157965082, 5342164, 3072219 
LazyRangeBloomChecker: 45618, 0, 180005379, 154248797, 6254112, 3332647 
LazyRangeBloomChecker: 60916, 60916, 730395000, 244153313, 24926738, 439724359 
 {code}
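To check how the time splits per task, the logged lines can be turned into fractions of totalTimeNs. The following is only a sketch: the class RangeBloomTimingShare and its shares method are hypothetical helpers and not part of Hudi, and the field order is assumed to follow the format string above. For the first logged line, the metadata read works out to roughly 91% of the total.

```java
import java.util.Locale;

/** Hypothetical helper (not part of Hudi) for reading the "LazyRangeBloomChecker:" lines above. */
final class RangeBloomTimingShare {

    /**
     * Returns {metadataShare, rangeCheckShare, bloomCheckShare} as fractions of totalTimeNs.
     * Assumed field order, taken from the format string: totalCount, totalMatches,
     * totalTimeNs, totalMetadataReadTimeNs, totalRangeCheckTimeNs, totalBloomCheckTimeNs.
     */
    static double[] shares(String logLine) {
        // Drop the "LazyRangeBloomChecker:" prefix, then split the comma-separated counters.
        String[] fields = logLine.substring(logLine.indexOf(':') + 1).trim().split("\\s*,\\s*");
        double totalNs = Long.parseLong(fields[2]);
        return new double[] {
            Long.parseLong(fields[3]) / totalNs,
            Long.parseLong(fields[4]) / totalNs,
            Long.parseLong(fields[5]) / totalNs
        };
    }

    public static void main(String[] args) {
        // Two lines copied from the run above: one with few matches, one where every key matched.
        String[] samples = {
            "LazyRangeBloomChecker: 18632, 5068, 481673381, 439685698, 5872344, 26426499 ",
            "LazyRangeBloomChecker: 60916, 60916, 730395000, 244153313, 24926738, 439724359 "
        };
        for (String sample : samples) {
            double[] r = shares(sample);
            System.out.printf(Locale.ROOT, "metadata=%.1f%% range=%.1f%% bloom=%.1f%%%n",
                r[0] * 100, r[1] * 100, r[2] * 100);
        }
    }
}
```

Running the same calculation over every line shows where the bloom check starts to take a visible share, which in this run happens on the tasks with many matches.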
Reading the actual keys themselves dominates the second stage:
{code:java}
System.err.println("LazyKeyChecker: " + totalTimeNs + "," + totalCount + "," + totalReadTimeNs);

LazyKeyChecker: 32576530,2119,30998522
LazyKeyChecker: 39189497,3415,36666074
LazyKeyChecker: 36683534,3726,33878272
LazyKeyChecker: 293554458,38523,264821882
LazyKeyChecker: 297414709,39263,268215304
LazyKeyChecker: 212946950,65474,169525572
LazyKeyChecker: 1047598045,65998,1003946915
LazyKeyChecker: 1048062757,66734,1003969635
LazyKeyChecker: 1041348181,74948,992863777
{code}
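For reference, counters like totalTimeNs and totalReadTimeNs can be accumulated by wrapping each phase in System.nanoTime() calls. This is a minimal sketch of that pattern, not the actual Hudi code: the class TimedKeyCheckSketch is hypothetical, and the Supplier stands in for whatever reads the keys from the parquet file.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical stand-in for a key checker with timing counters; not the Hudi class. */
final class TimedKeyCheckSketch {
    long totalTimeNs;      // wall time of the whole check, including the read
    long totalReadTimeNs;  // time spent loading the keys (stand-in for the parquet read)
    long totalCount;       // number of candidate keys checked

    /** Returns how many candidates are present in the key set produced by keyReader. */
    long check(List<String> candidates, Supplier<Set<String>> keyReader) {
        long start = System.nanoTime();

        // Time only the read so it can be compared against the total afterwards.
        long readStart = System.nanoTime();
        Set<String> keysInFile = keyReader.get();
        totalReadTimeNs += System.nanoTime() - readStart;

        long found = 0;
        for (String key : candidates) {
            if (keysInFile.contains(key)) {
                found++;
            }
        }
        totalCount += candidates.size();
        totalTimeNs += System.nanoTime() - start;
        return found;
    }

    /** Same field order as the println above: totalTimeNs, totalCount, totalReadTimeNs. */
    String report() {
        return "LazyKeyChecker: " + totalTimeNs + "," + totalCount + "," + totalReadTimeNs;
    }

    public static void main(String[] args) {
        TimedKeyCheckSketch checker = new TimedKeyCheckSketch();
        // In-memory key set used purely for illustration.
        checker.check(Arrays.asList("a", "b", "z"),
            () -> new HashSet<>(Arrays.asList("a", "b", "c")));
        System.err.println(checker.report());
    }
}
```

Because the read timer is nested inside the total timer, totalReadTimeNs can never exceed totalTimeNs, so the two numbers in each log line are directly comparable.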

> Implement BloomIndexV2 that does not depend on memory caching
> -------------------------------------------------------------
>
>                 Key: HUDI-686
>                 URL: https://issues.apache.org/jira/browse/HUDI-686
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Index, Performance
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>             Fix For: 0.6.0
>
>         Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, image-2020-03-19-10-17-43-048.png
>
>
> The main goal here is to provide a much simpler index, without advanced optimizations like auto-tuned parallelism or skew handling, but with a better out-of-box experience for small workloads.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
