hive-issues mailing list archives

From "Prasanth Jayachandran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-17220) Bloomfilter probing in semijoin reduction is thrashing L1 dcache
Date Mon, 31 Jul 2017 22:49:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-17220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108110#comment-16108110 ]

Prasanth Jayachandran commented on HIVE-17220:
----------------------------------------------

Although bloom-1 is fast in microbenchmarks (2-5x faster, as there is only 1 memory access
per probe), there is around a 2% increase in fpp. This will let more rows pass through the
bloom filter, negating the performance gain. An alternative approach is to increase the
stride size for hash mapping to more than 1 long. Will update the patch shortly with a
bloom-k implementation.
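For illustration, a cache-line-blocked layout along those lines could be sketched as below. This is a hypothetical sketch, not the attached patch: the class name, hash mixing, and sizing are my assumptions. The idea is that all k probe bits for a key fall inside a single 64-byte block of the bitset, so a probe touches at most one cache line instead of up to k scattered ones, at the cost of a slightly higher fpp.

```java
// Sketch of a cache-line-blocked bloom filter (illustrative only):
// each key's k bits land in one 64-byte block, so a probe touches
// a single cache line instead of up to k scattered ones.
class BlockedBloomFilter {
    private static final int BLOCK_LONGS = 8;  // 8 longs * 8 bytes = one 64-byte cache line
    private final long[] bits;
    private final int numBlocks;
    private final int k;                       // hash functions per key

    BlockedBloomFilter(int numBlocks, int k) {
        this.numBlocks = numBlocks;
        this.k = k;
        this.bits = new long[numBlocks * BLOCK_LONGS];
    }

    // splitmix64-style finalizer; the real patch may hash differently
    private static long mix(long x) {
        x ^= x >>> 33; x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33; x *= 0xc4ceb9fe1a85ec53L;
        x ^= x >>> 33; return x;
    }

    void add(long key) {
        long h = mix(key);
        int base = (int) Long.remainderUnsigned(h, numBlocks) * BLOCK_LONGS;
        for (int i = 0; i < k; i++) {
            h = mix(h + i);  // derive the i-th bit position inside the block
            bits[base + (int) ((h >>> 6) & (BLOCK_LONGS - 1))] |= 1L << (h & 63);
        }
    }

    boolean mightContain(long key) {
        long h = mix(key);
        int base = (int) Long.remainderUnsigned(h, numBlocks) * BLOCK_LONGS;
        for (int i = 0; i < k; i++) {
            h = mix(h + i);
            if ((bits[base + (int) ((h >>> 6) & (BLOCK_LONGS - 1))] & (1L << (h & 63))) == 0) {
                return false;  // definitely not present
            }
        }
        return true;           // possibly present
    }
}
```

With BLOCK_LONGS > 1 this is effectively a bloom-k stride as described above; setting it to 1 degenerates to the bloom-1 single-long variant.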

> Bloomfilter probing in semijoin reduction is thrashing L1 dcache
> ----------------------------------------------------------------
>
>                 Key: HIVE-17220
>                 URL: https://issues.apache.org/jira/browse/HIVE-17220
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>         Attachments: HIVE-17220.WIP.patch
>
>
> [~gopalv] observed perf profiles showing bloomfilter probes as the bottleneck for some of
the TPC-DS queries, resulting in L1 data cache thrashing.
> This is because the huge bitset in the bloom filter doesn't fit in any level of cache, and
the hash bits corresponding to a single key map to different segments of the bitset, spread
out across it. This can result in K-1 memory accesses (K being the number of hash functions)
in the worst case for every key that gets probed, because of locality misses in the L1 cache.
> Ran a JMH microbenchmark to verify the same. Following is the JMH perf profile for bloom filter probing:
> {code}
> Perf stats:
> --------------------------------------------------
>        5101.935637      task-clock (msec)         #    0.461 CPUs utilized
>                346      context-switches          #    0.068 K/sec
>                336      cpu-migrations            #    0.066 K/sec
>              6,207      page-faults               #    0.001 M/sec
>     10,016,486,301      cycles                    #    1.963 GHz                     (26.90%)
>      5,751,692,176      stalled-cycles-frontend   #   57.42% frontend cycles idle    (27.05%)
>    <not supported>      stalled-cycles-backend
>     14,359,914,397      instructions              #    1.43  insns per cycle
>                                                   #    0.40  stalled cycles per insn (33.78%)
>      2,200,632,861      branches                  #  431.333 M/sec                   (33.84%)
>          1,162,860      branch-misses             #    0.05% of all branches         (33.97%)
>      1,025,992,254      L1-dcache-loads           #  201.099 M/sec                   (26.56%)
>        432,663,098      L1-dcache-load-misses     #   42.17% of all L1-dcache hits   (14.49%)
>        331,383,297      LLC-loads                 #   64.952 M/sec                   (14.47%)
>            203,524      LLC-load-misses           #    0.06% of all LL-cache hits    (21.67%)
>    <not supported>      L1-icache-loads
>          1,633,821      L1-icache-load-misses     #    0.320 M/sec                   (28.85%)
>        950,368,796      dTLB-loads                #  186.276 M/sec                   (28.61%)
>        246,813,393      dTLB-load-misses          #   25.97% of all dTLB cache hits  (14.53%)
>             25,451      iTLB-loads                #    0.005 M/sec                   (14.48%)
>             35,415      iTLB-load-misses          #  139.15% of all iTLB cache hits  (21.73%)
>    <not supported>      L1-dcache-prefetches
>            175,958      L1-dcache-prefetch-misses #    0.034 M/sec                   (28.94%)
>       11.064783140 seconds time elapsed
> {code}
> This shows a 42.17% L1 data cache miss rate.
> This jira is to use a cache-efficient bloom filter for semijoin probing.
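For context on the fpp trade-off discussed in the comment above: the false positive probability of a classic (unblocked) bloom filter is approximately (1 - e^(-kn/m))^k for m bits, n keys, and k hash functions. A rough back-of-envelope calculation (the numbers below are illustrative, not Hive's actual sizing):

```java
// Illustrative fpp estimate for a classic bloom filter:
// fpp ~= (1 - e^(-k*n/m))^k, with m bits, n keys, k hash functions.
double m = 10_000_000;  // bits in the filter
double n = 1_000_000;   // keys inserted (10 bits per key)
int k = 4;              // hash functions
double fpp = Math.pow(1.0 - Math.exp(-k * n / m), k);
System.out.printf("estimated fpp = %.4f%n", fpp);  // ~0.0118, i.e. about 1.2%
// Blocked variants pay an fpp penalty on top of this figure, because
// keys are not spread evenly across blocks; that is the fpp increase
// the comment above weighs against the single-cache-line probe.
```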



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
