hudi-commits mailing list archives

From "lamber-ken (Jira)" <j...@apache.org>
Subject [jira] [Commented] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching
Date Thu, 19 Mar 2020 16:58:00 GMT

    [ https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062761#comment-17062761 ]

lamber-ken commented on HUDI-686:
---------------------------------

[~vinoth] thanks for bringing up this new idea. Here are some concerns to consider:

1. +candidates+ may cause OOM. We can increase the number of partitions to solve it, but
that may impact the user experience, because users would need to think about it.
{quote}List<Pair<HoodieRecord<T>, String>> candidates = new ArrayList<>();
{quote}
 2. +fileIDToBloomFilter+ is an external map that spills content to disk, so we need to think
about the serialization / deserialization performance.
{quote}this.fileIDToBloomFilter = new ExternalSpillableMap<>(1000000000L ...)
BloomFilter filter = fileIDToBloomFilter.get(partitionFileIdPair.getRight());
{quote}
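
To make the second concern a bit more concrete, here is a rough sketch of how the round-trip cost could be measured. It is only an illustration under assumptions: a plain java.util.BitSet stands in for the bloom filter, standard Java serialization stands in for whatever the spillable map really uses, and the class name, filter size and iteration count are all made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.BitSet;

// Hypothetical micro-measurement; not part of Hudi.
final class BloomFilterSerdeCost {

  // Builds a bit set as a stand-in for a bloom filter's bit array (assumption).
  static BitSet makeFilterBits(int numBits) {
    BitSet bits = new BitSet(numBits);
    for (int i = 0; i < numBits; i += 7) {
      bits.set(i);
    }
    return bits;
  }

  // Serializes with plain Java serialization (assumption: the real map may use another format).
  static byte[] serialize(BitSet bits) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(bits);
    }
    return bytes.toByteArray();
  }

  static BitSet deserialize(byte[] data) throws IOException, ClassNotFoundException {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
      return (BitSet) in.readObject();
    }
  }

  public static void main(String[] args) throws Exception {
    BitSet filter = makeFilterBits(8 * 1024 * 1024); // about 1 MiB of bits, an arbitrary size
    int lookups = 200;                               // arbitrary number of simulated spilled reads
    long start = System.nanoTime();
    for (int i = 0; i < lookups; i++) {
      BitSet copy = deserialize(serialize(filter));
      if (!copy.equals(filter)) {
        throw new IllegalStateException("round trip changed the filter");
      }
    }
    long elapsedMicros = (System.nanoTime() - start) / 1_000;
    System.out.println("avg round trip: " + (elapsedMicros / lookups) + " us");
  }
}
```

If every lookup of a spilled filter pays a cost like this once per record, it could dominate the tagging time, so it is probably worth measuring with the real filter type before deciding.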
[https://github.com/vinothchandar/incubator-hudi/blob/hudi-686-bloomindex-v2/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexV2.java]
{code:java}
@Override
protected List<Pair<HoodieRecord<T>, String>> computeNext() {
  List<Pair<HoodieRecord<T>, String>> candidates = new ArrayList<>();
  if (inputItr.hasNext()) {
    HoodieRecord<T> record = inputItr.next();
    try {
      initIfNeeded(record.getPartitionPath());
    } catch (IOException e) {
      throw new HoodieIOException(
          "Error reading index metadata for " + record.getPartitionPath(), e);
    }

    indexFileFilter
        .getMatchingFilesAndPartition(record.getPartitionPath(), record.getRecordKey())
        .forEach(partitionFileIdPair -> {
          BloomFilter filter = fileIDToBloomFilter.get(partitionFileIdPair.getRight());
          if (filter.mightContain(record.getRecordKey())) {
            candidates.add(Pair.of(record, partitionFileIdPair.getRight()));
          }
        });

    if (candidates.size() == 0) {
      candidates.add(Pair.of(record, ""));
    }
  }

  return candidates;
}
{code}
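
For the first concern, one possible direction is to hand out matches lazily instead of collecting them into a list first. The sketch below is only an assumption about how that could look: a Predicate<String> stands in for BloomFilter.mightContain, a plain String file ID stands in for the (record, fileId) pair, and LazyCandidates / candidateFileIds are hypothetical names, not anything in the branch.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch; not the code in the hudi-686 branch.
final class LazyCandidates {

  // Returns matching file IDs one at a time; each filter is only tested when the
  // caller advances the iterator, so no intermediate list is built.
  static Iterator<String> candidateFileIds(
      String recordKey, Map<String, Predicate<String>> filtersByFileId) {
    Iterator<String> matches = filtersByFileId.entrySet().stream()
        .filter(entry -> entry.getValue().test(recordKey))
        .map(Map.Entry::getKey)
        .iterator();
    // Mirror the quoted code's fallback: emit one empty file ID when nothing matches.
    return matches.hasNext() ? matches : Collections.singletonList("").iterator();
  }

  public static void main(String[] args) {
    // Toy predicates in place of real bloom filters (assumption).
    Map<String, Predicate<String>> filters = new LinkedHashMap<>();
    filters.put("file-1", key -> key.startsWith("a"));
    filters.put("file-2", key -> key.endsWith("z"));
    candidateFileIds("abcz", filters).forEachRemaining(System.out::println);
  }
}
```

Whether this helps in practice depends on how many files match per record and on how the downstream stage consumes the result, so it should be read as an idea to evaluate rather than a fix.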
 

> Implement BloomIndexV2 that does not depend on memory caching
> -------------------------------------------------------------
>
>                 Key: HUDI-686
>                 URL: https://issues.apache.org/jira/browse/HUDI-686
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Index, Performance
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>             Fix For: 0.6.0
>
>
> The main goal here is to provide a much simpler index, without advanced optimizations like auto-tuned parallelism/skew handling, but with a better out-of-box experience for small workloads.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
