hudi-commits mailing list archives

From "lamber-ken (Jira)" <j...@apache.org>
Subject [jira] [Comment Edited] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching
Date Tue, 24 Mar 2020 05:42:00 GMT

    [ https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065030#comment-17065030 ]

lamber-ken edited comment on HUDI-686 at 3/24/20, 5:41 AM:
-----------------------------------------------------------

Right, this is a nice design. Some thoughts:
 * If the input data is large, the number of partitions needs to be increased, because "candidates" holds all the data for each partition.
 * If the number of partitions is increased, the same partition will be loaded more than once (e.g. by populateFileIDs() and populateRangeAndBloomFilters()).

[https://github.com/vinothchandar/incubator-hudi/blob/hudi-686-bloomindex-v2/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexV2.java]
{code:java}
@Override
public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD,
                                            JavaSparkContext jsc,
                                            HoodieTable<T> hoodieTable) {
  return recordRDD
      .sortBy((record) -> String.format("%s-%s", record.getPartitionPath(), record.getRecordKey()),
          true, config.getBloomIndexV2Parallelism())
      .mapPartitions((itr) -> new LazyRangeBloomChecker(itr, hoodieTable))
      .flatMap(List::iterator)
      .sortBy(Pair::getRight, true, config.getBloomIndexV2Parallelism())
      .mapPartitions((itr) -> new LazyKeyChecker(itr, hoodieTable))
      .filter(Option::isPresent)
      .map(Option::get);
}
{code}
{code:java}
private void initIfNeeded(String partitionPath) throws IOException {
  if (!Objects.equals(partitionPath, currentPartitionPath)) {
    cleanup();
    this.currentPartitionPath = partitionPath;
    populateFileIDs();
    populateRangeAndBloomFilters();
  }
}
{code}
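
To make the second point concrete, here is a rough stand-alone sketch (not Hudi code). It assumes the sorted records are cut into equal-sized contiguous slices, one per Spark partition, and that each slice starts with an empty currentPartitionPath, as in initIfNeeded() above. The class PartitionLoadSketch and the method countLoads are hypothetical names used only for illustration, and the real split produced by sortBy may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical model for illustration only; it does not use any Hudi or Spark classes.
class PartitionLoadSketch {

  // Counts how often per-partition metadata would be (re)loaded when a sorted list of
  // table partition paths (one entry per record) is split into numSlices contiguous
  // slices, each handled by its own checker with initIfNeeded-style logic.
  static int countLoads(List<String> sortedPartitionPaths, int numSlices) {
    int loads = 0;
    int n = sortedPartitionPaths.size();
    for (int s = 0; s < numSlices; s++) {
      // Assumption: equal-sized contiguous ranges, one per Spark partition.
      int from = (int) ((long) s * n / numSlices);
      int to = (int) ((long) (s + 1) * n / numSlices);
      String current = null; // every slice starts without a cached partition
      for (int i = from; i < to; i++) {
        String path = sortedPartitionPaths.get(i);
        if (!Objects.equals(path, current)) {
          current = path;
          loads++; // stands in for populateFileIDs() + populateRangeAndBloomFilters()
        }
      }
    }
    return loads;
  }

  public static void main(String[] args) {
    List<String> paths = new ArrayList<>();
    for (int i = 0; i < 6; i++) paths.add("2020/03/01");
    for (int i = 0; i < 6; i++) paths.add("2020/03/02");
    System.out.println(countLoads(paths, 1)); // 2
    System.out.println(countLoads(paths, 2)); // 2
    System.out.println(countLoads(paths, 4)); // 4
  }
}
```

Under these assumptions, two table partitions are loaded twice in total with one or two slices, but four times with four slices, because each table partition then spans two slices and each slice loads it again.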


was (Author: lamber-ken):
right, this is a nice design, some thoughts:
 * if the input data is large, need to increase partitions, "candidates" contains all partition datas
 * if increase partitions, it will cause duplicate loading of the same partition(e.g populateFileIDs() && populateRangeAndBloomFilters())


> Implement BloomIndexV2 that does not depend on memory caching
> -------------------------------------------------------------
>
>                 Key: HUDI-686
>                 URL: https://issues.apache.org/jira/browse/HUDI-686
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Index, Performance
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>             Fix For: 0.6.0
>
>         Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, image-2020-03-19-10-17-43-048.png
>
>
> The main goal here is to provide a much simpler index, without advanced optimizations like auto-tuned parallelism/skew handling, but with a better out-of-the-box experience for small workloads.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
