hudi-commits mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [incubator-hudi] bvaradar commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
Date Mon, 02 Mar 2020 18:55:03 GMT
URL: https://github.com/apache/incubator-hudi/pull/1253#issuecomment-593560005
 
 
   @lamber-ken @leesf @nsivabalan : Yes, the additional string conversion is not needed. I refactored slightly so that the correct bloom-filter serialization method is chosen, depending on whether compression is enabled.
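   
   The dispatch described above (picking the serialization path once, based on the compression setting, instead of serializing to a String and converting again) could be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: `BloomFilterWriter`, `serialize`, and `serializeToBytes` are hypothetical names.
   
   ```
   import java.io.ByteArrayOutputStream;
   import java.util.Base64;
   import java.util.zip.GZIPOutputStream;
   
   // Hypothetical sketch: choose the serialization path up front, based on
   // whether compression is enabled, so no extra String round-trip is needed.
   public class BloomFilterWriter {
   
     // stand-in for the real bloom filter's raw byte serialization
     static byte[] serializeToBytes(byte[] filterBits) {
       return filterBits;
     }
   
     static String serialize(byte[] filterBits, boolean compressionEnabled) throws Exception {
       if (!compressionEnabled) {
         // plain path: base64-encode the raw bytes directly
         return Base64.getEncoder().encodeToString(serializeToBytes(filterBits));
       }
       // compressed path: gzip the raw bytes first, then base64-encode
       ByteArrayOutputStream bos = new ByteArrayOutputStream();
       try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
         gzip.write(serializeToBytes(filterBits));
       }
       return Base64.getEncoder().encodeToString(bos.toByteArray());
     }
   
     public static void main(String[] args) throws Exception {
       byte[] bits = new byte[4096]; // all-zero bit array compresses extremely well
       System.out.println("plain=" + serialize(bits, false).length()
           + " compressed=" + serialize(bits, true).length());
     }
   }
   ```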
   
   @lamber-ken : I am observing the same behavior when comparing the compression and non-compression cases. Compression performs progressively worse as bloom-filter utilization (the number of keys stored in the bloom filter) increases. I see that snappy behaves the same way (though it compresses worse than gzip). I will need to investigate this further.
   
   Result 
   
   ```
   test random keys
   original size: 4792548
   compress size (utilization=10%) : 2150956, CompressToOriginal=44
   compress size (utilization=20%) : 3078736, CompressToOriginal=64
   compress size (utilization=30%) : 3638548, CompressToOriginal=75
   compress size (utilization=40%) : 3977508, CompressToOriginal=82
   compress size (utilization=50%) : 4258972, CompressToOriginal=88
   compress size (utilization=60%) : 4490484, CompressToOriginal=93
   compress size (utilization=70%) : 4647776, CompressToOriginal=96
   compress size (utilization=80%) : 4750028, CompressToOriginal=99
   compress size (utilization=90%) : 4794040, CompressToOriginal=100
   
   test sequential keys
   original size: 4792548
   Using Byte[] - compress size (utilization=10%) : 2150852, CompressToOriginal=44
   Using Byte[] - compress size (utilization=20%) : 3078332, CompressToOriginal=64
   Using Byte[] - compress size (utilization=30%) : 3639000, CompressToOriginal=75
   Using Byte[] - compress size (utilization=40%) : 3977764, CompressToOriginal=82
   Using Byte[] - compress size (utilization=50%) : 4258544, CompressToOriginal=88
   Using Byte[] - compress size (utilization=60%) : 4490372, CompressToOriginal=93
   Using Byte[] - compress size (utilization=70%) : 4647832, CompressToOriginal=96
   Using Byte[] - compress size (utilization=80%) : 4749928, CompressToOriginal=99
   Using Byte[] - compress size (utilization=90%) : 4794040, CompressToOriginal=100
   
   Process finished with exit code 0
   
   ```
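   
   A plausible explanation for the trend in the numbers above: as the bloom filter fills up, the fraction of set bits approaches 50%, and a near-random bit array is close to maximum entropy, leaving gzip (or snappy) little redundancy to exploit. A standalone sketch (no Hudi classes; all names here are illustrative) that reproduces the trend on synthetic bit arrays:
   
   ```
   import java.io.ByteArrayOutputStream;
   import java.util.Random;
   import java.util.zip.GZIPOutputStream;
   
   // Illustration: gzip a bit array whose set-bit density grows, mimicking a
   // bloom filter filling up. Higher density means higher entropy, so the
   // compressed size climbs toward the raw size, matching the results above.
   public class BitDensityCompression {
   
     // build a bit array where each bit is set with the given probability
     static byte[] randomBits(int numBits, double density, Random rnd) {
       byte[] bytes = new byte[numBits / 8];
       for (int i = 0; i < numBits; i++) {
         if (rnd.nextDouble() < density) {
           bytes[i / 8] |= (byte) (1 << (i % 8));
         }
       }
       return bytes;
     }
   
     static int gzipSize(byte[] data) throws Exception {
       ByteArrayOutputStream bos = new ByteArrayOutputStream();
       try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
         gzip.write(data);
       }
       return bos.size();
     }
   
     public static void main(String[] args) throws Exception {
       Random rnd = new Random(42);
       int numBits = 1 << 20; // 1 Mbit = 128 KiB raw
       for (int pct = 5; pct <= 50; pct += 5) {
         int compressed = gzipSize(randomBits(numBits, pct / 100.0, rnd));
         System.out.printf("density=%d%% gzip=%d bytes (%.0f%% of raw)%n",
             pct, compressed, 100.0 * compressed / (numBits / 8));
       }
     }
   }
   ```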
   
   Test - Code :
   ```
   import java.util.UUID;
   
   import org.apache.hadoop.util.hash.Hash;
   import org.junit.Test;
   
   // SimpleBloomFilter and GzipCompressionUtils come from this PR's branch.
   
   @Test
   public void testit() {
     int[] utilization = new int[] {10, 20, 30, 40, 50, 60, 70, 80, 90};
   
     System.out.println("test random keys");
     int originalSize = 0;
     for (int i = 0; i < utilization.length; i++) {
       SimpleBloomFilter filter = new SimpleBloomFilter(1000000, 0.000001, Hash.MURMUR_HASH);
       // insert utilization[i]% of the filter's 1M expected entries
       int numKeys = 10000 * utilization[i];
       for (int j = 0; j < numKeys; j++) {
         filter.add(UUID.randomUUID().toString());
       }
   
       if (i == 0) {
         // uncompressed serialized size is fixed by the filter's bit-array size
         originalSize = filter.serializeToString().length();
         System.out.println("original size: " + originalSize);
       }
       int compressedSize = GzipCompressionUtils.compress(filter.serializeToBytes()).length();
       System.out.println("compress size (utilization=" + utilization[i] + "%) : "
           + compressedSize + ", CompressToOriginal=" + (compressedSize * 100 / originalSize));
     }
   
     System.out.println("\ntest sequential keys");
     for (int i = 0; i < utilization.length; i++) {
       SimpleBloomFilter filter = new SimpleBloomFilter(1000000, 0.000001, Hash.MURMUR_HASH);
       int numKeys = 10000 * utilization[i];
       for (int j = 0; j < numKeys; j++) {
         filter.add("key-" + j);
       }
       if (i == 0) {
         originalSize = filter.serializeToString().length();
         System.out.println("original size: " + originalSize);
       }
       int compressedSize = GzipCompressionUtils.compress(filter.serializeToBytes()).length();
       System.out.println("Using Byte[] - compress size (utilization=" + utilization[i] + "%) : "
           + compressedSize + ", CompressToOriginal=" + (compressedSize * 100 / originalSize));
     }
   }
   ```
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
