orc-dev mailing list archives

From wgtmac <...@git.apache.org>
Subject [GitHub] orc pull request #304: ORC-397. Allow selective disabling of dictionary enco...
Date Wed, 29 Aug 2018 16:17:41 GMT
Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/304#discussion_r213744153
  
    --- Diff: java/core/src/test/org/apache/orc/TestStringDictionary.java ---
    @@ -409,4 +411,77 @@ public void testTooManyDistinctV11AlwaysDictionary() throws Exception {
     
       }
     
    +  /**
    +   * Test that dictionaries can be disabled, per column. In this test, we want to disable DICTIONARY_V2 for the
    +   * `longString` column (presumably for a low hit-ratio), while preserving DICTIONARY_V2 for `shortString`.
    +   * @throws Exception on unexpected failure
    +   */
    +  @Test
    +  public void testDisableDictionaryForSpecificColumn() throws Exception {
    +    final String SHORT_STRING_VALUE = "foo";
    +    final String  LONG_STRING_VALUE = "BAAAAAAAAR!!";
    +
    +    TypeDescription schema =
    +        TypeDescription.fromString("struct<shortString:string,longString:string>");
    +
    +    Writer writer = OrcFile.createWriter(
    +        testFilePath,
    +        OrcFile.writerOptions(conf).setSchema(schema)
    +            .compress(CompressionKind.NONE)
    +            .bufferSize(10000)
    +            .directEncodingColumns("longString"));
    --- End diff --
    
    That makes sense. I will also port the current dictionary encoding to the C++ writer shortly.
    BTW, we plan to run some tests on a global dictionary that is shared by all stripes in a file. Can we come up with a design for this in ORC V2? I can propose a prototype once we have gathered some experimental results.
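
    To make the global-dictionary idea concrete, here is a rough sketch that counts dictionary entries for one string column when each stripe builds its own dictionary versus when a single dictionary is shared by all stripes. It is a toy model in plain Java, not ORC code: the class `GlobalDictionarySketch`, its methods, and the sample values are hypothetical, and it ignores encoding details such as entry lengths and index widths.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model; none of these names exist in ORC.
public class GlobalDictionarySketch {

    /** Total dictionary entries when every stripe builds its own dictionary. */
    public static int perStripeEntries(List<List<String>> stripes) {
        int total = 0;
        for (List<String> stripe : stripes) {
            // Each stripe stores every distinct value it contains.
            total += new HashSet<>(stripe).size();
        }
        return total;
    }

    /** Dictionary entries when one dictionary is shared by all stripes. */
    public static int globalEntries(List<List<String>> stripes) {
        Set<String> shared = new HashSet<>();
        for (List<String> stripe : stripes) {
            // A value repeated across stripes is stored only once.
            shared.addAll(stripe);
        }
        return shared.size();
    }

    public static void main(String[] args) {
        // Assumed sample data: two stripes with overlapping values.
        List<List<String>> stripes = Arrays.asList(
            Arrays.asList("foo", "bar", "foo"),
            Arrays.asList("foo", "baz", "bar"));
        System.out.println("per-stripe entries: " + perStripeEntries(stripes));
        System.out.println("global entries:     " + globalEntries(stripes));
    }
}
```

    The gap between the two counts grows with the overlap between stripes, so the experiments would presumably need to measure how much real data repeats values across stripe boundaries.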


---
