drill-dev mailing list archives

From ppadma <...@git.apache.org>
Subject [GitHub] drill pull request #826: DRILL-5379: Set Hdfs Block Size based on Parquet Bl...
Date Wed, 17 May 2017 01:56:29 GMT
Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/826#discussion_r116895850
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java ---
    @@ -380,14 +384,21 @@ public void endRecord() throws IOException {
     
           // since ParquetFileWriter will overwrite empty output file (append is not supported)
           // we need to re-apply file permission
    -      parquetFileWriter = new ParquetFileWriter(conf, schema, path, ParquetFileWriter.Mode.OVERWRITE);
    +      if (useConfiguredBlockSize) {
    --- End diff --
    
    What we are doing is creating the Parquet file as a single block without changing the file
system default block size. For example, the default Parquet block size is 512MB; if the file
system block size is 128MB, we create a single file with 4 blocks on the file system, which can
get distributed across different nodes. That is not good for performance. If we change the
Parquet block size to 128MB (to match the file system block size), then for the same amount of
data we end up creating 4 files of one block each, which is not good either.
    
    The JIRA asked for a single HDFS block per Parquet file that is larger than the file system
block size, without changing the file system block size. They had the file system block size
configured as 128MB. Lowering the Parquet block size (from the default value of 512MB) to match
the file system block size creates too many files for them. For some other reasons, they are
not able to change the file system block size.
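
The arithmetic behind the three layouts discussed above can be sketched as follows. This is only an illustration, not code from the pull request: the class `BlockLayout` and its helper methods are hypothetical names, and the sketch assumes that each output file holds exactly one full Parquet block, as in the example in the comment.

```java
// Hypothetical helper, for illustration only; not part of Drill or Parquet.
final class BlockLayout {
    static final long MB = 1024L * 1024L;

    // Ceiling division for positive values.
    static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }

    // Number of files written, assuming one Parquet block per file.
    static long fileCount(long dataBytes, long parquetBlockBytes) {
        return ceilDiv(dataBytes, parquetBlockBytes);
    }

    // Number of file system blocks that one full file is split into.
    static long fsBlocksPerFile(long parquetBlockBytes, long fsBlockBytes) {
        return ceilDiv(parquetBlockBytes, fsBlockBytes);
    }

    static String describe(long dataBytes, long parquetBlockBytes, long fsBlockBytes) {
        return fileCount(dataBytes, parquetBlockBytes) + " file(s), "
            + fsBlocksPerFile(parquetBlockBytes, fsBlockBytes) + " block(s) each";
    }

    public static void main(String[] args) {
        long data = 512 * MB; // same amount of data in every scenario

        // Parquet block 512MB, file system block 128MB.
        System.out.println(describe(data, 512 * MB, 128 * MB));

        // Parquet block lowered to 128MB to match the file system block.
        System.out.println(describe(data, 128 * MB, 128 * MB));

        // Block size for this file set equal to the Parquet block size (512MB),
        // leaving the file system default untouched.
        System.out.println(describe(data, 512 * MB, 512 * MB));
    }
}
```

For 512MB of data, the first case gives one file split into 4 blocks, the second gives 4 files of one block each, and the third gives one file in a single block, which is the layout the JIRA asks for.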



