drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5379) Set Hdfs Block Size based on Parquet Block Size
Date Wed, 17 May 2017 01:57:04 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013380#comment-16013380 ]

ASF GitHub Bot commented on DRILL-5379:

Github user ppadma commented on a diff in the pull request:

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java
    @@ -380,14 +384,21 @@ public void endRecord() throws IOException {
           // since ParquetFileWriter will overwrite empty output file (append is not supported)
           // we need to re-apply file permission
    -      parquetFileWriter = new ParquetFileWriter(conf, schema, path, ParquetFileWriter.Mode.OVERWRITE);
    +      if (useConfiguredBlockSize) {
    --- End diff --
    What we are doing is creating the Parquet file as a single block without changing the
file system's default block size. For example, if the Parquet block size is 512MB (the default)
and the file system block size is 128MB, we create a single file that spans 4 blocks on the
file system. Those blocks can be distributed across different nodes, which is not good for
performance. If we instead lower the Parquet block size to 128MB to match the file system
block size, we end up creating 4 files of one block each for the same amount of data, which
is not good either.
    The JIRA asks for a single HDFS block per Parquet file, even when that file is larger than
the file system block size, without changing the file system block size. The reporters have
the file system block size configured as 128MB. Lowering the Parquet block size (from its
default of 512MB) to match the file system block size creates too many files for them, and
for other reasons they are not able to change the file system block size.
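
    The arithmetic behind the three layouts above can be sketched as follows. This is only
an illustration, not Drill code: the class and method names are hypothetical, and it assumes
each output file holds at most one Parquet block of data, as in the example in this comment.

```java
// Hypothetical helper, not part of Drill: models the block/file counts discussed above.
final class BlockLayout {
    static final long MB = 1024L * 1024L;

    // Number of file system blocks that a file of the given size occupies (ceiling division).
    static long blocksPerFile(long fileSizeBytes, long fsBlockSizeBytes) {
        return (fileSizeBytes + fsBlockSizeBytes - 1) / fsBlockSizeBytes;
    }

    // Number of output files, assuming each file holds at most one Parquet block of data.
    static long filesForData(long dataBytes, long parquetBlockSizeBytes) {
        return (dataBytes + parquetBlockSizeBytes - 1) / parquetBlockSizeBytes;
    }

    public static void main(String[] args) {
        long data = 512 * MB;

        // Case 1: Parquet block 512MB, file system block 128MB.
        System.out.println("case 1: files=" + filesForData(data, 512 * MB)
                + ", blocks per file=" + blocksPerFile(512 * MB, 128 * MB));

        // Case 2: Parquet block lowered to 128MB to match the file system block.
        System.out.println("case 2: files=" + filesForData(data, 128 * MB)
                + ", blocks per file=" + blocksPerFile(128 * MB, 128 * MB));

        // Case 3: Parquet block 512MB, and the file itself is created with a 512MB block size.
        System.out.println("case 3: files=" + filesForData(data, 512 * MB)
                + ", blocks per file=" + blocksPerFile(512 * MB, 512 * MB));
    }
}
```

    Case 3 is the layout the JIRA asks for: one file, one block, with the file system default
left untouched.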

> Set Hdfs Block Size based on Parquet Block Size
> -----------------------------------------------
>                 Key: DRILL-5379
>                 URL: https://issues.apache.org/jira/browse/DRILL-5379
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>    Affects Versions: 1.9.0
>            Reporter: F Méthot
>             Fix For: Future
> It seems there is a way to force Drill to store a CTAS-generated Parquet file as a single
block when using HDFS. The Java HDFS API allows this: files could be created with the Parquet
block size set in a session or system config.
> Ideally, there is a single Parquet file per HDFS block.
> Here is the HDFS API that allows this:
> http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)
> Drill uses the Hadoop ParquetFileWriter (https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileWriter.java).
> This is where the file creation occurs, so it might be tricky.
> However, ParquetRecordWriter.java (https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java)
in Drill creates the ParquetFileWriter with a Hadoop Configuration object.
> Something to explore: could the block size be set as a property within the Configuration
object before passing it to the ParquetFileWriter constructor?
