drill-dev mailing list archives

From François Méthot <fmetho...@gmail.com>
Subject Re: Single Hdfs block per parquet file
Date Fri, 24 Mar 2017 13:45:12 GMT
Done. Thanks for the feedback.

https://issues.apache.org/jira/browse/DRILL-5379


On Thu, Mar 23, 2017 at 4:29 PM, Kunal Khatua <kkhatua@mapr.com> wrote:

> This seems like a reasonable feature request. It could also be expanded to
> detect the underlying block size for the location being written to.
>
>
> Could you file a JIRA for this?
>
>
> Thanks
>
> Kunal
>
> ________________________________
> From: François Méthot <fmethot78@gmail.com>
> Sent: Thursday, March 23, 2017 9:08:51 AM
> To: dev@drill.apache.org
> Subject: Re: Single Hdfs block per parquet file
>
> After further investigation, Drill uses the hadoop ParquetFileWriter (
> https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileWriter.java
> ).
> This is where the file creation occurs so it might be tricky after all.
>
> However, ParquetRecordWriter.java (
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java
> ) in Drill creates the ParquetFileWriter with a Hadoop Configuration object.
>
> Something to explore, though: could the block size be set as a property
> within the Configuration object before passing it to the ParquetFileWriter
> constructor?
>
> François
>
> On Wed, Mar 22, 2017 at 11:55 PM, Padma Penumarthy <ppenumarthy@mapr.com>
> wrote:
>
> > Yes, it seems possible to create files with different block sizes.
> > We could potentially pass the configured store.parquet.block-size to the
> > create call.
> > I will try it out and let you know.
> >
> > Thanks,
> > Padma
> >
> >
> > > On Mar 22, 2017, at 4:16 PM, François Méthot <fmethot78@gmail.com> wrote:
> > >
> > > Here are 2 links I could find:
> > >
> > > > http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)
> > > >
> > > > http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)
> > >
> > > Francois
> > >
> > > > On Wed, Mar 22, 2017 at 4:29 PM, Padma Penumarthy <ppenumarthy@mapr.com> wrote:
> > >
> > >> I think we create one file for each parquet block.
> > > >> If the underlying HDFS block size is 128 MB and the Parquet block size
> > > >> is > 128 MB, it will create more blocks on HDFS.
> > > >> Can you let me know which HDFS API would allow you to
> > > >> do otherwise?
> > >>
> > >> Thanks,
> > >> Padma
> > >>
> > >>
> > > >>> On Mar 22, 2017, at 11:54 AM, François Méthot <fmethot78@gmail.com> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > > >>> Is there a way to force Drill to store a CTAS-generated Parquet file as a
> > > >>> single block when using HDFS? The Java HDFS API allows this: files could
> > > >>> be created with the Parquet block size.
> > > >>>
> > > >>> We are using Drill on HDFS configured with a block size of 128 MB. Changing
> > > >>> this size is not an option at this point.
> > > >>>
> > > >>> It would be ideal for us to have a single Parquet file per HDFS block.
> > > >>> Setting store.parquet.block-size to 128 MB would fix our issue, but we would
> > > >>> end up with a lot more files to deal with.
> > >>>
> > >>> Thanks
> > >>> Francois
> > >>
> > >>
> >
> >
>
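
The trade-off discussed in this thread can be illustrated with plain arithmetic. The sketch below is not Drill or Hadoop code: the class name BlockMath, its methods, and the 10 GB data volume are hypothetical, and it assumes (as suggested in the thread) that each output file holds roughly one Parquet block, ignoring footer and metadata overhead.

```java
// Illustrative arithmetic only; no Hadoop or Drill classes are used.
final class BlockMath {
    static final long MB = 1024L * 1024L;

    // Ceiling division for non-negative amounts and a positive unit size.
    private static long ceilDiv(long amount, long unit) {
        if (amount < 0 || unit <= 0) {
            throw new IllegalArgumentException("amount must be >= 0 and unit > 0");
        }
        return (amount + unit - 1) / unit;
    }

    // Number of HDFS blocks one file of fileBytes occupies.
    static long hdfsBlocksSpanned(long fileBytes, long hdfsBlockBytes) {
        return ceilDiv(fileBytes, hdfsBlockBytes);
    }

    // Approximate file count, assuming one Parquet block per file.
    static long filesWritten(long totalBytes, long parquetBlockBytes) {
        return ceilDiv(totalBytes, parquetBlockBytes);
    }

    public static void main(String[] args) {
        long hdfsBlock = 128 * MB;      // HDFS block size from the thread
        long total = 10L * 1024 * MB;   // hypothetical 10 GB of CTAS output
        for (long parquetBlock : new long[] {128 * MB, 512 * MB}) {
            System.out.printf(
                "parquet block %d MB: ~%d files, each spanning %d HDFS block(s)%n",
                parquetBlock / MB,
                filesWritten(total, parquetBlock),
                hdfsBlocksSpanned(parquetBlock, hdfsBlock));
        }
    }
}
```

With a 128 MB HDFS block size, matching store.parquet.block-size to 128 MB keeps each file in one HDFS block but multiplies the file count, while a larger Parquet block size gives fewer files that each span several HDFS blocks. Creating each file with an HDFS block size equal to the Parquet block size, as proposed above, would be a way to get both.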
