drill-dev mailing list archives

From Aman Sinha <asi...@maprtech.com>
Subject Re: Directory and file based partition pruning
Date Fri, 11 Sep 2015 01:25:15 GMT
Yes, it is a good point about multiple invocations of the PruneScan rule.
The other point about using Java heap is not correct.  The rule does
off-heap allocation using a memory buffer from QueryContext and releases
the memory in the finally block.
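
As a rough illustration of that allocate-then-release pattern (a sketch only,
not the actual PruneScanRule code): TrackingAllocator and PruneSketch below are
hypothetical names, and a direct ByteBuffer stands in for whatever buffer
QueryContext hands out.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for an off-heap allocator; this is NOT Drill's API.
final class TrackingAllocator {
    private final AtomicLong outstanding = new AtomicLong();

    ByteBuffer allocate(int bytes) {
        outstanding.addAndGet(bytes);
        // A direct buffer's storage lives outside the Java heap.
        return ByteBuffer.allocateDirect(bytes);
    }

    void release(ByteBuffer buf) {
        // Plain Java cannot free a direct buffer explicitly, so this sketch
        // only updates the accounting that a real allocator would maintain.
        outstanding.addAndGet(-buf.capacity());
    }

    long outstandingBytes() {
        return outstanding.get();
    }
}

final class PruneSketch {
    // Evaluates a toy "filter" over the paths using a scratch buffer, and
    // always hands the buffer back, even if evaluation throws.
    static int countMatches(TrackingAllocator allocator, String[] paths, String wanted) {
        ByteBuffer scratch = allocator.allocate(paths.length);
        try {
            for (String p : paths) {
                scratch.put((byte) (p.contains(wanted) ? 1 : 0));
            }
            scratch.flip();
            int matches = 0;
            while (scratch.hasRemaining()) {
                matches += scratch.get();
            }
            return matches;
        } finally {
            allocator.release(scratch);
        }
    }

    public static void main(String[] args) {
        TrackingAllocator allocator = new TrackingAllocator();
        String[] paths = {"/t/1994/a.parquet", "/t/1995/b.parquet"};
        System.out.println(countMatches(allocator, paths, "/1994/"));
        System.out.println(allocator.outstandingBytes());
    }
}
```

The point of the finally block is that the release is tied to the rule
invocation itself, so the buffer's lifetime does not depend on when GC runs.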

Aman

On Thu, Sep 10, 2015 at 6:18 PM, Jinfeng Ni <jinfengni99@gmail.com> wrote:

> I opened DRILL-3765 for the multiple rule execution issue:
>
> https://issues.apache.org/jira/browse/DRILL-3765
>
>
> On Thu, Sep 10, 2015 at 5:34 PM, Jinfeng Ni <jinfengni99@gmail.com> wrote:
> > It seems to me that one important reason we run out of heap memory in the
> > partition pruning rule is that the rule itself is invoked multiple times,
> > even after the filter has been pushed into the scan in the first call.
> >
> > I tried a simple unit test,
> > TestPartitionFilter:testPartitionFilter1_Parquet_from_CTAS(). Here is the
> > number of times each partition pruning rule fired in the Calcite trace:
> >
> >  #_rule_fire,  rule name
> >
> >  4 [PruneScanRule:Filter_On_Project_Parquet]
> >  4 [PruneScanRule:Filter_On_Project]
> >
> >  2 [PruneScanRule:Filter_On_Scan_Parquet]
> >  2 [PruneScanRule:Filter_On_Scan]
> >
> > Setting a breakpoint in PruneScanRule where it calls the interpreter to
> > evaluate the expression, I could see that the code stops 6 times at that
> > point, meaning that Drill has to build the vector containing the
> > filenames at least 6 times.  That would cause a lot of heap memory
> > consumption if GC does not kick in to release the memory used in the
> > prior rule's execution.
> >
> > I think making partition pruning multi-phase will help to reduce the
> > memory consumption. But for now, it seems important to avoid the repeated
> > and unnecessary rule execution.
> >
> >
> >
> >
> >
> > On Thu, Sep 10, 2015 at 4:42 PM, Aman Sinha <asinha@maprtech.com> wrote:
> >>
> >> Agree on the N-phased approach.  I have filed a JIRA for the enhancement:
> >> DRILL-3759.
> >> Regarding the simplification of the expression tree logic: did you mean the
> >> logic in FindPartitionConditions or the Interpreter?
> >> Perhaps you can add comments in the JIRA with some explanation.  I am in
> >> favor of simplification where possible.
> >>
> >> On Wed, Sep 9, 2015 at 10:39 PM, Jacques Nadeau <jacques@dremio.com> wrote:
> >>
> >> > Makes sense.
> >> >
> >> > Is there a way we can do this with lazy materialization rather than
> >> > writing complex expression tree logic? I hate having all this custom
> >> > expression tree manipulation logic.
> >> >
> >> > Also, it seems like this should be N-phased rather than two-phased, where
> >> > N is the number of directories below the base path.
> >> >
> >> > Thoughts?
> >> > On Sep 9, 2015 10:54 AM, "Aman Sinha" <amansinha@apache.org> wrote:
> >> >
> >> > > Currently, partition pruning gets all file names in the table and
> >> > > applies the pruning.  When the files are spread out over several
> >> > > directories and there is a filter on dirN, this is not efficient,
> >> > > both in terms of elapsed time and memory usage.  This has been seen
> >> > > in a few use cases recently.
> >> > >
> >> > > We should ideally perform the pruning in 2 steps: first get the
> >> > > top-level directory names only and apply the directory filter, then
> >> > > get the filenames within that directory and apply the remaining
> >> > > filters.
> >> > >
> >> > > I will create a JIRA for this enhancement but let me know your
> >> > > thoughts...
> >> > >
> >> > > Aman
> >> > >
> >> >
> >
> >
>
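
The phased pruning discussed in the thread (filter directory names one level at
a time, then list files only under the directories that survive) can be
sketched roughly as follows. This is an illustration under stated assumptions,
not Drill code: PhasedPruning is a hypothetical class, the file system is
modeled as an in-memory map from a directory path to its child names, and each
predicate plays the role of a filter on dir0, dir1, and so on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of N-phase directory pruning; not Drill's implementation.
final class PhasedPruning {
    // tree: directory path -> names of its immediate children (assumed model).
    // levelFilters.get(i) decides which directory names survive at depth i.
    static List<String> prune(Map<String, List<String>> tree,
                              String base,
                              List<Predicate<String>> levelFilters) {
        List<String> survivors = Collections.singletonList(base);
        for (Predicate<String> filter : levelFilters) {
            List<String> next = new ArrayList<>();
            for (String dir : survivors) {
                // Only children of surviving directories are ever listed.
                for (String child : tree.getOrDefault(dir, Collections.emptyList())) {
                    if (filter.test(child)) {
                        next.add(dir + "/" + child);
                    }
                }
            }
            survivors = next;
        }
        // Final phase: collect file names under the surviving directories.
        List<String> files = new ArrayList<>();
        for (String dir : survivors) {
            for (String file : tree.getOrDefault(dir, Collections.emptyList())) {
                files.add(dir + "/" + file);
            }
        }
        return files;
    }

    public static void main(String[] args) {
        Map<String, List<String>> tree = new HashMap<>();
        tree.put("/t", Arrays.asList("1994", "1995"));
        tree.put("/t/1994", Arrays.asList("a.parquet", "b.parquet"));
        tree.put("/t/1995", Arrays.asList("c.parquet"));

        List<Predicate<String>> filters = new ArrayList<>();
        filters.add(name -> name.equals("1994")); // stands in for a dir0 filter
        System.out.println(prune(tree, "/t", filters));
    }
}
```

In this model the children of a pruned directory are never listed, which is
where the savings in elapsed time and memory would come from. With a single
level filter it reduces to the two-step version; one filter per directory level
gives the N-phased variant.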
