drill-dev mailing list archives

From Julian Hyde <jh...@apache.org>
Subject Re: Moving directory based pruning to fire earlier
Date Tue, 24 Nov 2015 02:51:51 GMT
You can carry on using your own formula, but move the formula into a metadata provider. You
just don’t need to create a subclass in order for it to get called. For example, if you’ve
written 

  public class DrillLogicalFilter extends LogicalFilter {
    public double getRows() {
      return <<my formula>>;
    }
  }

and getRows() is its only method, you can make it obsolete and register the following metadata provider:

  public class DrillMdRowCount {
    public Double getRowCount(LogicalFilter filter) {
      return <<my formula>>;
    }
  }

Calcite uses double dispatch (dispatching to a method based on the provider AND its first argument
type), so the method will be called automatically.
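
To make the dispatch idea concrete, here is a minimal, self-contained sketch. It is not Calcite's implementation and uses no Calcite classes: Rel, Filter, Join, RowCountProvider, and rowCount() are all hypothetical stand-ins, and the row counts are made-up constants. It only illustrates how a handler can be picked by reflection from the runtime type of the first argument, falling back to a handler for a superclass when no exact match exists.

```java
import java.lang.reflect.Method;

final class DispatchSketch {
  // Hypothetical stand-ins for relational operators (not Calcite classes).
  public static class Rel {}
  public static class Filter extends Rel {}
  public static class Join extends Rel {}

  // Hypothetical provider: one public handler per operator type it cares about.
  public static class RowCountProvider {
    public Double getRowCount(Filter filter) {
      return 25.0; // placeholder for a filter-specific formula
    }

    public Double getRowCount(Rel rel) {
      return 100.0; // placeholder default for any other operator
    }
  }

  // Walks up the class hierarchy of 'rel' and invokes the first
  // getRowCount(X) handler whose parameter type X matches exactly.
  static Double rowCount(Object provider, Rel rel) {
    for (Class<?> c = rel.getClass(); c != null; c = c.getSuperclass()) {
      try {
        Method m = provider.getClass().getMethod("getRowCount", c);
        return (Double) m.invoke(provider, rel);
      } catch (NoSuchMethodException e) {
        // No handler declared for this class; try its superclass next.
      } catch (ReflectiveOperationException e) {
        throw new IllegalStateException(e);
      }
    }
    throw new IllegalArgumentException("no handler for " + rel.getClass());
  }

  public static void main(String[] args) {
    RowCountProvider provider = new RowCountProvider();
    System.out.println(rowCount(provider, new Filter())); // uses the Filter handler
    System.out.println(rowCount(provider, new Join()));   // falls back to the Rel handler
  }
}
```

The point of the sketch is that neither Filter nor Join needs a subclass to change its estimate; adding or replacing a handler method on the provider is enough.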

Julian



> On Nov 23, 2015, at 5:56 PM, Jinfeng Ni <jinfengni99@gmail.com> wrote:
> 
> My understanding is that RelMetadataProvider gives the estimates of row
> count, distinct row count, etc. But it's still up to each Rel node to
> decide how to estimate its own cost, given the row count, distinct
> row count, etc. from the MetadataProvider. Are you suggesting we completely
> remove Drill's cost estimation method and use Calcite's
> default one?
> 
> 
> 
> On Mon, Nov 23, 2015 at 5:35 PM, Julian Hyde <jhyde@apache.org> wrote:
>> Yes. You don’t need an “implement” method (or yours can just throw).
>> 
>> You could use your own serialization to/from JSON or you could use RelJsonWriter/RelJsonReader.
>> 
>> Julian
>> 
>> 
>>> On Nov 23, 2015, at 5:31 PM, Jacques Nadeau <jacques@dremio.com> wrote:
>>> 
>>> We could create serializers and deserializers for the logical plan stuff.
>>> It looks like we can resolve the costing through metadata providers unless
>>> I misunderstood what Julian was suggesting.
>>> 
>>> 
>>> 
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>> 
>>> On Mon, Nov 23, 2015 at 5:12 PM, Jinfeng Ni <jinfengni99@gmail.com> wrote:
>>> 
>>>> @Jacques,
>>>> 
>>>> Every DrillLogicalRel has to override computeSelfCost() and implement
>>>> the implement() method. The latter is to get the Logical Plan, which is one of
>>>> three input types Drill should accept (SQL, Logical Plan, Physical
>>>> Plan).
>>>> 
>>>> So, for now, we do have to override/extend all DrillLogicalRel.
>>>> 
>>>> 
>>>> On Mon, Nov 23, 2015 at 4:55 PM, Julian Hyde <jhyde@apache.org> wrote:
>>>>> I’m not sure what properties / behavior you want to override, but
>>>>> remember that Calcite specifies a lot of things as traits or metadata.
>>>>> 
>>>>> For example, “double RelNode.getRows()” is deprecated and you would
>>>>> these days use RelMetadataQuery.getRowCount(). You would not need to
>>>>> sub-class a RelNode to override its row-count estimate, just supply a
>>>>> different metadata provider.
>>>>> 
>>>>> Julian
>>>>> 
>>>>> 
>>>>>> On Nov 23, 2015, at 4:50 PM, Jacques Nadeau <jacques@dremio.com> wrote:
>>>>>> 
>>>>>> Yes, my suggestion is removal of DRILL_LOGICAL. @Hsuan, this is independent
>>>>>> from the number of phases and I'm not suggesting changing that.
>>>>>> 
>>>>>> My main thought was: if we only need to override one or two rels, do only
>>>>>> that rather than having a wholesale copy of every operator with a bunch of
>>>>>> basic noop rules.
>>>>>> 
>>>>>> --
>>>>>> Jacques Nadeau
>>>>>> CTO and Co-Founder, Dremio
>>>>>> 
>>>>>> On Mon, Nov 23, 2015 at 4:37 PM, Jinfeng Ni <jinfengni99@gmail.com> wrote:
>>>>>> 
>>>>>>> @Jacques, are you talking about removing the convention DRILL_LOGICAL?
>>>>>>> 
>>>>>>> DrillRel extends Calcite's LogicalRel. It overrides some of LogicalRel's
>>>>>>> methods, and adds new methods. Therefore, even if we remove the
>>>>>>> DRILL_LOGICAL convention, we still have to maintain a set of classes
>>>>>>> extended from Calcite Logical. I'm not clear what benefit we would get by
>>>>>>> removing the DRILL_LOGICAL convention.
>>>>>>> 
>>>>>>> If we want to remove the complete set of DrillLogical classes, then
>>>>>>> I'm not sure where we put the Drill specific logic, for instance,
>>>>>>> Drill Join has certain restrictions different from Calcite Join.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, Nov 23, 2015 at 4:11 PM, Hsuan Yi Chu <hyichu@maprtech.com> wrote:
>>>>>>>> My understanding is:
>>>>>>>> In logical planning, we determine the "structure" of the tree (e.g., join
>>>>>>>> order).
>>>>>>>> And then in physical, we determine the implementation (e.g., merge vs hash
>>>>>>>> join).
>>>>>>>> 
>>>>>>>> This staging seems clean to me. So what is the motivation to merge them all
>>>>>>>> together?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mon, Nov 23, 2015 at 2:51 PM, Jacques Nadeau <jacques@dremio.com> wrote:
>>>>>>>> 
>>>>>>>>> Anybody think we should just get rid of Drels (Rel > Drel > Prel) and use
>>>>>>>>> Calcite's logical representation directly (Rel > Prel)?
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Jacques Nadeau
>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>> 
>>>>>>>>> On Mon, Nov 23, 2015 at 1:57 PM, Mehant Baid <baid.mehant@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Currently all rules based on Calcite logical rels and Drill logical rels
>>>>>>>>>> are put together and are fired together. As part of DRILL-3996, Jinfeng
>>>>>>>>>> will break it down into different phases. I should be able to take
>>>>>>>>>> advantage of this and move the directory based partition pruning to fire
>>>>>>>>>> based on Calcite rels.
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> Mehant
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 11/23/15 10:58 AM, Hanifi GUNES wrote:
>>>>>>>>>> 
>>>>>>>>>>> The general idea of multi-phase pruning makes sense to me. I am wondering,
>>>>>>>>>>> though, are we referring to introducing a new planning phase before the
>>>>>>>>>>> logical or separating out the logic so as to make directory pruning kick
>>>>>>>>>>> off ahead of column partitioning?
>>>>>>>>>>> 
>>>>>>>>>>> 2015-11-23 10:33 GMT-08:00 Mehant Baid <baid.mehant@gmail.com>:
>>>>>>>>>>> 
>>>>>>>>>>>> As part of DRILL-3996 <https://issues.apache.org/jira/browse/DRILL-3996>
>>>>>>>>>>>> Jinfeng mentioned that he plans to move the directory based pruning rule
>>>>>>>>>>>> earlier than column based pruning. I want to expand on that a little,
>>>>>>>>>>>> provide the motivation and gather thoughts/feedback.
>>>>>>>>>>>> 
>>>>>>>>>>>> Currently both the directory based pruning and the column based pruning
>>>>>>>>>>>> are fired in the same planning phase and are based on Drill logical rels.
>>>>>>>>>>>> This is not optimal in the case where data is organized in such a way that
>>>>>>>>>>>> both directory based pruning and column based pruning can be applied (when
>>>>>>>>>>>> the data is organized with a nested directory structure plus the individual
>>>>>>>>>>>> files contain partition columns). As part of creating the Drill logical
>>>>>>>>>>>> scan we read the footers of all the files involved. If the directory based
>>>>>>>>>>>> pruning rule is fired earlier (rule to fire based on Calcite logical rels)
>>>>>>>>>>>> then we will be able to prune out unnecessary directories and save the work
>>>>>>>>>>>> of reading the footers of these files.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Mehant
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 