druid-dev mailing list archives

From Gian Merlino <g...@apache.org>
Subject Re: A question about Druid design
Date Tue, 19 Jun 2018 15:55:41 GMT
Hi Anastasia,

Sorry for the delay in getting back to you. You're right that the
PlainFactsHolder is indexed by timestamp, not by TimeAndDims -- earlier I
was answering from memory and not from actually looking at the code! Shows
what I get for doing that.

The idea with RollupFactsHolder is that at ingestion time we are doing a
group-by on time (truncated based on queryGranularity) and dimensions, as
described in the "roll up" section here:
http://druid.io/docs/latest/design/index.html#roll-up. So there will be
only one row per TimeAndDims (since we're aggregating input rows using
TimeAndDims as a key). And the idea with PlainFactsHolder is that we aren't
doing any rollup at all, we're just storing one row in Druid corresponding
to one row in the input. IIRC the only reason we have a map in that case is
because we want to be able to quickly iterate the rows in time-sorted order
(query engines like timeseries depend on this ability).
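
As a rough illustration of the distinction described above, here is a minimal
Java sketch. It is not Druid's actual code: the class FactsHolderSketch, the
InputRow and TimeAndDims types, and the rollup() and plain() methods are all
hypothetical stand-ins, with a single string dimension and a single summed
long metric assumed for brevity. The rollup variant keeps one aggregated row
per (truncated time, dimension) key; the plain variant keeps one stored row
per input row, in a sorted map keyed by timestamp only so that rows can be
iterated in time order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

class FactsHolderSketch {
    // Hypothetical input row: a timestamp, one dimension value, one metric.
    static final class InputRow {
        final long timestamp;
        final String dim;
        final long metric;

        InputRow(long timestamp, String dim, long metric) {
            this.timestamp = timestamp;
            this.dim = dim;
            this.metric = metric;
        }
    }

    // Hypothetical stand-in for TimeAndDims: ordered by time, then dimension.
    static final class TimeAndDims implements Comparable<TimeAndDims> {
        final long time;
        final String dim;

        TimeAndDims(long time, String dim) {
            this.time = time;
            this.dim = dim;
        }

        @Override
        public int compareTo(TimeAndDims other) {
            int byTime = Long.compare(time, other.time);
            return byTime != 0 ? byTime : dim.compareTo(other.dim);
        }
    }

    // Rollup: group by (truncated time, dims), so each key maps to exactly
    // one aggregated value (a sum here, as an assumed aggregator).
    static TreeMap<TimeAndDims, Long> rollup(List<InputRow> rows, long granularityMillis) {
        TreeMap<TimeAndDims, Long> facts = new TreeMap<>();
        for (InputRow row : rows) {
            long bucket = Math.floorDiv(row.timestamp, granularityMillis) * granularityMillis;
            facts.merge(new TimeAndDims(bucket, row.dim), row.metric, Long::sum);
        }
        return facts;
    }

    // Plain: no aggregation. The sorted map is keyed by timestamp only and
    // exists so that rows can be iterated in time order.
    static TreeMap<Long, List<InputRow>> plain(List<InputRow> rows) {
        TreeMap<Long, List<InputRow>> facts = new TreeMap<>();
        for (InputRow row : rows) {
            facts.computeIfAbsent(row.timestamp, t -> new ArrayList<>()).add(row);
        }
        return facts;
    }

    public static void main(String[] args) {
        List<InputRow> rows = Arrays.asList(
            new InputRow(1000L, "a", 1L),
            new InputRow(2000L, "a", 2L),
            new InputRow(61000L, "b", 5L),
            new InputRow(1000L, "a", 4L));
        System.out.println("rollup keys: " + rollup(rows, 60000L).size());
        System.out.println("plain timestamps: " + plain(rows).size());
    }
}
```

In this sketch, the three input rows with dimension "a" fall into the same
one-minute bucket and collapse into a single rolled-up row, while the plain
variant still holds all four input rows.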

On Wed, Jun 13, 2018 at 6:56 AM Anastasia Braginsky
<anastas@oath.com.invalid> wrote:

> Hi Everyone,
> Could I please call for your attention? The Oak project is on, and I
> would like to join the next weekly video meeting to present our
> progress. However, we are still in doubt regarding Rollup- vs Plain-
> FactsHolder. Could someone please read the email chain below and help with
> an answer? Or would it be better to discuss it in the meeting?
> Thanks,
> Anastasia
>     On Thursday, May 31, 2018, 6:40:12 PM GMT+3, Anastasia Braginsky <
> anastas@oath.com> wrote:
>   Hi Gian,
> Thanks for the explanations!
> I have one more question:
> You say that
> "...the RollupFactsHolder there will be a _single_ fact row per
> TimeAndDims... But with the PlainFactsHolder there may be more than one
> fact row per TimeAndDims..."
> In PlainFactsHolder we actually have more than one fact row per
> Timestamp, or am I missing something? I mean, in RollupFactsHolder,
> could you scan only the TimeAndDims (leading to rows) with some
> Timestamp and get the same result? Is it true that TimeAndDims are
> ordered first by time anyway?
> I am most likely missing something, just would like to understand what :)
> Thanks,
> Anastasia
>     On Wednesday, May 30, 2018, 10:56:26 AM GMT+3, Gian Merlino <
> gianmerlino@gmail.com> wrote:
>  Hi Anastasia,
> 1) At ingestion time the FactsHolder is sorted. The unsorted code path is
> used by groupBy v1, which hasn't been common since groupBy v2 was made the
> default a few releases ago. So I would only worry about the sorted case.
> 2) PlainFactsHolder is used when the user has disabled rollup at ingestion
> time. The idea is that with the RollupFactsHolder there will be a _single_
> fact row per TimeAndDims (and Druid may combine multiple input rows into
> one indexed fact row). But with the PlainFactsHolder there may be more than
> one fact row per TimeAndDims (in particular: there will be one fact row per
> input row).
> Hope this helps.
> On Wed, May 30, 2018 at 12:14 AM, Anastasia Braginsky <
> anastas@oath.com.invalid> wrote:
> > Hi,
> > Recall our suggestion to use the new concurrent map named Oak as a base
> > for Incremental Index. Oak stands for Off-heap Allocated Keys; for more
> > details please see issue #5698. We have made great progress with Oak
> > integration and with stabilizing OakIndex performance. We have some
> > questions regarding FactsHolder. As we explained in our design document
> > and refactoring suggestion, we prefer to remove the FactsHolder usage in
> > the OakIndex, because Oak maps the keys (Time&Dims) to the values
> > (Aggregators) directly. Therefore the Oak mapping is always sorted and
> > only from keys to values. From here we have two questions.
> >
> > 1. Unsorted FactsHolder: It is understandable that unsorted mapping via
> > HashMap (O(1) access) might be faster than sorted mapping (O(log N)
> > access). The question is whether the unsorted variant is used
> > frequently. When is it used? And is it acceptable that in this case Oak
> > will give slightly lower performance?
> >
> > 2. Regarding Plain- vs Rollup- FactsHolder: It can be seen that
> > PlainFactsHolder is holding a queue of Key->Value (Time&Dims->Aggregator)
> > per Timestamp, where the sorting is via Timestamp. Therefore, Oak
> > implements mostly the sorted RollupFactsHolder logic. Additionally,
> > Timestamp is also a part of Time&Dims, and the sorting is first by
> > Timestamp, then by the other dimensions. The question is: what are the
> > use cases where PlainFactsHolder, and not Rollup, is used? And is there
> > any functionality that can be given by Plain but not by Rollup?
> >
> > Thanks,
> > Anastasia
> >
