druid-dev mailing list archives

From Anastasia Braginsky <anas...@oath.com.INVALID>
Subject Re: A question about Druid design
Date Wed, 13 Jun 2018 13:56:17 GMT
 Hi Everyone,
Could I please have your attention? The Oak project is under way, and I would like to join the
next weekly video meeting to present our progress. However, we are still unsure about the
Rollup- vs. Plain- FactsHolder question. Could someone please read the email chain below and help
with an answer? Or would it be better to discuss it in the meeting?


    On Thursday, May 31, 2018, 6:40:12 PM GMT+3, Anastasia Braginsky <anastas@oath.com>
  Hi Gian,
Thanks for the explanations! 
I have one more question:

You say that
"...the RollupFactsHolder there will be a _single_ fact row per TimeAndDims... But with the
PlainFactsHolder there may be more than one fact row per TimeAndDims..."

In PlainFactsHolder we actually have more than one fact row per Timestamp, or am I missing
something? I mean, in RollupFactsHolder, could you scan only the TimeAndDims (leading to rows)
with a given Timestamp and get the same result? Is it true that TimeAndDims are ordered first
by time anyway?
I am most likely missing something and would just like to understand what :)
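
To make the question above concrete, here is a minimal sketch of scanning a time-first sorted map for all keys with one Timestamp. It is an illustration only, not Druid or Oak code: the `Key` class (a stand-in for TimeAndDims with a single string dimension), the `TIME_THEN_DIMS` comparator, and the `rowsAt` helper are all hypothetical names, and the map value is assumed to be a simple row count.

```java
import java.util.Comparator;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class TimeFirstOrderingSketch {
    // Hypothetical stand-in for TimeAndDims: a timestamp plus one dimension value.
    static final class Key {
        final long timestamp;
        final String dim;

        Key(long timestamp, String dim) {
            this.timestamp = timestamp;
            this.dim = dim;
        }
    }

    // Keys are ordered by timestamp first, then by the dimension value.
    static final Comparator<Key> TIME_THEN_DIMS =
        Comparator.comparingLong((Key k) -> k.timestamp)
                  .thenComparing((Key k) -> k.dim);

    // Returns a view of all entries whose key has exactly the given timestamp.
    // The empty string is the smallest String, so (ts, "") is the lowest
    // possible key for ts and (ts + 1, "") is an exclusive upper bound.
    // Assumes ts < Long.MAX_VALUE.
    static ConcurrentNavigableMap<Key, Long> rowsAt(
            ConcurrentSkipListMap<Key, Long> facts, long ts) {
        return facts.subMap(new Key(ts, ""), true, new Key(ts + 1, ""), false);
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap<Key, Long> facts =
            new ConcurrentSkipListMap<>(TIME_THEN_DIMS);
        facts.put(new Key(2000L, "a"), 1L);
        facts.put(new Key(1000L, "b"), 1L);
        facts.put(new Key(1000L, "a"), 2L);

        // Only the two keys with timestamp 1000 fall inside the range.
        System.out.println(rowsAt(facts, 1000L).size());
    }
}
```

Because the timestamp is the leading sort field, all keys that share a Timestamp are contiguous, so a per-Timestamp scan is a range scan over the sorted map.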

    On Wednesday, May 30, 2018, 10:56:26 AM GMT+3, Gian Merlino <gianmerlino@gmail.com>
 Hi Anastasia,

1) At ingestion time the FactsHolder is sorted. The unsorted code path is
used by groupBy v1, which hasn't been common since groupBy v2 was made the
default a few releases ago. So I would only worry about the sorted case.

2) PlainFactsHolder is used when the user has disabled rollup at ingestion
time. The idea is that with the RollupFactsHolder there will be a _single_
fact row per TimeAndDims (and Druid may combine multiple input rows into
one indexed fact row). But with the PlainFactsHolder there may be more than
one fact row per TimeAndDims (in particular: there will be one fact row per
input row).
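
The difference in point 2 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual Druid classes: `Key` is a hypothetical stand-in for TimeAndDims with one string dimension, the only aggregator is a row count, and the plain variant is modeled as a timestamp-sorted map of row queues, as described in the quoted message below.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.TreeMap;

public class FactsHolderSketch {
    // Hypothetical stand-in for TimeAndDims, ordered by time and then by dimension.
    static final class Key implements Comparable<Key> {
        final long timestamp;
        final String dim;

        Key(long timestamp, String dim) {
            this.timestamp = timestamp;
            this.dim = dim;
        }

        @Override
        public int compareTo(Key other) {
            int c = Long.compare(timestamp, other.timestamp);
            return c != 0 ? c : dim.compareTo(other.dim);
        }
    }

    // Rollup: a single fact row per key; equal keys are combined into one
    // row whose value here is the number of input rows that were merged.
    static TreeMap<Key, Long> rollup(List<Key> inputRows) {
        TreeMap<Key, Long> facts = new TreeMap<>();
        for (Key row : inputRows) {
            facts.merge(row, 1L, Long::sum);
        }
        return facts;
    }

    // Plain: one fact row per input row, queued under its timestamp,
    // so equal keys are kept as separate rows.
    static TreeMap<Long, Deque<Key>> plain(List<Key> inputRows) {
        TreeMap<Long, Deque<Key>> facts = new TreeMap<>();
        for (Key row : inputRows) {
            facts.computeIfAbsent(row.timestamp, t -> new ArrayDeque<>()).add(row);
        }
        return facts;
    }

    public static void main(String[] args) {
        List<Key> input = Arrays.asList(
            new Key(1000L, "a"), new Key(1000L, "a"), new Key(1000L, "b"));
        System.out.println("rollup fact rows: " + rollup(input).size());
        System.out.println("plain fact rows:  " + plain(input).get(1000L).size());
    }
}
```

With two identical input rows and one distinct row at the same timestamp, the rollup sketch keeps two fact rows while the plain sketch keeps all three.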

Hope this helps.

On Wed, May 30, 2018 at 12:14 AM, Anastasia Braginsky <
anastas@oath.com.invalid> wrote:

> Hi,
> Recall our suggestion to use the new concurrent map named Oak as a base
> for Incremental Index. Oak stands for Off-heap Allocated Keys, for more
> details please see issue #5698. We have made great progress with Oak
> integration and stabilizing OakIndex performance. We have some questions
> regarding FactsHolder. As we explained in our design document and
> refactoring suggestion we prefer to remove the FactsHolder usage in
> the OakIndex, because Oak maps the keys (Time&Dims) to the values
> (Aggregators) directly. Therefore the Oak mapping is always sorted and only
> from keys to values. From here we have two questions.
> 1. Unsorted FactsHolder: It is understandable that unsorted mapping via
> HashMap (O(1) access) might be faster than sorted mapping (O(logN) access).
> The question is whether the unsorted variant is used frequently. When is it
> used? And is it acceptable that in this case Oak will give slightly lower
> performance?
> 2. Regarding Plain- vs Rollup- FactsHolder: It can be seen that
> PlainFactsHolder is holding a queue of Key->Value (Time&Dims->Aggregator)
> per Timestamp, where the sorting is via Timestamp. Therefore, Oak
> mostly implements the sorted RollupFactsHolder logic. Additionally, Timestamp
> is also a part of Time&Dims and the sorting is first by
> Timestamp, then by the other dimensions. The question is: what are the use cases
> where PlainFactsHolder rather than RollupFactsHolder is used? And is there any
> functionality that can be provided by Plain but not by Rollup?
> Thanks,
> Anastasia