ignite-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From eugene miretsky <eugene.miret...@gmail.com>
Subject Re: Query 3x slower with index
Date Wed, 19 Sep 2018 18:26:54 GMT
Hi Ilya,

I created 4 indexs on the table:
1) ga_pKey: on customer_id, dt, category_id (that's our primary key columns)
2) ga_customer_and_category_id: on customer_id and category_id
2) ga_customer_id: on customer_id
4) ga_category_id: on category_id


For the first query (category in ()), the execution plan when using the
first 3 index is exactly the same  - using /* PUBLIC.AFFINITY_KEY */
When using #4 (alone or in combination with any of the other 3)

   1. /* PUBLIC.AFFINITY_KEY */ is replaced with  /* PUBLIC.GA_CATEGORY_ID:
   CATEGORY_ID IN(117930, 175930, 175940, 175945, 101450) */
   2. The query runs slower.

For the second query (join on an inlined table) the behaviour is very
similar. Using the first 3 indexes results in the same plan - using  /*
PUBLIC.AFFINITY_KEY */ and  /* function: CATEGORY_ID = GA__Z0.CATEGORY_ID
*/.
When using #4 (alone or in combination with any of the other 3)

   1. /* function */ and /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID =
   CATS__Z1.CATEGORY_ID */ are used
   2. The query is much slower.


Theoretically the query seems pretty simple

   1. Use affinity key  to make sure the query runs in parallel and there
   are no shuffles
   2. Filter rows that match category_id using the category_id index
   3. Used customer_id index for the group_by (not sure if this step makes
   sense)

But I cannot get it to work.

Cheers,
Eugene




On Tue, Sep 18, 2018 at 10:56 AM Ilya Kasnacheev <ilya.kasnacheev@gmail.com>
wrote:

> Hello!
>
> I can see you try to use _key_PK as index. If your primary key is
> composite, it won't work properly for you. I recommend creating an explicit
> (category_id, customer_id) index.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> вт, 18 сент. 2018 г. в 17:47, eugene miretsky <eugene.miretsky@gmail.com>:
>
>> Hi Ilya,
>>
>> The different query result was my mistake - one of the categoy_ids was
>> duplicate, so in the query that used join, it counted rows for that
>> category twice. My apologies.
>>
>> However, we are still having an issue with query time, and the index not
>> being applied to category_id. Would appreciate if you could take a look.
>>
>> Cheers,
>> Eugene
>>
>> On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev <
>> ilya.kasnacheev@gmail.com> wrote:
>>
>>> Hello!
>>>
>>> Why don't you diff the results of those two queries, tell us what the
>>> difference is?
>>>
>>> Regards,
>>> --
>>> Ilya Kasnacheev
>>>
>>>
>>> пн, 17 сент. 2018 г. в 16:08, eugene miretsky <eugene.miretsky@gmail.com
>>> >:
>>>
>>>> Hello,
>>>>
>>>> Just wanted to see if anybody had time to look into this.
>>>>
>>>> Cheers,
>>>> Eugene
>>>>
>>>> On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <
>>>> eugene.miretsky@gmail.com> wrote:
>>>>
>>>>> Thanks!
>>>>>
>>>>> Tried joining with an inlined table instead of IN as per the second
>>>>> suggestion, and it didn't quite work.
>>>>>
>>>>> Query1:
>>>>>
>>>>>    - Select COUNT(*) FROM( Select customer_id from GATABLE3  use
>>>>>    Index( ) where category_id in (9005, 175930, 175930, 175940,175945,101450,
>>>>>    6453) group by customer_id having SUM(product_views_app) > 2 OR
>>>>>    SUM(product_clicks_app) > 1 )
>>>>>    - exec time = 17s
>>>>>    - *Result: 3105868*
>>>>>    - Same exec time if using AFFINITY_KEY index or "_key_PK_hash or
>>>>>    customer_id index
>>>>>    - Using an index on category_id increases the query time 33s
>>>>>
>>>>> Query2:
>>>>>
>>>>>    - Select COUNT(*) FROM( Select customer_id from GATABLE3 ga  use
>>>>>    index (PUBLIC."_key_PK") inner join table(category_id int = (9005,
175930,
>>>>>    175930, 175940,175945,101450, 6453)) cats on cats.category_id =
>>>>>    ga.category_id   group by customer_id having SUM(product_views_app)
> 2 OR
>>>>>    SUM(product_clicks_app) > 1 )
>>>>>    - exec time = 38s
>>>>>    - *Result: 3113921*
>>>>>    - Same exec time if using AFFINITY_KEY index or "_key_PK_hash or
>>>>>    customer_id index or category_id index
>>>>>    - Using an index on category_id doesnt change the run time
>>>>>
>>>>> Query plans are attached.
>>>>>
>>>>> 3 questions:
>>>>>
>>>>>    1. Why is the result differnt for the 2 queries - this is quite
>>>>>    concerning.
>>>>>    2. Why is the 2nd query taking longer
>>>>>    3. Why  category_id index doesn't work in case of query 2.
>>>>>
>>>>>
>>>>> On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <
>>>>> ilya.kasnacheev@gmail.com> wrote:
>>>>>
>>>>>> Hello!
>>>>>>
>>>>>> I don't think that we're able to use index with IN () clauses. Please
>>>>>> convert it into OR clauses.
>>>>>>
>>>>>> Please see
>>>>>> https://apacheignite-sql.readme.io/docs/performance-and-debugging#section-sql-performance-and-usability-considerations
>>>>>>
>>>>>> Regards,
>>>>>> --
>>>>>> Ilya Kasnacheev
>>>>>>
>>>>>>
>>>>>> пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <
>>>>>> andrey.mashenkov@gmail.com>:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> Actually, first query uses index on affinity key which looks
more
>>>>>>> efficient than index on category_id column.
>>>>>>> The first query can process groups one by one and stream partial
>>>>>>> results from map phase to reduce phase as it use sorted index
lookup,
>>>>>>> while second query should process full dataset on map phase before
>>>>>>> pass it for reducing.
>>>>>>>
>>>>>>> Try to use composite index (customer_id, category_id).
>>>>>>>
>>>>>>> Also, SqlQueryFields.setCollocated(true) flag can help Ignite
to
>>>>>>> build more efficient plan when group by on collocated column
is used.
>>>>>>>
>>>>>>> On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <
>>>>>>> eugene.miretsky@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> Schema:
>>>>>>>>
>>>>>>>>    -
>>>>>>>>
>>>>>>>>    PUBLIC.GATABLE2.CUSTOMER_ID
>>>>>>>>
>>>>>>>>    PUBLIC.GATABLE2.DT
>>>>>>>>
>>>>>>>>    PUBLIC.GATABLE2.CATEGORY_ID
>>>>>>>>
>>>>>>>>    PUBLIC.GATABLE2.VERTICAL_ID
>>>>>>>>
>>>>>>>>    PUBLIC.GATABLE2.SERVICE
>>>>>>>>
>>>>>>>>    PUBLIC.GATABLE2.PRODUCT_VIEWS_APP
>>>>>>>>
>>>>>>>>    PUBLIC.GATABLE2.PRODUCT_CLICKS_APP
>>>>>>>>
>>>>>>>>    PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB
>>>>>>>>
>>>>>>>>    PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB
>>>>>>>>
>>>>>>>>    PUBLIC.GATABLE2.PDP_SESSIONS_APP
>>>>>>>>
>>>>>>>>    PUBLIC.GATABLE2.PDP_SESSIONS_WEB
>>>>>>>>    - pkey = customer_id,dt
>>>>>>>>    - affinityKey = customer
>>>>>>>>
>>>>>>>> Query:
>>>>>>>>
>>>>>>>>    - select COUNT(*) FROM( Select customer_id from GATABLE2
where
>>>>>>>>    category_id in (175925, 101450, 9005, 175930, 175930,
175940,175945,101450,
>>>>>>>>    6453) group by customer_id having SUM(product_views_app)
> 2 OR
>>>>>>>>    SUM(product_clicks_app) > 1 )
>>>>>>>>
>>>>>>>> The table has 600M rows.
>>>>>>>> At first, the query took 1m, when we added an index on category_id
>>>>>>>> the query started taking 3m.
>>>>>>>>
>>>>>>>> The SQL execution plan for both queries is attached.
>>>>>>>>
>>>>>>>> We are using a single x1.16xlarge insntace with query parallelism
>>>>>>>> set to 32
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Eugene
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>> Andrey V. Mashenkov
>>>>>>>
>>>>>>

Mime
View raw message