ignite-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From eugene miretsky <eugene.miret...@gmail.com>
Subject Re: Query 3x slower with index
Date Thu, 11 Oct 2018 17:56:35 GMT
Thanks!

Could you please clarfiy "*In case of a composite index, it will apply the
columns one by one"? *

Igntie (or rather H2?) needs to load the data into heap in order to do the
groupBy & aggregations. We were hoping that only data that matches the
category filter will be loaded.
*What does one by one mean when: (assuming and index *(customer_id,
category_id)*) *

   1. *The fiilter is on both customer  and category. What data will be
   loaded into Heap?*
   2. *The fitler is only on **category, and the customer is just used for
   groupBy. Will Ignite*
      1. * load one customer with all the rows, and apply the category
      filter in heap*
      2.  *load one customer, but load only the rows that pass the category
      fitler in heap*
      3. *load all the events that pass the category filter, and then group
      them by customer. *

*From out benchmarking so far it seems like 1 is happening. *

On Thu, Oct 11, 2018 at 1:28 PM Stanislav Lukyanov <stanlukyanov@gmail.com>
wrote:

> Hi,
>
>
>
> It is a rather lengthy thread and I can’t dive into details right now,
>
> but AFAICS the issue now is making affinity key index to work with a
> secondary index.
>
> The important things to understand is
>
>    1. Ignite will only use one index per table
>    2. In case of a composite index, it will apply the columns one by one
>    3. The affinity key index should always go first as the first step is
>    splitting the query by affinity key values
>
>
>
> So, to use index over the affinity key (customer_id) and a secondary index
> (category_id) one needs to create an index
>
> like (customer_id, category_id), in that order, with no columns in between.
>
> Note that index (customer_id, dt, category_id) can’t be used instead of it.
>
> On the other hand, (customer_id, category_id, dt) can - the last part of
> the index will be left unused.
>
>
>
> Thanks,
>
> Stan
>
>
>
> *From: *eugene miretsky <eugene.miretsky@gmail.com>
> *Sent: *9 октября 2018 г. 19:40
> *To: *user@ignite.apache.org
> *Subject: *Re: Query 3x slower with index
>
>
>
> Hi Ilya,
>
>
>
> I have tried it, and got the same performance as forcing using category
> index in my initial benchmark - query is 3x slowers and uses only one
> thread.
>
>
>
> From my experiments so far it seems like Ignite can either (a) use
> affinity key and run queries in parallel, (b) use index but run the query
> on only one thread.
>
>
>
> Has anybody been able to run OLAP like queries in while using an index?
>
>
>
> Cheers,
>
> Eugene
>
>
>
> On Mon, Sep 24, 2018 at 10:55 AM Ilya Kasnacheev <
> ilya.kasnacheev@gmail.com> wrote:
>
> Hello!
>
>
>
> I guess that using AFFINITY_KEY as index have something to do with the
> fact that GROUP BY really wants to work per-partition.
>
>
>
> I have the following query for you:
>
>
>
> 1: jdbc:ignite:thin://localhost> explain Select count(*) FROM( Select
> customer_id from (Select customer_id, product_views_app, product_clict_app
> from GA_DATA ga join table(category_id int = ( 117930, 175930,
> 175940,175945,101450)) cats on cats.category_id = ga.category_id) data
> group by customer_id having SUM(product_views_app) > 2 OR
> SUM(product_clict_app) > 1);
> PLAN  SELECT
>     DATA__Z2.CUSTOMER_ID AS __C0_0,
>     SUM(DATA__Z2.PRODUCT_VIEWS_APP) AS __C0_1,
>     SUM(DATA__Z2.PRODUCT_CLICT_APP) AS __C0_2
> FROM (
>     SELECT
>         GA__Z0.CUSTOMER_ID,
>         GA__Z0.PRODUCT_VIEWS_APP,
>         GA__Z0.PRODUCT_CLICT_APP
>     FROM TABLE(CATEGORY_ID INTEGER=(117930, 175930, 175940, 175945,
> 101450)) CATS__Z1
>     INNER JOIN PUBLIC.GA_DATA GA__Z0
>         ON 1=1
>     WHERE CATS__Z1.CATEGORY_ID = GA__Z0.CATEGORY_ID
> ) DATA__Z2
>     /* SELECT
>         GA__Z0.CUSTOMER_ID,
>         GA__Z0.PRODUCT_VIEWS_APP,
>         GA__Z0.PRODUCT_CLICT_APP
>     FROM TABLE(CATEGORY_ID INTEGER=(117930, 175930, 175940, 175945,
> 101450)) CATS__Z1
>         /++ function ++/
>     INNER JOIN PUBLIC.GA_DATA GA__Z0
>         /++ PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = CATS__Z1.CATEGORY_ID ++/
>         ON 1=1
>     WHERE CATS__Z1.CATEGORY_ID = GA__Z0.CATEGORY_ID
>      */
> GROUP BY DATA__Z2.CUSTOMER_ID
>
> PLAN  SELECT
>     COUNT(*)
> FROM (
>     SELECT
>         __C0_0 AS CUSTOMER_ID
>     FROM PUBLIC.__T0
>     GROUP BY __C0_0
>     HAVING (SUM(__C0_1) > 2)
>         OR (SUM(__C0_2) > 1)
> ) _18__Z3
>     /* SELECT
>         __C0_0 AS CUSTOMER_ID
>     FROM PUBLIC.__T0
>         /++ PUBLIC."merge_scan" ++/
>     GROUP BY __C0_0
>     HAVING (SUM(__C0_1) > 2)
>         OR (SUM(__C0_2) > 1)
>      */
>
>
>
> However, I'm not sure it is "optimal" or not since I have no idea if it
> will perform better or worse on real data. That's why I need a subset of
> data which will make query execution speed readily visible. Unfortunately,
> I can't deduce that from query plan alone.
>
>
>
> Regards,
>
> --
>
> Ilya Kasnacheev
>
>
>
>
>
> пн, 24 сент. 2018 г. в 16:14, eugene miretsky <eugene.miretsky@gmail.com>:
>
> An easy way to reproduce would be to
>
>
>
> 1. Create table
>
> CREATE TABLE GA_DATA (
>
>     customer_id bigint,
>
>     dt timestamp,
>
>     category_id int,
>
>     product_views_app int,
>
>     product_clict_app int,
>
>     product_clict_web int,
>
>     product_clict_web int,
>
>     PRIMARY KEY (customer_id, dt, category_id)
>
> ) WITH "template=ga_template, backups=0, affinityKey=customer_id";
>
>
>
> 2. Create indexes
>
> ·         CREATE INDEX ga_customer_id ON GA_Data (customer_id)
>
> ·         CREATE INDEX ga_pKey ON GA_Data (customer_id, dt, category_id)
>
> ·         CREATE INDEX ga_category_and_customer_id ON GA_Data
> (category_id, customer_id)
>
> ·         CREATE INDEX ga_category_id ON GA_Data (category_id)
>
> 3. Run Explain on the following queries while trying forcing using
> different indexes
>
> ·         Select count(*) FROM(
>
> Select customer_id from GA_DATA  use index (ga_category_id)
>
> where category_id in (117930, 175930, 175940,175945,101450)
>
> group by customer_id having SUM(product_views_app) > 2 OR
> SUM(product_clicks_app) > 1 )
>
>
>
> ·         Select count(*) FROM(
>
>     Select customer_id from GA_DATA ga use index (ga_pKey)
>
>     join table(category_id int = ( 117930, 175930, 175940,175945,101450))
> cats on cats.category_id = ga.category_id
>
>     group by customer_id having SUM(product_views_app) > 2 OR
> SUM(product_clicks_app) > 1
>
> )
>
>
>
> The execution plans will be similar to what I have posted earler. In
> particular, only on of (a) affinty key index, (b) category_id index will be
> used.
>
>
>
> On Fri, Sep 21, 2018 at 8:49 AM Ilya Kasnacheev <ilya.kasnacheev@gmail.com>
> wrote:
>
> Hello!
>
>
>
> Can you share a reproducer project which loads (or generates) data for
> caches and then queries them? I could try and debug it if I had the
> reproducer.
>
>
>
> Regards.
>
> --
>
> Ilya Kasnacheev
>
>
>
>
>
> чт, 20 сент. 2018 г. в 21:05, eugene miretsky <eugene.miretsky@gmail.com>:
>
> Thanks Ilya,
>
>
>
> Tried it, no luck. It performs the same as when using category_id index
> alone (slow).
>
>   Any combindation I try either uses AFFINITY_KEY or category index. When
> it uses category index it runs slowers.
>
>
>
> Also, when AFFINITY_KEY key is used, the jobs runs on 32 threads (my query
> parallelism settings ) when category_id is used, the jobs runs on one
> thread most of the time (first few seconds it looks like more threads are
> doing work).
>
>
>
> Please help on this. It seems like a very simple use case (using affinity
> key and another index), either I am doing something extremly silly, or I
> stumbled on a bug in Ignite that's effecting a lot of people.
>
>
>
> Cheers,
>
> Eugene
>
>
>
> On Thu, Sep 20, 2018 at 6:22 AM Ilya Kasnacheev <ilya.kasnacheev@gmail.com>
> wrote:
>
> Hello!
>
>
>
> > 2) ga_customer_and_category_id: on customer_id and category_id
>
>
>
> Have you tried to do an index on category_id first, customer_id second?
> Note that Ignite will use only one index when joining two tables and that
> in your case it should start with category_id.
>
>
>
> You can also try adding affinity key to this index in various places, see
> if it helps further.
>
>
>
> Regards,
>
> --
>
> Ilya Kasnacheev
>
>
>
>
>
> ср, 19 сент. 2018 г. в 21:27, eugene miretsky <eugene.miretsky@gmail.com>:
>
> Hi Ilya,
>
>
>
> I created 4 indexs on the table:
>
> 1) ga_pKey: on customer_id, dt, category_id (that's our primary key
> columns)
>
> 2) ga_customer_and_category_id: on customer_id and category_id
>
> 2) ga_customer_id: on customer_id
>
> 4) ga_category_id: on category_id
>
>
>
>
>
> For the first query (category in ()), the execution plan when using the
> first 3 index is exactly the same  - using /* PUBLIC.AFFINITY_KEY */
>
> When using #4 (alone or in combination with any of the other 3)
>
> 1.       /* PUBLIC.AFFINITY_KEY */ is replaced with  /*
> PUBLIC.GA_CATEGORY_ID: CATEGORY_ID IN(117930, 175930, 175940, 175945,
> 101450) */
>
> 2.       The query runs slower.
>
> For the second query (join on an inlined table) the behaviour is very
> similar. Using the first 3 indexes results in the same plan - using  /*
> PUBLIC.AFFINITY_KEY */ and  /* function: CATEGORY_ID = GA__Z0.CATEGORY_ID
> */.
>
> When using #4 (alone or in combination with any of the other 3)
>
> 1.       /* function */ and /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID =
> CATS__Z1.CATEGORY_ID */ are used
>
> 2.       The query is much slower.
>
>
>
> Theoretically the query seems pretty simple
>
> 1.       Use affinity key  to make sure the query runs in parallel and
> there are no shuffles
>
> 2.       Filter rows that match category_id using the category_id index
>
> 3.       Used customer_id index for the group_by (not sure if this step
> makes sense)
>
> But I cannot get it to work.
>
>
>
> Cheers,
>
> Eugene
>
>
>
>
>
>
>
>
>
> On Tue, Sep 18, 2018 at 10:56 AM Ilya Kasnacheev <
> ilya.kasnacheev@gmail.com> wrote:
>
> Hello!
>
>
>
> I can see you try to use _key_PK as index. If your primary key is
> composite, it won't work properly for you. I recommend creating an explicit
> (category_id, customer_id) index.
>
>
>
> Regards,
>
> --
>
> Ilya Kasnacheev
>
>
>
>
>
> вт, 18 сент. 2018 г. в 17:47, eugene miretsky <eugene.miretsky@gmail.com>:
>
> Hi Ilya,
>
>
>
> The different query result was my mistake - one of the categoy_ids was
> duplicate, so in the query that used join, it counted rows for that
> category twice. My apologies.
>
>
>
> However, we are still having an issue with query time, and the index not
> being applied to category_id. Would appreciate if you could take a look.
>
>
>
> Cheers,
>
> Eugene
>
>
>
> On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev <ilya.kasnacheev@gmail.com>
> wrote:
>
> Hello!
>
>
>
> Why don't you diff the results of those two queries, tell us what the
> difference is?
>
>
>
> Regards,
>
> --
>
> Ilya Kasnacheev
>
>
>
>
>
> пн, 17 сент. 2018 г. в 16:08, eugene miretsky <eugene.miretsky@gmail.com>:
>
> Hello,
>
>
>
> Just wanted to see if anybody had time to look into this.
>
>
>
> Cheers,
>
> Eugene
>
>
>
> On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <eugene.miretsky@gmail.com>
> wrote:
>
> Thanks!
>
>
>
> Tried joining with an inlined table instead of IN as per the second
> suggestion, and it didn't quite work.
>
>
>
> Query1:
>
> ·         Select COUNT(*) FROM( Select customer_id from GATABLE3  use
> Index( ) where category_id in (9005, 175930, 175930, 175940,175945,101450,
> 6453) group by customer_id having SUM(product_views_app) > 2 OR
> SUM(product_clicks_app) > 1 )
>
> ·         exec time = 17s
>
> ·         *Result: 3105868*
>
> ·         Same exec time if using AFFINITY_KEY index or "_key_PK_hash or
> customer_id index
>
> ·         Using an index on category_id increases the query time 33s
>
> Query2:
>
> ·         Select COUNT(*) FROM( Select customer_id from GATABLE3 ga  use
> index (PUBLIC."_key_PK") inner join table(category_id int = (9005, 175930,
> 175930, 175940,175945,101450, 6453)) cats on cats.category_id =
> ga.category_id   group by customer_id having SUM(product_views_app) > 2 OR
> SUM(product_clicks_app) > 1 )
>
> ·         exec time = 38s
>
> ·         *Result: 3113921*
>
> ·         Same exec time if using AFFINITY_KEY index or "_key_PK_hash or
> customer_id index or category_id index
>
> ·         Using an index on category_id doesnt change the run time
>
> Query plans are attached.
>
>
>
> 3 questions:
>
> 1.       Why is the result differnt for the 2 queries - this is quite
> concerning.
>
> 2.       Why is the 2nd query taking longer
>
> 3.       Why  category_id index doesn't work in case of query 2.
>
>
>
> On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <ilya.kasnacheev@gmail.com>
> wrote:
>
> Hello!
>
>
>
> I don't think that we're able to use index with IN () clauses. Please
> convert it into OR clauses.
>
>
>
> Please see
> https://apacheignite-sql.readme.io/docs/performance-and-debugging#section-sql-performance-and-usability-considerations
>
>
>
> Regards,
>
> --
>
> Ilya Kasnacheev
>
>
>
>
>
> пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <andrey.mashenkov@gmail.com
> >:
>
> Hi
>
>
>
> Actually, first query uses index on affinity key which looks more
> efficient than index on category_id column.
>
> The first query can process groups one by one and stream partial results
> from map phase to reduce phase as it use sorted index lookup,
>
> while second query should process full dataset on map phase before pass it
> for reducing.
>
>
>
> Try to use composite index (customer_id, category_id).
>
>
>
> Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build
> more efficient plan when group by on collocated column is used.
>
>
>
> On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <eugene.miretsky@gmail.com>
> wrote:
>
> Hello,
>
>
>
> Schema:
>
> ·
>
> PUBLIC.GATABLE2.CUSTOMER_ID
>
> PUBLIC.GATABLE2.DT
>
> PUBLIC.GATABLE2.CATEGORY_ID
>
> PUBLIC.GATABLE2.VERTICAL_ID
>
> PUBLIC.GATABLE2.SERVICE
>
> PUBLIC.GATABLE2.PRODUCT_VIEWS_APP
>
> PUBLIC.GATABLE2.PRODUCT_CLICKS_APP
>
> PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB
>
> PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB
>
> PUBLIC.GATABLE2.PDP_SESSIONS_APP
>
> PUBLIC.GATABLE2.PDP_SESSIONS_WEB
>
> ·         pkey = customer_id,dt
>
> ·         affinityKey = customer
>
> Query:
>
> ·         select COUNT(*) FROM( Select customer_id from GATABLE2 where
> category_id in (175925, 101450, 9005, 175930, 175930, 175940,175945,101450,
> 6453) group by customer_id having SUM(product_views_app) > 2 OR
> SUM(product_clicks_app) > 1 )
>
> The table has 600M rows.
>
> At first, the query took 1m, when we added an index on category_id the
> query started taking 3m.
>
>
>
> The SQL execution plan for both queries is attached.
>
>
>
> We are using a single x1.16xlarge insntace with query parallelism set to
> 32
>
>
>
> Cheers,
>
> Eugene
>
>
>
>
>
>
> --
>
> Best regards,
> Andrey V. Mashenkov
>
>
>

Mime
View raw message