asterixdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Taewoo Kim <wangs...@gmail.com>
Subject Re: Indexes not performing well
Date Tue, 31 May 2016 05:11:56 GMT
I just searched the related file in the codebase. It looks like the hint
that Mike referred is "skip-index" hint. And it looks like it only applies
to any secondary index. @Mike, is this what you are looking for?

An example (skip-secondary-btree-index-2.aql) -

create type TestType as open {

    id : int32,

    fname : string,

    lname : string

}


create dataset testdst(TestType) primary key id;


create index sec_Idx on testdst(fname);


for $emp in dataset('testdst')

where $emp.fname /*+ skip-index */ >= "Max" and $emp.fname <= "Roger"

return $emp

Best,
Taewoo

On Mon, May 30, 2016 at 10:03 PM, Michael Carey <mjcarey@ics.uci.edu> wrote:

> Guys,
>
> Reminder:  There's also an option to give a hint in the query to ignore a
> given index (you associate the hint with the predicate), which might be
> simpler for query re-execution sans indices.  (Can someone on the dev list
> add a pointer to the place where this is documented, albeit poorly in all
> likelihood?  I couldn't find it in a quick search just now, but I know it's
> hiding there...)
>
> Cheers,
> Mike
>
> On 5/30/16 9:25 PM, Pouria Pirzadeh wrote:
>
> Hi Magnus,
>
> Thanks for spending time to gather useful information and sharing them
> with us.
> I guess the first potential performance pitfall that you may be falling
> into is the large selectivity of the filters you are using against indexed
> fields, as Sattam also pointed out.
>
> 1) With respect to the attached plans to the previous email, did you add
> the plans before removing indices or after that ? Because in the plan for
> Q2, I can still see that the hour index (which you mentioned you dropped)
> is being used at the bottom.
>
> 2) If it is not too painful for you, can you also try two sets of
> additional queries (their repose time could be helpful):
>
>    - Running a simple count query (without any filter) on the dataset (I
>    just wanna figure out what is the total time for a pure full scan).
>    - Running those queries that do not benefit from secondary indices
>    (those with similar times across different versions) without *any* index.
>    This basically can be done by simply creating a new dataset without any
>    secondary index, loading your data into it (you can either use bulk load to
>    load from raw data file(s), OR an "insert into" query to load from your
>    exiting dataset to the new one) and then running those queries against the
>    new dataset. I just wanna find out if ignoring all the indices makes a
>    difference or not.
>
>
> Thanks.
> Pouria
>
>
> On Mon, May 30, 2016 at 6:08 AM, Magnus Kongshem <kongshem@stud.ntnu.no>
> wrote:
>
>> Attached you will find:
>>
>> - My DLL
>> - My queries and their optimized logical plan
>>
>> From what I can interpret, the queries does not use any of the combined
>> indexes. And query number two, three and four uses the "hour" index which
>> is based on the hourOfDay field. As Sattam states, indexes based on this
>> field may not be optimal. I deleted that index as well as the "day" index
>> and re-ran the queries. The result was that the queries used the "stamp"
>> index instead with the same poor execution time.
>>
>> I'm having some trouble creating a dataset with a filter. I keep getting:
>>
>> type mismatch: missing a required closed field uid:uuid [IllegalStateException]
>>
>> use dataverse bigd;
>>
>>     create type table2 as closed {
>> uid: uuid,
>>         campus: string,
>> building: string,
>> floor: string,
>>         timestamp: int32,
>> dayOfWeek: int32,
>> hourOfDay: int32,
>>         latitude: double,
>>         salt_timestamp: int32,
>>         longitude: double,
>>         id: string,
>> accuracy: double
>>     }
>> create dataset posdata3(table2)
>> primary key uid autogenerated with filter on timestamp;
>>
>> load dataset posdata3 using localfs
>> (("path"="path/to/file/all.adm"),("format"="adm"));
>>
>> BG,
>> Magnus
>>
>> On Mon, May 30, 2016 at 10:01 AM, Magnus Kongshem <
>> <kongshem@stud.ntnu.no>kongshem@stud.ntnu.no> wrote:
>>
>>> Thank you for your replies. I will test your thoughts by dropping the
>>> hourOfDay and dayOfWeek indexes, send you the optimized query plans and
>>> utilize the AsterixDB filters.
>>>
>>> I'll be back.
>>>
>>> BG,
>>> Magnus
>>>
>>> On Sun, May 29, 2016 at 8:32 AM, Sattam Alsubaiee <
>>> <salsubaiee@gmail.com>salsubaiee@gmail.com> wrote:
>>>
>>>> Creating indexes on fields with high selectivities (such as hourOfDay
>>>> and dayOfWeek) are not encouraged at all. Each secondary index lookup will
>>>> have to probe the primary index to fetch other fields in the record. It
>>>> would be much more efficient if you just perform scans as opposed of
>>>> accessing secondary indexes when querying such fields.
>>>>
>>>> I would recommend that you drop at least the following indexes:
>>>> drop index posdata.hour;
>>>> drop index posdata.day;
>>>>
>>>> Also I would highly recommend that you utilize AsterixDB filters, which
>>>> is very good optimization (could save up to 99% of query time) when you
>>>> deal with time-correlated fields such as timestamps:
>>>> <https://asterixdb.apache.org/docs/0.8.8-incubating/aql/filters.html>
>>>> https://asterixdb.apache.org/docs/0.8.8-incubating/aql/filters.html
>>>> <http://dl.acm.org/citation.cfm?id=2786007>
>>>> http://dl.acm.org/citation.cfm?id=2786007
>>>>
>>>> Cheers,
>>>> Sattam
>>>>
>>>> On Sun, May 29, 2016 at 8:58 AM, Michael Carey < <mjcarey@ics.uci.edu>
>>>> mjcarey@ics.uci.edu> wrote:
>>>>
>>>>> @Pouria: Please share your findings here when you check this out -
>>>>> this is quite strange, since none of the other performance results that
>>>>> have been obtained on the system have looked anything like this.  (I
will
>>>>> try to look at this too at some point, but will unfortunately be MIA
from
>>>>> June 1-15 first.)  Weird....
>>>>>
>>>>> On 5/26/16 9:20 AM, Pouria Pirzadeh wrote:
>>>>>
>>>>> Hi Magnus,
>>>>>
>>>>> Thanks for your email and sharing the information.
>>>>> If it is Ok with you, Would you please share with us the exact DDL
>>>>> (including type definitions, dataset and index definitions) and exact
AQL
>>>>> queries that you ran against AsterixDB ?
>>>>> I am just interested in checking the query plans and see what ended up
>>>>> being run as jobs.
>>>>>
>>>>> Thanks.
>>>>> Pouria
>>>>>
>>>>> On Thu, May 26, 2016 at 4:59 AM, Magnus Kongshem <
>>>>> <kongshem@online.ntnu.no>kongshem@online.ntnu.no> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> There has been a lot of questions from me regarding AsterixDB and
I
>>>>>> thank all of you who have answered me. So it is time for me to contribute
>>>>>> with some obeservations. I am writing my master thesis where I test
>>>>>> multiple databases on a large data set. I should also mention that
I have
>>>>>> installed AsterixDB on a single machine.
>>>>>>
>>>>>> What I have observed is that asterixDB has a "poorer" read
>>>>>> performance when I specify indexes on the data set compared to not
>>>>>> implementing any indexes. See the attachment for details, its an
excerpt of
>>>>>> my thesis explaining and describing the queries, the indexes and
the test
>>>>>> results. Any thoughts on these test results?
>>>>>>
>>>>>> I also cannot help to notice that the read performance for a query
>>>>>> querying a small portion, medium portion and large portion of the
data set
>>>>>> is very similar. The largest query finds 75 million records and the
>>>>>> smallest query finds 3.5 million records, but almost have the same
read
>>>>>> performance. How can this be?
>>>>>>
>>>>>> Perhaps you can use these test results in the future development
of
>>>>>> asterixDB.
>>>>>>
>>>>>> I you would like, I can send you my final thesis when it's done.
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Mvh
>>>>>>
>>>>>> Magnus Alderslyst Kongshem
>>>>>> +47 415 65 906
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Mvh
>>>
>>> Magnus Alderslyst Kongshem
>>> Seniorkomiteen
>>> Online, linjeforeningen for informatikk
>>> +47 415 65 906
>>>
>>
>>
>>
>> --
>>
>> Mvh
>>
>> Magnus Alderslyst Kongshem
>> Seniorkomiteen
>> Online, linjeforeningen for informatikk
>> +47 415 65 906
>>
>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message