asterixdb-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Magnus Kongshem <>
Subject Re: Indexes not performing well
Date Tue, 31 May 2016 08:43:11 GMT
On Tue, May 31, 2016 at 6:25 AM, Pouria Pirzadeh <>

> Hi Magnus,
> Thanks for spending time to gather useful information and sharing them
> with us.
No problem, we both benefit from this.

> I guess the first potential performance pitfall that you may be falling
> into is the large selectivity of the filters you are using against indexed
> fields, as Sattam also pointed out.
> 1) With respect to the attached plans to the previous email, did you add
> the plans before removing indices or after that ? Because in the plan for
> Q2, I can still see that the hour index (which you mentioned you dropped)
> is being used at the bottom.
I added the plans before removing the hour and day index. I did however
test the queries again, where the result was that the query used the
"stamp" index instead and the execution time was still the same.

> 2) If it is not too painful for you, can you also try two sets of
> additional queries (their repose time could be helpful):
>    - Running a simple count query (without any filter) on the dataset (I
>    just wanna figure out what is the total time for a pure full scan).
>  Sure! The execution time is 267 seconds, interesting result. You will
find the optimized logical plan in the attachments.

>    - Running those queries that do not benefit from secondary indices
>    (those with similar times across different versions) without *any* index.
>    This basically can be done by simply creating a new dataset without any
>    secondary index, loading your data into it (you can either use bulk load to
>    load from raw data file(s), OR an "insert into" query to load from your
>    exiting dataset to the new one) and then running those queries against the
>    new dataset. I just wanna find out if ignoring all the indices makes a
>    difference or not.
>  I already did this. I created a data set containing the exact same data,
but without the indexes. In the pdf excerpt from my thesis, figure 6.5
shows the results. As the figure depicts, the queries do benefit greatly
from "ignoring" all indices. I re-attached the pdf in this email.



> On Mon, May 30, 2016 at 6:08 AM, Magnus Kongshem <>
> wrote:
>> Attached you will find:
>> - My DLL
>> - My queries and their optimized logical plan
>> From what I can interpret, the queries does not use any of the combined
>> indexes. And query number two, three and four uses the "hour" index which
>> is based on the hourOfDay field. As Sattam states, indexes based on this
>> field may not be optimal. I deleted that index as well as the "day" index
>> and re-ran the queries. The result was that the queries used the "stamp"
>> index instead with the same poor execution time.
>> I'm having some trouble creating a dataset with a filter. I keep getting:
>> type mismatch: missing a required closed field uid:uuid [IllegalStateException]
>> use dataverse bigd;
>>     create type table2 as closed {
>> uid: uuid,
>>         campus: string,
>> building: string,
>> floor: string,
>>         timestamp: int32,
>> dayOfWeek: int32,
>> hourOfDay: int32,
>>         latitude: double,
>>         salt_timestamp: int32,
>>         longitude: double,
>>         id: string,
>> accuracy: double
>>     }
>> create dataset posdata3(table2)
>> primary key uid autogenerated with filter on timestamp;
>> load dataset posdata3 using localfs
>> (("path"="path/to/file/all.adm"),("format"="adm"));
>> BG,
>> Magnus
>> On Mon, May 30, 2016 at 10:01 AM, Magnus Kongshem <>
>> wrote:
>>> Thank you for your replies. I will test your thoughts by dropping the
>>> hourOfDay and dayOfWeek indexes, send you the optimized query plans and
>>> utilize the AsterixDB filters.
>>> I'll be back.
>>> BG,
>>> Magnus
>>> On Sun, May 29, 2016 at 8:32 AM, Sattam Alsubaiee <>
>>> wrote:
>>>> Creating indexes on fields with high selectivities (such as hourOfDay
>>>> and dayOfWeek) are not encouraged at all. Each secondary index lookup will
>>>> have to probe the primary index to fetch other fields in the record. It
>>>> would be much more efficient if you just perform scans as opposed of
>>>> accessing secondary indexes when querying such fields.
>>>> I would recommend that you drop at least the following indexes:
>>>> drop index posdata.hour;
>>>> drop index;
>>>> Also I would highly recommend that you utilize AsterixDB filters, which
>>>> is very good optimization (could save up to 99% of query time) when you
>>>> deal with time-correlated fields such as timestamps:
>>>> Cheers,
>>>> Sattam
>>>> On Sun, May 29, 2016 at 8:58 AM, Michael Carey <>
>>>> wrote:
>>>>> @Pouria: Please share your findings here when you check this out -
>>>>> this is quite strange, since none of the other performance results that
>>>>> have been obtained on the system have looked anything like this.  (I
>>>>> try to look at this too at some point, but will unfortunately be MIA
>>>>> June 1-15 first.)  Weird....
>>>>> On 5/26/16 9:20 AM, Pouria Pirzadeh wrote:
>>>>> Hi Magnus,
>>>>> Thanks for your email and sharing the information.
>>>>> If it is Ok with you, Would you please share with us the exact DDL
>>>>> (including type definitions, dataset and index definitions) and exact
>>>>> queries that you ran against AsterixDB ?
>>>>> I am just interested in checking the query plans and see what ended up
>>>>> being run as jobs.
>>>>> Thanks.
>>>>> Pouria
>>>>> On Thu, May 26, 2016 at 4:59 AM, Magnus Kongshem <
>>>>>> wrote:
>>>>>> Hi,
>>>>>> There has been a lot of questions from me regarding AsterixDB and
>>>>>> thank all of you who have answered me. So it is time for me to contribute
>>>>>> with some obeservations. I am writing my master thesis where I test
>>>>>> multiple databases on a large data set. I should also mention that
I have
>>>>>> installed AsterixDB on a single machine.
>>>>>> What I have observed is that asterixDB has a "poorer" read
>>>>>> performance when I specify indexes on the data set compared to not
>>>>>> implementing any indexes. See the attachment for details, its an
excerpt of
>>>>>> my thesis explaining and describing the queries, the indexes and
the test
>>>>>> results. Any thoughts on these test results?
>>>>>> I also cannot help to notice that the read performance for a query
>>>>>> querying a small portion, medium portion and large portion of the
data set
>>>>>> is very similar. The largest query finds 75 million records and the
>>>>>> smallest query finds 3.5 million records, but almost have the same
>>>>>> performance. How can this be?
>>>>>> Perhaps you can use these test results in the future development
>>>>>> asterixDB.
>>>>>> I you would like, I can send you my final thesis when it's done.
>>>>>> --
>>>>>> Mvh
>>>>>> Magnus Alderslyst Kongshem
>>>>>> +47 415 65 906 <%2B47%20415%2065%20906>

View raw message