asterixdb-dev mailing list archives

From Steven Jacobs <sjaco...@ucr.edu>
Subject Re: Metadata changes
Date Tue, 15 Dec 2015 22:40:10 GMT
Some new light to add to this discussion:

These metadata secondary "indexes" currently boil down to one single
use-case with three lookups:

When deleting a datatype:
1) confirm that it isn't used by any dataset
2) confirm that it isn't used by any other datatype
If both are true, then delete this type and
3) find and delete its subtypes
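
The three lookups above can be sketched as follows. This is a minimal in-memory model rather than AsterixDB code: `Metadata`, `Datatype`, `TypeInUseError`, and `drop_datatype` are hypothetical names made up for illustration, and the checks are plain scans instead of secondary-index lookups, on the assumption that the metadata is small.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Datatype:
    """Hypothetical stand-in for a datatype metadata record."""
    name: str
    # Names of other datatypes referenced by this type's fields.
    uses: set[str] = field(default_factory=set)
    # Names of nested subtypes that exist only as part of this type.
    subtypes: set[str] = field(default_factory=set)


@dataclass
class Metadata:
    """Hypothetical stand-in for the metadata store."""
    datatypes: dict[str, Datatype] = field(default_factory=dict)
    # Dataset name -> name of the datatype that the dataset stores.
    datasets: dict[str, str] = field(default_factory=dict)


class TypeInUseError(Exception):
    """Raised when a datatype cannot be dropped because something uses it."""


def drop_datatype(md: Metadata, name: str) -> None:
    if name not in md.datatypes:
        raise KeyError(f"unknown datatype: {name}")

    # Lookup 1: confirm that no dataset uses the type.
    dataset_users = sorted(ds for ds, t in md.datasets.items() if t == name)
    if dataset_users:
        raise TypeInUseError(
            f"cannot drop datatype {name}: used by dataset(s) {dataset_users}"
        )

    # Lookup 2: confirm that no other datatype uses the type.
    type_users = sorted(
        t.name for t in md.datatypes.values()
        if t.name != name and name in t.uses
    )
    if type_users:
        raise TypeInUseError(
            f"cannot drop datatype {name}: used by datatype(s) {type_users}"
        )

    # Lookup 3: delete the type, then find and delete its subtypes.
    dropped = md.datatypes.pop(name)
    for sub in dropped.subtypes:
        md.datatypes.pop(sub, None)
```

In this sketch both refusal cases raise before anything is modified, so a rejected drop leaves the metadata untouched and the error message names the dataset or datatype that blocks it.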

I discovered today that there isn't a single test in the test suite that
actually covers any of the three lookups.

Even worse, if you actually try to hit one of these checks, the code breaks
on master as follows:

A) Drop a datatype used by a dataset = throw confusing exception to the user
B) Drop a datatype used by another datatype = complete successfully, break
the metadata for future queries

These have been broken for at least two years without anyone coming across
them.

It is possible that the "indexes" could be helpful in querying the Metadata
in general (although they are being ignored now),
but my question is whether there is a large return on investment or not,
since:
I) Metadata is typically small
II) Metadata queries are atypical

Steven

On Tue, Dec 15, 2015 at 11:05 AM, Mike Carey <dtabass@gmail.com> wrote:

> Good point.  I wonder what the perf implications would be - probably
> minimal if these indexes aren't used during the query compilation path.
>
>
> On 12/14/15 6:04 PM, Ildar Absalyamov wrote:
>
>> I guess the main argument for 2 would be eliminating broken metadata
>> records prior to backwards compatibility cutoff.
>> The last thing we want is to be stuck with the wrong
>> implementation for compatibility reasons. Once the functionality needed for
>> 3 is there we can again introduce those indexes without building
>> sophisticated migration subsystem.
>>
>> On Dec 14, 2015, at 17:55, Mike Carey <dtabass@gmail.com> wrote:
>>>
>>> SO - it seems like 3 is the right long-term answer, but not doable now?
>>> (If it was doable now, it would obviously be the ideal choice of the
>>> three.)
>>> What would be the argument for doing 2 as opposed to 1 for now?
>>> As for the question of backwards compatibility, I actually didn't sense
>>> a consensus yet.
>>> I would tentatively lean towards "right" over "backwards compatible" for
>>> this change.
>>> What are others' thoughts on that?
>>> (Soon we won't have that luxury, but right now maybe we do?)
>>>
>>> On 12/14/15 3:43 PM, Steven Jacobs wrote:
>>>
>>>> We just had a UCR discussion on this topic. The issue is really with the
>>>> third "index" here. The code now is using one "index" to go in two
>>>> directions:
>>>> 1) To find datatypes that use datatype A
>>>> 2) To find datatypes that are used by datatype A.
>>>>
>>>> The way that it works now is hacked together, but designed for
>>>> performance.
>>>> So we have three choices here:
>>>>
>>>> 1) Stick to the status quo, and leave the "indexes" as they are
>>>> 2) Remove the Metadata secondary indexes, which will eliminate the hack
>>>> but cost some performance on Metadata
>>>> 3) Implement the Metadata secondary indexes correctly as Asterix
>>>> indexes. For this solution to work with our dataset designs, we will
>>>> need to have the ability to index homogeneous lists. In addition, we
>>>> will have reverse compatibility issues unless we plan things out for
>>>> the transition.
>>>>
>>>> What are the thoughts?
>>>>
>>>>
>>>> Orthogonally, it seems that the consensus for storing the datatype
>>>> dataverse in the dataset Metadata is to just add it as an open field at
>>>> least for now. Is that correct?
>>>>
>>>> Steven
>>>>
>>>>
>>>> On Mon, Dec 14, 2015 at 1:23 PM, Mike Carey <dtabass@gmail.com> wrote:
>>>>
>>>> Thoughts inlined:
>>>>>
>>>>> On 12/14/15 11:12 AM, Steven Jacobs wrote:
>>>>>
>>>>> Here are the conclusions that Ildar and I have drawn from looking at
>>>>>> the
>>>>>> secondary indexes:
>>>>>>
>>>>>> First of all it seems that datasets are local to node groups, but
>>>>>> dataverses can span node groups, which seems a little odd to me.
>>>>>>
>>>>>> Node groups are an undocumented but to-be-exploited-someday feature
>>>>> that
>>>>> allows datasets to be stored on less than all nodes in a given
>>>>> cluster.  As
>>>>> we face bigger clusters, we'll want to open up that possibility.  We
>>>>> will
>>>>> hopefully use them inside w/o having to make users manage them manually
>>>>> like parallel DB2 did/does.  Dataverses are really just a namespace
>>>>> thing,
>>>>> not a storage thing at all, so they are orthogonal to (and unrelated
>>>>> to)
>>>>> node groups.
>>>>>
>>>>> There are three Metadata secondary indexes:
>>>>>> GROUPNAME_ON_DATASET_INDEX,
>>>>>> DATATYPENAME_ON_DATASET_INDEX, DATATYPENAME_ON_DATATYPE_INDEX
>>>>>>
>>>>>> The first is used in only one case:
>>>>>> When dropping a node group, check if there are any datasets using this
>>>>>> node group. If so, don't allow the drop
>>>>>> BUT, this index has a field called "dataverse" which is not used at
>>>>>> all.
>>>>>>
>>>>>> This one seems like a waste of space since we do this almost never.
>>>>> (Not
>>>>> much space, but unnecessary.)  If we keep it it should become a proper
>>>>> index.
>>>>>
>>>>> The second is used when dropping a datatype. If there is a dataset
>>>>>> using
>>>>>> this datatype, don't allow the drop.
>>>>>> Similarly, this index has a "dataverse" which is never used.
>>>>>>
>>>>>> You're about to use the dataverse part, right?  :-)  This index seems
>>>>> like
>>>>> it will be useful but should be a proper index.
>>>>>
>>>>> The third index is used to go in two cases, using two different ideas
>>>>>> of
>>>>>> "keys"
>>>>>> It seems like this should actually be two different indexes.
>>>>>>
>>>>>> I don't think I understood this comment....
>>>>>
>>>>>
>>>>> This is my understanding so far. It would be good to discuss what the
>>>>>> "correct" version should be.
>>>>>> Steven
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Dec 14, 2015 at 10:12 AM, Steven Jacobs <sjaco002@ucr.edu>
>>>>>> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>>> I'm implementing a change so that datasets can use datatypes from
>>>>>>> alternate dataverses (previously the type and set had to be from the
>>>>>>> same dataverse). Unfortunately this means another change for Dataset
>>>>>>> Metadata (which will now store the dataverse for its type).
>>>>>>>
>>>>>>> As such, I had a couple of questions:
>>>>>>>
>>>>>>> 1) Should this change be thrown into the release branch, as it is
>>>>>>> another Metadata change?
>>>>>>>
>>>>>>> 2) In implementing this change, I've been looking at the Metadata
>>>>>>> secondary indexes. I had a discussion with Ildar, and it seems the
>>>>>>> thread on Metadata secondary indexes being "hacked" has been lost. Is
>>>>>>> this also something that should get into the release? Is there anyone
>>>>>>> currently looking at it?
>>>>>>>
>>>>>>> Steven
>>>>>>>
>>>>>>>
>> Best regards,
>> Ildar
>>
>>
>
