asterixdb-dev mailing list archives

From Mike Carey <dtab...@gmail.com>
Subject Re: Metadata changes
Date Tue, 15 Dec 2015 23:42:13 GMT
PS - I think we also want to change the record format so that individual 
records know exactly how many "closed fields" they have (even if they 
don't know exactly what they are :-)).
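
A minimal sketch of what such a record layout could look like. Everything here is an assumption for illustration only: the 2-byte count header, the length-prefixed fields, and the function names are hypothetical and are not AsterixDB's actual record format. The point is only that a reader can learn how many closed fields a record carries from the record itself, without knowing what those fields are.

```python
import struct

# Hypothetical header: a 2-byte big-endian count of closed fields.
HEADER = struct.Struct(">H")


def encode_record(closed_values, open_fields):
    """Serialize closed field values (list of bytes) and open fields (name -> bytes)."""
    out = bytearray(HEADER.pack(len(closed_values)))
    for value in closed_values:
        # Closed fields carry no name, only a length-prefixed value.
        out += struct.pack(">I", len(value)) + value
    for name, value in open_fields.items():
        encoded_name = name.encode("utf-8")
        # Open fields are self-describing: name first, then the value.
        out += struct.pack(">H", len(encoded_name)) + encoded_name
        out += struct.pack(">I", len(value)) + value
    return bytes(out)


def decode_record(buf):
    """Return (closed_values, open_fields) using only the count stored in the record."""
    (closed_count,) = HEADER.unpack_from(buf, 0)
    pos = HEADER.size
    closed = []
    for _ in range(closed_count):
        (length,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        closed.append(buf[pos:pos + length])
        pos += length
    open_fields = {}
    while pos < len(buf):
        (name_len,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        name = buf[pos:pos + name_len].decode("utf-8")
        pos += name_len
        (length,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        open_fields[name] = buf[pos:pos + length]
        pos += length
    return closed, open_fields
```

With a count in each record, a record written when the type had two closed fields stays readable after a third closed field is added to the type, because the reader never has to guess where the closed part ends.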

On 12/14/15 7:09 PM, Till Westmann wrote:
> On 14 Dec 2015, at 18:55, Murtadha Hubail wrote:
>
>> I think the backward compatibility discussion goes beyond metadata 
>> indexes and a complete plan that considers everything in storage 
>> should be developed to support upgrading and patching. Just as an 
>> example when we did the repacking from edu.uci to org.apache, all 
>> existing instances on edu.uci wouldn’t work on new binaries due to 
>> Java serialization on edu.uci classes.
>
> Good point. Do you know if we fixed that or did we just leave it as-is?
>
>> Having said that, I would go with the right long term solution for 
>> metadata indexes which would’ve been a result of the backward 
>> compatibility plan if we had one.
>
> I tend to agree here. I think that we’ll need a backwards 
> compatibility story, even if we choose to be schema-less for all 
> metadata.
> 1) Even if the metadata is all flexible, we’ll be able to read the old 
> metadata, but we’ll need to keep code around to read all versions of 
> the metadata.
> 2) If we need to change the file format for the data we’ll also need a 
> way to realize that (and that would probably affect the metadata as 
> well).
>
> I think that it might be a good start to add version identifiers to 
> persisted data structures, so that we’d at least be able to 
> distinguish different versions (and potentially have the ability to 
> provide some migration, if needed).
>
> Thoughts?
>
> Cheers,
> Till
>
>>> On Dec 14, 2015, at 6:19 PM, Ildar Absalyamov 
>>> <ildar.absalyamov@gmail.com> wrote:
>>>
>>> As for the general topic of backwards compatibility, I think going 
>>> “fully open” might be the best long-term solution.
>>> The topic of changing metadata keeps reappearing once in a while, and 
>>> there is no guarantee it will not strike back again. Opening up the 
>>> metadata would free us from the burden of producing migration 
>>> tools and shipping them with each new version of the binaries that 
>>> has a revised catalog.
>>> The performance (mainly storage) impact of that solution will be 
>>> tolerable, especially considering how little data is usually stored in 
>>> metadata.
>>> Moreover, being big proponents of semi-structured data, it does make 
>>> perfect sense for us to eat our own dog food here.
>>>
>>>> On Dec 14, 2015, at 18:04, Ildar Absalyamov 
>>>> <ildar.absalyamov@gmail.com> wrote:
>>>>
>>>> I guess the main argument for 2 would be eliminating broken 
>>>> metadata records prior to the backwards-compatibility cutoff.
>>>> The last thing we want is to be stuck with the wrong 
>>>> implementation for compatibility reasons. Once the functionality 
>>>> needed for 3 is there, we can introduce those indexes again without 
>>>> building a sophisticated migration subsystem.
>>>>
>>>>> On Dec 14, 2015, at 17:55, Mike Carey <dtabass@gmail.com> wrote:
>>>>>
>>>>> SO - it seems like 3 is the right long-term answer, but not doable 
>>>>> now?
>>>>> (If it was doable now, it would obviously be the ideal choice of 
>>>>> the three.)
>>>>> What would be the argument for doing 2 as opposed to 1 for now?
>>>>> As for the question of backwards compatibility, I actually didn't 
>>>>> sense a consensus yet.
>>>>> I would tentatively lean towards "right" over "backwards 
>>>>> compatible" for this change.
>>>>> What are others' thoughts on that?
>>>>> (Soon we won't have that luxury, but right now maybe we do?)
>>>>>
>>>>> On 12/14/15 3:43 PM, Steven Jacobs wrote:
>>>>>> We just had a UCR discussion on this topic. The issue is really
>>>>>> with the third "index" here. The code now is using one "index" to
>>>>>> go in two directions:
>>>>>> 1) To find datatypes that use datatype A
>>>>>> 2) To find datatypes that are used by datatype A.
>>>>>>
>>>>>> The way that it works now is hacked together, but designed for
>>>>>> performance.
>>>>>> So we have three choices here:
>>>>>>
>>>>>> 1) Stick to the status quo, and leave the "indexes" as they are
>>>>>> 2) Remove the Metadata secondary indexes, which will eliminate the
>>>>>> hack but cost some performance on Metadata
>>>>>> 3) Implement the Metadata secondary indexes correctly as Asterix
>>>>>> indexes. For this solution to work with our dataset designs, we
>>>>>> will need to have the ability to index homogeneous lists. In
>>>>>> addition, we will have reverse compatibility issues unless we plan
>>>>>> things out for the transition.
>>>>>>
>>>>>> What are the thoughts?
>>>>>>
>>>>>>
>>>>>> Orthogonally, it seems that the consensus for storing the datatype
>>>>>> dataverse in the dataset Metadata is to just add it as an open
>>>>>> field at least for now. Is that correct?
>>>>>>
>>>>>> Steven
>>>>>>
>>>>>>
>>>>>> On Mon, Dec 14, 2015 at 1:23 PM, Mike Carey <dtabass@gmail.com> wrote:
>>>>>>
>>>>>>> Thoughts inlined:
>>>>>>>
>>>>>>> On 12/14/15 11:12 AM, Steven Jacobs wrote:
>>>>>>>
>>>>>>>> Here are the conclusions that Ildar and I have drawn from looking
>>>>>>>> at the secondary indexes:
>>>>>>>>
>>>>>>>> First of all it seems that datasets are local to node groups, but
>>>>>>>> dataverses can span node groups, which seems a little odd to me.
>>>>>>>>
>>>>>>> Node groups are an undocumented but to-be-exploited-someday feature
>>>>>>> that allows datasets to be stored on less than all nodes in a given
>>>>>>> cluster.  As we face bigger clusters, we'll want to open up that
>>>>>>> possibility.  We will hopefully use them inside w/o having to make
>>>>>>> users manage them manually like parallel DB2 did/does.  Dataverses
>>>>>>> are really just a namespace thing, not a storage thing at all, so
>>>>>>> they are orthogonal to (and unrelated to) node groups.
>>>>>>>
>>>>>>>> There are three Metadata secondary indexes:  
>>>>>>>> GROUPNAME_ON_DATASET_INDEX,
>>>>>>>> DATATYPENAME_ON_DATASET_INDEX, DATATYPENAME_ON_DATATYPE_INDEX
>>>>>>>>
>>>>>>>> The first is used in only one case:
>>>>>>>> When dropping a node group, check if there are any datasets using
>>>>>>>> this node group. If so, don't allow the drop.
>>>>>>>> BUT, this index has a field called "dataverse" which is not used
>>>>>>>> at all.
>>>>>>>>
>>>>>>> This one seems like a waste of space since we do this almost never.
>>>>>>> (Not much space, but unnecessary.)  If we keep it, it should become
>>>>>>> a proper index.
>>>>>>>
>>>>>>>> The second is used when dropping a datatype. If there is a dataset
>>>>>>>> using this datatype, don't allow the drop.
>>>>>>>> Similarly, this index has a "dataverse" which is never used.
>>>>>>>>
>>>>>>> You're about to use the dataverse part, right?  :-) This index seems
>>>>>>> like it will be useful but should be a proper index.
>>>>>>>
>>>>>>>> The third index is used to go in two cases, using two different
>>>>>>>> ideas of "keys".
>>>>>>>> It seems like this should actually be two different indexes.
>>>>>>>>
>>>>>>> I don't think I understood this comment....
>>>>>>>
>>>>>>>
>>>>>>>> This is my understanding so far. It would be good to discuss what
>>>>>>>> the "correct" version should be.
>>>>>>>> Steven
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Dec 14, 2015 at 10:12 AM, Steven Jacobs 
>>>>>>>> <sjaco002@ucr.edu> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>> I'm implementing a change so that datasets can use datatypes from
>>>>>>>>> alternate dataverses (previously the type and set had to be from
>>>>>>>>> the same dataverse). Unfortunately this means another change for
>>>>>>>>> Dataset Metadata (which will now store the dataverse for its type).
>>>>>>>>>
>>>>>>>>> As such, I had a couple of questions:
>>>>>>>>>
>>>>>>>>> 1) Should this change be thrown into the release branch, as it is
>>>>>>>>> another Metadata change?
>>>>>>>>>
>>>>>>>>> 2) In implementing this change, I've been looking at the Metadata
>>>>>>>>> secondary indexes. I had a discussion with Ildar, and it seems the
>>>>>>>>> thread on Metadata secondary indexes being "hacked" has been lost.
>>>>>>>>> Is this also something that should get into the release? Is there
>>>>>>>>> anyone currently looking at it?
>>>>>>>>>
>>>>>>>>> Steven
>>>>>>>>>
>>>>>>>>>
>>>>>
>>>>
>>>> Best regards,
>>>> Ildar
>>>>
>>>
>>> Best regards,
>>> Ildar
>>>
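
Till's suggestion in the quoted thread, adding version identifiers to persisted data structures so that old versions can be recognized and migrated, could be sketched roughly as follows. This is only an illustration under stated assumptions: the JSON envelope, the version numbers, the field names (`DataverseName`, `DatatypeDataverseName`), and the helper names are all hypothetical and do not describe how AsterixDB persists its metadata. The example migration mirrors the dataset change discussed in the thread, where a dataset record starts storing the dataverse of its datatype.

```python
import json

CURRENT_VERSION = 2  # hypothetical version of the persisted layout


def _v1_to_v2(record):
    """Hypothetical migration: v2 also stores the dataverse of the datatype."""
    migrated = dict(record)
    # Before the change, type and dataset had to live in the same dataverse,
    # so the dataset's own dataverse is a safe default.
    migrated.setdefault("DatatypeDataverseName", record["DataverseName"])
    return migrated


# One migration step per old version, applied in sequence.
MIGRATIONS = {1: _v1_to_v2}


def save(record):
    """Persist a record together with the version it was written with."""
    return json.dumps({"version": CURRENT_VERSION, "payload": record})


def load(text):
    """Read a persisted record, upgrading it step by step if it is old."""
    envelope = json.loads(text)
    version = envelope["version"]
    record = envelope["payload"]
    if version > CURRENT_VERSION:
        raise ValueError(f"written by a newer version: {version}")
    while version < CURRENT_VERSION:
        record = MIGRATIONS[version](record)
        version += 1
    return record
```

Even with fully open metadata, a tag like this lets the reading code tell versions apart instead of inferring the version from which fields happen to be present.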

