impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jeszy <jes...@gmail.com>
Subject Re: Questions about Statestore and Catalogservice
Date Mon, 11 Dec 2017 09:48:14 GMT
Thanks for pointing out the docs issue! I opened IMPALA-6303 to track it.

On 10 December 2017 at 15:47, Lars Francke <lars.francke@gmail.com> wrote:
> Thank you Bharath & Dimitris!
>
> That answers all the questions I have right now, thank you so much for
> taking the time to write it up.
>
> Regarding the docs:
> <https://impala.apache.org/docs/build/html/topics/impala_components.html>
>
>> The Impala component known as the catalog service relays the metadata
>> changes from Impala SQL statements to all the DataNodes in a cluster. It is
>> physically represented by a daemon process named catalogd; you only need
>> such a process on one host in the cluster. Because the requests are passed
>> through the statestore daemon, it makes sense to run the statestored and
>> catalogd services on the same host.
>
> Reading it again now it also says "DataNodes" which is not correct.
>
> Cheers,
> Lars
>
>
> On Fri, Dec 8, 2017 at 7:00 PM, Bharath Vissapragada <bharathv@cloudera.com>
> wrote:
>>
>> Looks like a topic for dev@.
>>
>> On Fri, Dec 8, 2017 at 2:48 AM, Lars Francke <lars.francke@gmail.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> I'm trying to understand how the communication between the components
>>> works.
>>>
>>> I understand that an impala daemon subscribes to the statestore. The
>>> statestore seems to have the concept of heartbeats and topics. But I'm not
>>> sure what topics are all about.
>>
>>
>> Statestore follows the standard pub-sub pattern where a publisher
>> publishes messages and subscribers subscribe to the messages/categories they
>> are interested in.  Like you mentioned, statestore is like a mediator
>> between the publishers and the subscribers.
>>
>> "Topic" is an abstraction that makes the content of these messages opaque
>> to the statestore. The publishers (like Catalog server for example)
>> serialize the messages (metadata for example) into a "Topic" to ship them to
>> the statestore which then broadcasts that to the interested subscribers
>> (coordinators). The coordinators then unpack/deserialize the topic into the
>> corresponding object classes (like Tables/Functions etc.) and apply those
>> updates locally.
>>
>> In Impala, currently we have the following topics:
>>
>> catalog-update - For Catalog metadata
>> impala-membership - For tracking liveness of the coordinators/executors
>> impala-request-queue -  For admission control
>>
>> You can see these in the statestore web UI (/topics page)
>>
>>>
>>>
>>> The docs also say that only the statestore communicates with the catalog
>>> service. How does that happen?
>>
>>
>> Can you point us to which doc you are referring to here?
>>
>> Techincally speaking, the coordinators also connect to the Catalog service
>> for executing DDLs, but I'm assuming you are speaking here in terms of the
>> broadcast of the table updates, in which case Catalog sends those tables to
>> the statestore (as a part of catalog-update topic) and those are broadcast
>> by the statestore to all the coordinators. (described above)
>>
>> How is a INVALIDATE/REFRESH statement routed from a daemon to the catalog
>> service and back?
>>
>> I'll take the example of REFRESH here.  The metadata flow looks something
>> like this
>>
>> - coordinator 'coo' gets 'refresh foo'
>> - 'coo' makes an RPC to the catalog server 'cat' for executing 'refresh'
>> - 'cat' refreshes the table 'foo', which changes the version of 'foo' from
>> v1 to v2 (Internally Catalog versions all the objects to track which objects
>> changed over time)
>> - 'cat' returns 'foo' (v2) directly to the coordinator 'coo'  (as the
>> result of RPC) which then applies the update locally.
>> - Additionally 'cat' also has a thread running in the background that
>> figures out that the 'foo' has changed (v1 -> v2), which then repacks 'foo'
>> into a "Topic" update and sends it to the statestore.
>> - Statestore then broadcasts the new updates to all the coordinators.
>>
>> INVALIDATE is slightly different in the sense that the coordinator doesn't
>> get foo(v2) back as the result of the rpc, instead it gets an
>> "IncompleteTable" (Impala terminology) which means that the table is either
>> missing the catalog metadata/it has been invalidated.
>>
>> There are many minor details on how the entire system works but "most"
>> Catalog updates work as above (with some exceptions).
>>
>>>
>>> I'm sure I'll have follow-up questions but this would already be very
>>> helpful. Thank you!
>>
>>
>> Sure, feel free to ask the list. Here are some code pointers incase you
>> are interested.
>>
>>
>> https://github.com/apache/impala/blob/master/be/src/statestore/statestore.h
>> (Topic/TopicEntry and other SS  abstractions)
>>
>>
>> https://github.com/apache/impala/blob/master/common/thrift/CatalogService.thrift#L45
>> (thrift definitions for most Catalog operations)
>>
>>
>> https://github.com/apache/impala/blob/master/be/src/service/impala-server.h#L324
>> (how coordinators apply the Catalog updates)
>>
>>
>> https://github.com/apache/impala/blob/master/be/src/catalog/catalog-server.cc#L187
>> (An example of how "catalog-update" is created & subscribed)
>>
>> HTH.
>>
>>>
>>> Cheers,
>>> Lars
>>
>>
>

Mime
View raw message