impala-user mailing list archives

From Lars Francke <lars.fran...@gmail.com>
Subject Re: Questions about Statestore and Catalogservice
Date Sun, 10 Dec 2017 14:47:41 GMT
Thank you Bharath & Dimitris!

That answers all the questions I have right now, thank you so much for
taking the time to write it up.

Regarding the docs:
<https://impala.apache.org/docs/build/html/topics/impala_components.html>

> The Impala component known as the catalog service relays the metadata
> changes from Impala SQL statements to all the DataNodes in a cluster. It is
> physically represented by a daemon process named catalogd; you only need
> such a process on one host in the cluster. Because the requests are passed
> through the statestore daemon, it makes sense to run the statestored and
> catalogd services on the same host.

Reading it again now, I see it also says "DataNodes", which is not correct.

Cheers,
Lars


On Fri, Dec 8, 2017 at 7:00 PM, Bharath Vissapragada <bharathv@cloudera.com>
wrote:

> Looks like a topic for dev@.
>
> On Fri, Dec 8, 2017 at 2:48 AM, Lars Francke <lars.francke@gmail.com>
> wrote:
>
>> Hi,
>>
>> I'm trying to understand how the communication between the components
>> works.
>>
>> I understand that an impala daemon subscribes to the statestore. The
>> statestore seems to have the concept of heartbeats and topics. But I'm not
>> sure what topics are all about.
>>
>
> Statestore follows the standard pub-sub pattern
> <https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern> where a
> publisher publishes messages and subscribers subscribe to the
> messages/categories they are interested in.  Like you mentioned, statestore
> is like a mediator between the publishers and the subscribers.
>
> "Topic" is an abstraction that makes the content of these messages opaque
> to the statestore. The publishers (like Catalog server for example)
> serialize the messages (metadata for example) into a "Topic" to ship them
> to the statestore which then broadcasts that to the interested subscribers
> (coordinators). The coordinators then unpack/deserialize the topic into the
> corresponding object classes (like Tables/Functions etc.) and apply those
> updates locally.
>
> In Impala, currently we have the following topics:
>
> catalog-update - For Catalog metadata
> impala-membership - For tracking liveness of the coordinators/executors
> impala-request-queue -  For admission control
>
> You can see these in the statestore web UI (/topics page)
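
As a rough illustration of the pub-sub flow described above, here is a minimal
Python sketch. It is not Impala code: the class and method names
(MiniStatestore, MiniCoordinator, subscribe, publish, on_update) and the JSON
payload format are assumptions made for the example. Only the topic name
"catalog-update" comes from the list above. The point it tries to show is that
the mediator forwards bytes without ever looking inside them.

```python
import json
from typing import Callable, Dict, List

# Hypothetical names throughout; this is not Impala's statestore API.
Callback = Callable[[str, bytes], None]


class MiniStatestore:
    """Relays opaque byte payloads from publishers to topic subscribers."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callback]] = {}

    def subscribe(self, topic: str, callback: Callback) -> None:
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic: str, payload: bytes) -> int:
        # The payload is never inspected here; it is only forwarded.
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(topic, payload)
        return len(callbacks)


class MiniCoordinator:
    """Subscriber that deserializes catalog updates and applies them locally."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.local_catalog: Dict[str, int] = {}

    def on_update(self, topic: str, payload: bytes) -> None:
        update = json.loads(payload.decode("utf-8"))
        self.local_catalog[update["table"]] = update["version"]


statestore = MiniStatestore()
coordinators = [MiniCoordinator("coo1"), MiniCoordinator("coo2")]
for coordinator in coordinators:
    statestore.subscribe("catalog-update", coordinator.on_update)

# The publisher (a stand-in for the catalog server) serializes before sending.
message = json.dumps({"table": "foo", "version": 2}).encode("utf-8")
delivered = statestore.publish("catalog-update", message)
```

Because only the publisher and the subscribers agree on the payload format, a
new topic can be added in this sketch without touching MiniStatestore at all.
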
>
>>
>>
>> The docs also say that only the statestore communicates with the catalog
>> service. How does that happen?
>>
>
> Can you point us to which doc you are referring to here?
>
> Technically speaking, the coordinators also connect to the Catalog service
> for executing DDLs, but I'm assuming you are speaking here in terms of the
> broadcast of the table updates, in which case Catalog sends those tables to
> the statestore (as a part of catalog-update topic) and those are broadcast
> by the statestore to all the coordinators. (described above)
>
>> How is an INVALIDATE/REFRESH statement routed from a daemon to the catalog
>> service and back?
>
> I'll take the example of REFRESH here.  The metadata flow looks something
> like this
>
> - coordinator 'coo' gets 'refresh foo'
> - 'coo' makes an RPC to the catalog server 'cat' for executing 'refresh'
> - 'cat' refreshes the table 'foo', which changes the version of 'foo' from
> v1 to v2 (Internally Catalog versions all the objects to track which
> objects changed over time)
> - 'cat' returns 'foo' (v2) directly to the coordinator 'coo'  (as the
> result of RPC) which then applies the update locally.
> - Additionally, 'cat' has a thread running in the background that
> notices that 'foo' has changed (v1 -> v2), repacks 'foo' into a "Topic"
> update and sends it to the statestore.
> - Statestore then broadcasts the new updates to all the coordinators.
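
The REFRESH steps above can be sketched roughly as follows in Python. This is
an illustration under stated assumptions, not Impala's implementation: every
class, method, and field name here is hypothetical, the background thread is
replaced by an explicit send_topic_update() call, and the rule that a
coordinator ignores an update that is not newer than what it already holds is
an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Table:
    name: str
    version: int


class Coordinator:
    def __init__(self, name: str) -> None:
        self.name = name
        self.tables: Dict[str, Table] = {}

    def apply(self, table: Table) -> None:
        # Assumption of this sketch: keep only the newest version seen.
        current = self.tables.get(table.name)
        if current is None or table.version > current.version:
            self.tables[table.name] = table


class Statestore:
    def __init__(self, subscribers: List[Coordinator]) -> None:
        self.subscribers = subscribers

    def broadcast(self, tables: List[Table]) -> None:
        for subscriber in self.subscribers:
            for table in tables:
                subscriber.apply(table)


class CatalogServer:
    def __init__(self, statestore: Statestore) -> None:
        self.statestore = statestore
        self.version = 1
        self.tables: Dict[str, Table] = {"foo": Table("foo", 1)}
        self.last_sent_version = 1

    def refresh(self, name: str) -> Table:
        # Stand-in for the RPC handler: bump the version, return the new object.
        self.version += 1
        self.tables[name] = Table(name, self.version)
        return self.tables[name]

    def send_topic_update(self) -> None:
        # Stand-in for the background thread: ship what changed since last send.
        changed = [t for t in self.tables.values()
                   if t.version > self.last_sent_version]
        self.statestore.broadcast(changed)
        self.last_sent_version = self.version


coo, other = Coordinator("coo"), Coordinator("other")
for coordinator in (coo, other):
    coordinator.apply(Table("foo", 1))

cat = CatalogServer(Statestore([coo, other]))
coo.apply(cat.refresh("foo"))                # direct RPC result reaches 'coo' first
other_before = other.tables["foo"].version   # 'other' has not seen v2 yet
cat.send_topic_update()                      # the broadcast brings 'other' up to date
```

In this sketch the issuing coordinator sees v2 immediately, while the other
coordinator only catches up once the topic update has been broadcast.
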
>
> INVALIDATE is slightly different in the sense that the coordinator doesn't
> get foo(v2) back as the result of the RPC. Instead it gets an
> "IncompleteTable" (Impala terminology), which means that the table is either
> missing its catalog metadata or has been invalidated.
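
To make the difference between the two RPC results concrete, here is a small
Python sketch. Only the name IncompleteTable comes from the text above;
LoadedTable, the columns field, the two functions, and the idea that both
calls bump the version by one are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LoadedTable:
    """Hypothetical stand-in for a table whose metadata is fully present."""
    name: str
    version: int
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class IncompleteTable:
    """Placeholder: the table is known by name but carries no metadata."""
    name: str
    version: int


def refresh(name: str, version: int) -> LoadedTable:
    # REFRESH returns the reloaded table itself (made-up columns here).
    return LoadedTable(name, version + 1, ("id", "value"))


def invalidate(name: str, version: int) -> IncompleteTable:
    # INVALIDATE returns only a placeholder with no metadata attached.
    return IncompleteTable(name, version + 1)


refreshed = refresh("foo", 1)
invalidated = invalidate("foo", 1)
```
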
>
> There are many minor details on how the entire system works but "most"
> Catalog updates work as above (with some exceptions).
>
>
>> I'm sure I'll have follow-up questions but this would already be very
>> helpful. Thank you!
>>
>
> Sure, feel free to ask the list. Here are some code pointers in case you
> are interested.
>
> https://github.com/apache/impala/blob/master/be/src/statestore/statestore.h
> (Topic/TopicEntry and other SS abstractions)
>
> https://github.com/apache/impala/blob/master/common/thrift/CatalogService.thrift#L45
> (thrift definitions for most Catalog operations)
>
> https://github.com/apache/impala/blob/master/be/src/service/impala-server.h#L324
> (how coordinators apply the Catalog updates)
>
> https://github.com/apache/impala/blob/master/be/src/catalog/catalog-server.cc#L187
> (An example of how "catalog-update" is created & subscribed)
>
> HTH.
>
>
>> Cheers,
>> Lars
>>
>
>
