impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Thomas Tauber-Marshall (Code Review)" <ger...@cloudera.org>
Subject [hs2client-CR] Initial structure for the C++ hiveserver2 client.
Date Sat, 07 May 2016 00:00:30 GMT
Thomas Tauber-Marshall has posted comments on this change.

Change subject: Initial structure for the C++ hiveserver2 client.
......................................................................


Patch Set 12:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/2645/12/src/hs2client/columnar-row-set.h
File src/hs2client/columnar-row-set.h:

Line 54:   virtual const uint8_t* nulls() const = 0;
> let's stick with impala naming conventions
What convention?


Line 62:   TypedColumn(const uint8_t* nulls, const std::vector<T>* data)
> does this need to be public?
Done


http://gerrit.cloudera.org:8080/#/c/2645/12/src/hs2client/fetch-results.h
File src/hs2client/fetch-results.h:

Line 25: // Row oriented results are not currently supported.
> do we plan on supporting this in the future? (and if so, why?)
Its not a priority, but we'd like to leave ourselves the ability to support it in the future,
as some workloads will be more efficient will row oriented results, and we're trying to keep
this as general and close to the hs2 spec as possible in case we want to use this as a client
for other non-impala hs2 services in the future.


http://gerrit.cloudera.org:8080/#/c/2645/12/src/hs2client/hs2service.h
File src/hs2client/hs2service.h:

Line 65: // HS2Service objects are created using HS2Service::Connect. They must
> function references end in ()
Done


Line 76: class HS2Service {
> what was the rationale for prefixing this with 'HS2' (and session) but not 
I did it this way because that's how it is in Impyla, but of course this is our opportunity
to correct mistakes from Impyla, and you make a good point, so I'll remove the HS2 and just
rely on the namespace for disambiguation.


Line 109:   Status OpenSession(const std::string& user, const HS2ClientConfig& config,
> is 'user' sufficient as an authentication mechanism?
The authentication happens in Service::Connect.

This brings up the good point, though, that the definition of Connect above doesn't really
make any sense, since if use_ssl was true you would also need to specify a username/password/protocol,
so I'll remove use_ssl and when we eventually add authentication support we'll add another
Connect function that takes the parameters needed for that.


http://gerrit.cloudera.org:8080/#/c/2645/12/src/hs2client/hs2session.h
File src/hs2client/hs2session.h:

Line 48:       std::shared_ptr<Operation>* operation) const;
> the class comment for Operation says it's not thread-safe, so why return th
Part of the original thinking was that Operations will be passed around a lot by client code,
and its easier to do that if they're in shared_ptrs. I agree that its questionable, though,
especially since we're using unique_ptrs everywhere else.

I'd like to confirm with Wes before making this change, though.


Line 55:   Status GetSchemas(std::shared_ptr<Operation>* operation) const;
> what exactly does this do? (what is role of operation?)
The Operation gives us access to its state, such as the log and profile. Much of this is uninteresting
or not populated at all for these metadata operations as Impala is currently written, but
we may someday want to change that.

It also gives you more flexibility, eg. you may want to fetch the results in batches of a
certain size for something like GetColumns that might return a large number of results, or
you may want to fetch the results in a different order, such as refetching the prior batch
of result, once Impala supports different fetch orientations.

I may be wrong that anyone would ever want to do those things, but it doesn't seem to hurt
too much to leave the option there.


Line 59:   Status GetTables(std::shared_ptr<Operation>* operation) const;
> instead of returning the results of these metadata requests as generic oper
As above, I'm not convinced that giving users access to Operation objects for these calls
isn't useful (or at the very least, more flexible and future-proof), but you're correct that
we could do more here to make this easy for users. I filed an issue for this: https://github.com/cloudera/hs2client/issues/8


http://gerrit.cloudera.org:8080/#/c/2645/12/src/hs2client/operation.h
File src/hs2client/operation.h:

Line 40: enum class OperationState {
> move into class Operation and drop 'Operation' from name?
Done


Line 41:   INITIALIZED_STATE,
> embedding the type name in the enum constants is common practice in c to di
Done


Line 52: // and to retrieve its results.
> how do you get result set metadata? (the GetResultSetMetadata rpc)
Coming in a follow up review.


Line 73:   // Fetches a batch of results, stores them in 'results', and sets has_more_rows.
> you should point out in the function comments whether a function blocks. i'
Done


Line 104:   // True if this operation has begun execution and has not been closed yet.
> how does this related to OperationState?
If it is false, that either means that the operation was never successfully created or that
it has been closed. Either way, its awkward to rely on calling GetState to get this information,
as Impala won't ever actually return a CLOSED state - in both situations Impala won't have
any information about the operation so what we'll actually get back is a "invalid query handle"
error.

The comment is a little misleading, as its not clear if "begun execution" applies when the
operation is in state INITIALIZED, so I updated it to be clearer.


http://gerrit.cloudera.org:8080/#/c/2645/12/src/hs2client/sample-usage.cc
File src/hs2client/sample-usage.cc:

Line 85:     total_retrieved += int_col->length();
> it feels like the number of rows should also be part of FetchResults
The problem with doing that is that it would require us to know what the type of at least
one of the columns is when the FetchResult is created in Operation::Fetch so that we can extract
the column data from the thrift object. The only way to do that would be to also make a call
to GetResultSetMetadata, which seems like overhead that we wouldn't want to add to every call
to Fetch, or to require the user to specify the schema when creating the operation or calling
Fetch, which seems like an unnecessary complication.


Line 87:       int32_t int_val = int_col->data()[i];
> why not also add a GetData(int), so you don't always have to deal with the 
Done


http://gerrit.cloudera.org:8080/#/c/2645/12/src/hs2client/util.cc
File src/hs2client/util.cc:

Line 101:     hs2::TGetResultSetMetadataReq req;
> i think this part of hs2 functionality should be exposed in a consumable wa
Yes, we're working on a follow up patch with this functionality, we're just trying not to
keep adding things to this review so that we can actually get it finalized and in so that
Wes can start writing the python wrapper.


-- 
To view, visit http://gerrit.cloudera.org:8080/2645
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7be82d5897ad6093116915b17924674c4ae508a0
Gerrit-PatchSet: 12
Gerrit-Project: hs2client
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-Reviewer: Wes McKinney <wes@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message