htrace-dev mailing list archives

From Colin McCabe <>
Subject Re: HTrace API comments
Date Fri, 16 Sep 2016 00:31:49 GMT
On Mon, Sep 12, 2016, at 08:32, Roberto Attias wrote:
> Hi Colin,
> see inline
> From: Colin McCabe <>
> To: <>; Roberto Attias <>
> Cc: John D. Ament <>; Jake Farrell <>; Ted Dunning <>
> Sent: Sunday, September 11, 2016 10:03 PM
> Subject: Re: HTrace API comments
> On Sat, Sep 10, 2016, at 20:04, Roberto Attias wrote:
> > Hello, I have some comments/concerns regarding the HTrace API, and
> > was wondering whether extensions/changes would be considered. I'm
> > listing the most important here; if there is interest we can discuss
> > more in detail.
> Welcome, Roberto!

Sorry for the delay in responses.  You have certainly given us a lot to
think about... thanks for your thoughtful comments.

> >
> > 1) From the HTrace Developer Guide:
> >
> >
> >
> > TraceScope objects manage the lifespan of Span objects. When a
> > TraceScope
> > is created, it often comes with an associated Span object. When this
> > scope is closed, the Span will be closed as well. “Closing”
> > the scope
> > means that the span is sent to a SpanReceiver for processing.
> >
> >
> > One of the implications of this model is that nested spans (for
> > example, instrumenting nested function calls) will be delivered to
> > the receiver in reverse order (as the innermost function completes
> > before the outermost). This may introduce more complexity in the
> > logic of the span receiver.
> Hmm.  While I would never say never, in the existing span
> receivers, we
> haven't found that delivering the spans in this order results in any
> extra complexity.  What you want is a span sink that aggregates all
> the
> spans together, and supports querying spans by various things like ID,
> time, etc.  This is typically a distributed database like HBase, Kudu,
> etc.  There isn't any performance or simplicity advantage to
> delivering
> spans in time order to these databases (as far as I know, at least).
> The advantage is not on the storage front, but rather on the consumer
> side. For example, consider a hypothetical messaging application. A
> sender client may send a message to a server, and the server stores
> the message until a receiver client logs in to consume pending
> messages. Say one span captures the function that sends the message,
> one span captures the time between when the message is received by the
> server and consumed from it, and one span captures the function that
> receives the message on the receiver client side. This may be a
> long-lasting (days) interaction, but a consumer will not be able to
> access any of the temporary information until the whole transaction is
> completed. Similarly, you mention a UI visualization issue later.

I feel like there are two separate issues here:
1. The order that spans get sent to the SpanReceiver
2. Whether spans get sent prior to being closed, or whether we wait
   until they're closed to send them.

I don't think #1 is that important.  The SpanReceivers inherently
receive a stream of spans from many different threads.  Even if the
spans are in order with respect to a single thread, they will not be in
time order across all threads.

#2 is a tradeoff.  If you send out spans prior to closing them, then you
have to do at least two sends-- since you need to update the span with
the new end time.  My proposal for addressing #2 is to have a timeout
beyond which we send the unclosed span.  I believe this is a good
tradeoff which avoids sending short-lived spans twice, but still allows
long-lived spans to work well.
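A minimal sketch of that timeout idea (illustrative Python, not the
HTrace implementation; `SpanBuffer` and the span-dict shape are invented
for this example):

```python
import time

class SpanBuffer:
    """Sketch of the timeout idea: short-lived spans are sent once, on
    close; spans that stay open past `timeout` are sent early, then
    sent again (with the end time filled in) when they finally close."""

    def __init__(self, timeout, send):
        self.timeout = timeout   # seconds an open span may stay unsent
        self.send = send         # callback delivering a span dict downstream
        self.open_spans = {}     # span_id -> [start_time, already_sent]

    def start(self, span_id, now=None):
        now = time.monotonic() if now is None else now
        self.open_spans[span_id] = [now, False]

    def poll(self, now=None):
        """Send any span that has been open longer than the timeout."""
        now = time.monotonic() if now is None else now
        for span_id, state in self.open_spans.items():
            start, sent = state
            if not sent and now - start >= self.timeout:
                self.send({"id": span_id, "start": start, "end": None})
                state[1] = True

    def close(self, span_id, now=None):
        now = time.monotonic() if now is None else now
        start, sent_early = self.open_spans.pop(span_id)
        # A span sent early goes out a second time with its end filled
        # in, so the receiver must tolerate seeing the same id twice.
        self.send({"id": span_id, "start": start, "end": now})
```

Note that a short-lived span triggers exactly one send, while a
long-lived one triggers two: this is exactly the tradeoff described
above.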

> Of course, in a distributed system, just because node A sends
> out a span
> before some other node B doesn't mean that node A's spans will arrive
> before B's in the distributed database.  And also, multiple
> threads and
> nodes will be sending spans to the database, so the input to the
> database will not be in strictly ascending time order anyway.
> >
> > Also, the fact that information about a span is not delivered until
> > the span is closed relies on the program not terminating abruptly.
> > In Java this is not so much of a problem, but in C, what happens if
> > a series of nested function calls is instrumented with spans, and
> > the innermost function crashes? As far as I can tell, none of the
> > spans are delivered. This makes the use of the tracing API
> > unreliable for bug analysis.
> I definitely agree that it is frustrating when a program crashes with
> spans which are buffered.  This can happen in both Java and C,
> although our out-of-the-box handling of shutdown hooks is better in
> Java.  This problem is difficult to avoid completely for a few
> different reasons:
> 1. As you commented, we don't output spans until they're complete...
> i.e., closed.
> 2. Without buffering, we end up doing an RPC per span, which is too
> costly in real-world systems.
> I agree performance of a tracing API is paramount. However, I've
> worked on real-time systems where a message per API action was
> generated. There are ways to reduce the impact of that, for example by
> using a local proxy which does the buffering on behalf of the
> application. Communication with such a proxy can be much more
> lightweight (Unix sockets or shared memory) than generic UDP/TCP-based
> RPCs. Although ultimately, IMHO, it should be left to the programmer
> to set up their tracing infrastructure according to their particular
> use case (in some cases the complexity of an extra proxy running may
> not be required).

I like the idea of having a local daemon on each node to keep spans
buffered for a while.  As you said, it solves the issue of the
application crashing prior to sending all its spans.

However, this is essentially an implementation issue, not an API issue.
You don't need to change the HTrace API one bit in order to implement
this.  One really cool project for someone to do would be to write a
span receiver which passes the spans to a local daemon, which then uses
another SpanReceiver to pass them on to something like
HBaseSpanReceiver or HTracedSpanReceiver.

As a side note, it is interesting to imagine a world where we pass all
the spans to the local daemon, and do downsampling there.  Then we
could have a kind of sliding window of downsampling, where we keep all
the spans from the last 10 seconds, 50% of the spans from the last
minute, and 1% of the spans from the last hour, 0.1% of the spans from
the last day, etc.
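A sketch of what that sliding window might look like (the window
boundaries, retention rates, and function names here are made up for
illustration; they just mirror the numbers in the paragraph above):

```python
import random

# Retention decays with span age: keep everything recent, keep a
# shrinking sample of older spans, drop anything older than a day.
WINDOWS = [          # (max age in seconds, fraction kept)
    (10, 1.0),       # last 10 seconds: keep everything
    (60, 0.5),       # last minute: keep 50%
    (3600, 0.01),    # last hour: keep 1%
    (86400, 0.001),  # last day: keep 0.1%
]

def keep_fraction(age):
    """Retention probability for a span of the given age (seconds)."""
    for max_age, fraction in WINDOWS:
        if age <= max_age:
            return fraction
    return 0.0  # older than a day: drop entirely

def downsample(spans, now, rng=random.random):
    """Keep each buffered span with the probability its age dictates."""
    return [s for s in spans if rng() < keep_fraction(now - s["start"])]
```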

The crux of the design problem here is: are we allowed to use a remote
datastore for spans, or not?  If the answer is that we can use a remote
datastore, then the local daemon doesn't really buy us anything except
crash resilience, because we still need to move the spans from the local
daemon to the central datastore (HBase, Kudu, htraced, whatever).  And
network bandwidth is limited, especially in production.

> I would also add, one thing that is frustrating sometimes is how very
> long-running spans don't show up for a while in the GUI.
> >
> > Would you consider a change where each API call produces at
> > least one
> > event sent to the SpanReceiver?
> It would be interesting to think about giving users (or maybe
> spanreceivers?) the option of receiving the same span twice: once when
> it was first opened, and once when it was completed.  Or maybe having
> spans which were uncompleted for a certain amount of time sent out, to
> better avoid losing them in a crash.
> We'd have to think carefully about this to avoid overwhelming
> users with
> configuration knobs.  And we'd also have to document that
> SpanReceivers
> would have to be able to handle receiving the same span twice.
> Hopefully the consistency implications don't get too tricky.
> That seems to me to be forcing the existing model. A span by
> definition should have a start and an end time. IMHO the creation of
> the span should be one event, and its closure a different one.

Spans can be completed or uncompleted.  After all, the span object
exists in memory prior to being closed.

It is interesting to think about event-based models, but so far I
feel like there hasn't been a strong argument why they should
completely replace span-based models.  Certainly a span-based model
is more compatible with other APIs like OpenTracing, Zipkin, etc.
etc.  We are willing to go our own way on the API, but we would need
a pretty good reason.
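To make the comparison concrete, here is a sketch (illustrative Python,
not HTrace code; the event shape and function name are invented) of how
a stream of separate open and close events can be folded back into
completed spans, which is one reason the two models carry essentially
the same information:

```python
def spans_from_events(events):
    """Reassemble {'type': 'open'|'close', 'id', 'time'} events into
    completed spans, emitted in close order."""
    open_times = {}
    spans = []
    for ev in events:
        if ev["type"] == "open":
            open_times[ev["id"]] = ev["time"]
        else:  # "close"
            spans.append({"id": ev["id"],
                          "start": open_times.pop(ev["id"]),
                          "end": ev["time"]})
    return spans
```

For nested calls the reassembled list comes out innermost-first, which
is the same reverse-order delivery discussed earlier in this thread.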

> >
> > 2) HTrace has a concept of spans having one or more parents.  This
> > allows, for example, capturing the fact that a process makes an RPC
> > call to another.  However, there is no information about when within
> > the span the caller calls the callee. A caller span may have two
> > child spans, representing the fact that it made two RPC calls, but
> > the order in which those were made is lost in the model (using the
> > timestamps associated with the beginning of the callee spans is not
> > feasible, as there may be different RPC latencies, or simply the
> > clocks may not be aligned). Also, the only relation captured by the
> > API is between blocks.
> In your example, is the caller span making the two RPCs in
> parallel?  If
> so, it might be appropriate to say that the spans don't have a
> well-defined ordering.  Certainly we don't have any guarantees about
> which one will be processed first.  Which one was initiated first
> doesn't seem very interesting-- unless I'm missing something.
> Actually, in my example the two RPC calls were made consecutively by
> the same thread, i.e. they were sequential. I would expect concurrent
> calls to originate from separate spans, one per thread. However, even
> in this case there is a difference between the potential order of the
> calls and the actual order. A well-written program should behave
> properly whatever the order is. But finding out that the program
> misbehaves when the calls happen in a certain order may be invaluable.

I feel like adding nanosecond time resolution solves this problem for
all practical purposes.

For impractical purposes... if we really believe that a function call
can take less than a nanosecond (or more likely, that the clock
granularity is coarse), we can fudge things a bit such that the start
time of span N+1 is at least one nanosecond greater than the start time
of span N.  I suppose we could also put another piece of metadata on the
parent/child relationship as well.
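A sketch of that fudge (illustrative; `MonotonicClock` is not an HTrace
class): hand out span start times from a counter that is strictly
monotonic, so even when the raw clock reports the same nanosecond for
two consecutive spans, span N+1 starts at least one nanosecond after
span N.

```python
class MonotonicClock:
    def __init__(self, read_clock):
        self.read_clock = read_clock  # returns current time in nanoseconds
        self.last = -1

    def next_start_time(self):
        now = self.read_clock()
        # Bump forward by one if the raw clock stalled or went backwards,
        # so consecutive spans always get strictly increasing start times.
        self.last = max(now, self.last + 1)
        return self.last
```

With a stalled (or backwards-stepping) raw clock, consecutive calls
still yield strictly increasing timestamps, preserving the intra-thread
call order.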

> >
> > I propose a more general API with a concept of spans and points
> > (timestamped sets of annotations), and cause-effect relationships
> > among points. An RPC call can be represented as a point in the
> > caller span marked as cause, and a (begin) point in the callee span
> > marked as effect. This is very flexible and allows capturing all
> > sorts of relationships, not just parent-child. For example, a DMA
> > operation may be initiated in a block and captured as a point, and
> > the completion captured as a point in a distinct block in the same
> > entity (an abstraction for a unit of concurrency).
> We've talked about tracking "points" in addition to "spans" before.
> This mainly came up in the context of tracing "point" events like
> application launches, MapReduce jobs being initiated, etc.  The
> biggest objection is that spans and points have almost as much data
> (the main difference is that points don't have an "end"), so creating
> a whole separate code pathway and storage pathway might be overkill.
> We have to think about this more.
> It's interesting to think about adding some kind of "comes-after"
> dependency to htrace spans, besides the parent/child dependency.  That
> has kind of a vector clock flavor.  I do wonder how often this is
> really a problem in practice, though...
> > 3) There doesn't seem to be any provision in the HTrace API for
> > considering clock domains. In a distributed system, there may be
> > processes running on the same host, processes running in the same
> > cluster, and processes running in different clusters. Different
> > domains may have different degrees of clock misalignment. Providing
> > indications of this information in the API allows the backend or UI
> > trace building to make more accurate inferences about how concurrent
> > entities line up.
> Clock skew is a very difficult problem.  Even determining how much
> clock skew exists is a difficult problem, since all your messages from
> one node to another will have some latency.  There are estimation
> heuristics out there, but it's complex.  Even systems like
> AugmentedTime don't attempt to precisely quantify clock skew, but just
> to keep it below some threshold required for correctness.
> I agree. What I was thinking of is not a mechanism to estimate
> clock skew,
> but rather a mechanism where a user can configure a maximum expected
> clock skew. This information can be integrated with causality
> dependency
> imposed by "edges" (span dependencies and possibly new causality
> dependencies) to constrain topological sorting of the model graph.

I agree that this is conceptually clean, and might allow some automated
reasoning or proof process to be done on the DAG.  However, configuring
maximum clock skew seems quite complex.  I doubt most admins would know
how to do it, or be able to figure out if they were doing it wrong.  It
feels like a lot of complexity that would need a strong justification to
get end-users to bother with it.
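As a sketch of how such a configured bound could be used (illustrative
code, not a proposal for the actual API): with skew bound `eps`, an
event stamped `t1` on one node can only have caused an event stamped
`t2` on another node if `t1 < t2 + eps`, and each cause-effect edge in
the trace graph can be checked against that constraint.

```python
def could_have_caused(t1, t2, eps):
    """True if causality t1 -> t2 (timestamps from different clock
    domains) is consistent with a maximum clock skew of eps."""
    return t1 < t2 + eps

def consistent(edges, eps):
    """Check every (cause_time, effect_time) edge against the bound."""
    return all(could_have_caused(t1, t2, eps) for t1, t2 in edges)
```

An edge that violates the bound would indicate either a genuine ordering
anomaly or that the configured skew is wrong, which is exactly why
getting `eps` right matters.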

> In general, admins run NTP on their servers.  YARN even requires this
> (or so I'm told... there is a JIRA out there I could find).  From a
> practical point of view, I'm not sure what admins would do with clock
> skew data (but perhaps there's something I haven't thought of here).
> One thing that might be interesting is some kind of way of warning
> admins if the clocks are seriously misaligned (indicating that NTP was
> down, or there was a clock adjustment mishap, or something like that).
> Traditionally, that's the job of the cluster management system, but it
> would be interesting if we could surface that information in some way.
> > 4) Does the API provide a mechanism for creating "delegated
> > traces"? What I mean by this is that in some circumstances some
> > thread may need to create traces on behalf of some other element
> > which may not have such capability. For example, a mobile device may
> > have some custom tracing mechanism, and attach the information to a
> > request for the server. The server would then need to create the
> > HTrace trace from the existing data passed in the request (including
> > timestamps).
> Sure.  In this case, the server can just create a span from JSON the
> client sent using the MilliSpanDeserializer.  If you don't want to use
> JSON for some reason, you can construct an arbitrary span object using
> MilliSpan#Builder.
> > Let me know if there is interest in discussing changes at this
> > level.
> > Thanks,
> >                     Roberto
> Sure.  I have to warn you that we have a strong bias towards
> compatible
> changes, though.  It is difficult to get all the downstream
> projects to
> change how they use the API, even when there is a strong reason to
> change.  Almost as hard as getting Hadoop to do a new release :)
> I understand that. To be honest I have a clean room implementation
> of an API based on my previous experiences with tracing, but I'm
> trying to see whether this could be captured by extensions to the
> existing
> HTrace API.

This is a really hard question to answer.  Normally people come to us
with small API tweaks like adding an extra field or function.  It seems
like your model is significantly conceptually different than what we
have.  If I understand correctly, you are using points rather than
spans, and focusing on long-running traces rather than requests.

The HTrace API uses semantic versioning.  So we do not make backwards
incompatible changes to the Java or C API in HTrace 4.x.  We would
reserve HTrace 5.x for that.  However, we still have some downstream
projects we need to move off of HTrace 3.x, which had a different API
(it had only 64-bit span IDs, and did not support multiple parents of a
span, for example).  So from an end-user point of view, I'm a little
worried that we should not churn major API versions too quickly, before
getting more adoption.

I think the key here is to identify exactly what model you want, and why
end-users of HTrace should want that model, rather than the existing
one.  If you are OK with your API becoming Apache-licensed, you could
post it as a patch to JIRA for people to look at.  You don't need any
implementation unless you want to post it.  Keep in mind I'm not
promising that we will adopt anything, or not adopt anything, only that
we'll take a look.

Thanks again for the really interesting ideas and for being a part of
the community.


> I'm curious if you have a project you are thinking about instrumenting
> with HTrace.  We would love to hear more about how people are using
> HTrace or plan to use it, so we can build what people want.
> I don't have a specific project right now. I've been working on
> tracing at Cisco and Facebook over the last few years, and I'm in
> between gigs right now, so I'm interested in crystallizing my
> experience into an open source framework.
> cheers,
> Colin