From: Roberto Attias
To: Colin McCabe, "dev@htrace.incubator.apache.org"
Cc: "John D. Ament", Jake Farrell, Ted Dunning
Date: Mon, 12 Sep 2016 15:32:17 +0000 (UTC)
Subject: Re: HTrace API comments
Message-ID: <1486017111.3640803.1473694337181@mail.yahoo.com>
In-Reply-To: <1473656613.1795440.722618881.2B3A74F0@webmail.messagingengine.com>

Hi Colin, see inline.

From: Colin McCabe
To: dev@htrace.incubator.apache.org; Roberto Attias
Cc: John D. Ament; Jake Farrell; Ted Dunning
Sent: Sunday, September 11, 2016 10:03 PM
Subject: Re: HTrace API comments

On Sat, Sep 10, 2016, at 20:04, Roberto Attias wrote:
> Hello, I have some comments/concerns regarding the HTrace API, and was
> wondering whether extensions/changes would be considered. I'm listing
> the most important here; if there is interest we can discuss in more
> detail.

Welcome, Roberto!

>
> 1) From the HTrace Developer Guide:
>
> TraceScope objects manage the lifespan of Span objects. When a TraceScope
> is created, it often comes with an associated Span object. When this
> scope is closed, the Span will be closed as well. “Closing” the scope
> means that the span is sent to a SpanReceiver for processing.
>
> One of the implications of this model is the fact that nested spans (for
> example instrumenting nested function calls) will be delivered to the
> receiver in reverse order (as the innermost function completes before the
> outermost).
> This may introduce more complexity in the logic of the span
> receiver.

Hmm. While I would never say never, in the existing span receivers, we
haven't found that delivering the spans in this order results in any
extra complexity. What you want is a span sink that aggregates all the
spans together, and supports querying spans by various things like ID,
time, etc. This is typically a distributed database like HBase, Kudu,
etc. There isn't any performance or simplicity advantage to delivering
spans in time order to these databases (as far as I know, at least).

The advantage is not on the storage front, but rather on the consumer
side. For example, consider a hypothetical messaging application. A
sender client may send a message to a server, with the server storing
the message until a receiver client logs in to consume pending messages.
Say one span captures the function that sends the message, another
captures the time between when the message is received by the server and
when it is consumed from it, and a third captures the function that
receives the message on the receiver client side. This may be a
long-lasting (days) interaction, but a consumer will not be able to
access any of the intermediate information until the whole transaction
is completed. Similarly, you mention a UI visualization issue later.

Of course, in a distributed system, just because node A sends out a span
before some other node B doesn't mean that node A's spans will arrive
before B's in the distributed database. And also, multiple threads and
nodes will be sending spans to the database, so the input to the
database will not be in strictly ascending time order anyway.

>
> Also, the fact that information about a span is not delivered until the
> span is closed relies on the program not terminating abruptly.
> In Java this is not so much of a problem, but in C what happens if a
> series of nested function calls is instrumented with spans, and the
> innermost function crashes? As far as I can tell none of the spans is
> delivered. This makes the use of the tracing API unreliable for bug
> analysis.

I definitely agree that it is frustrating when a program crashes with
spans which are buffered. This can happen in both Java and .. although
our out-of-the-box handling of shutdown hooks is better in Java. This
problem is difficult to avoid completely for a few different reasons:
1. As you commented, we don't output spans until they're complete...
   i.e., closed.
2. Without buffering, we end up doing an RPC per span, which is too
   costly in real-world systems.

I agree performance of a tracing API is paramount. However, I've worked
on real-time systems where a message was generated per API action.
There are ways to reduce the impact of that, for example by using a
local proxy which does the buffering on behalf of the application.
Communication with such a proxy can be much more lightweight (Unix
sockets or shared memory) than generic UDP/TCP-based RPCs. Ultimately,
though, IMHO it should be left to the programmer to set up their tracing
infrastructure according to their particular use case (in some cases the
complexity of running an extra proxy may not be required).

I would also add, one thing that is frustrating sometimes is how very
long-running spans don't show up for a while in the GUI.

>
> Would you consider a change where each API call produces at least one
> event sent to the SpanReceiver?

It would be interesting to think about giving users (or maybe span
receivers?) the option of receiving the same span twice: once when it
was first opened, and once when it was completed. Or maybe having spans
which were uncompleted for a certain amount of time sent out, to better
avoid losing them in a crash.
We'd have to think carefully about this to avoid overwhelming users with
configuration knobs. And we'd also have to document that SpanReceivers
would have to be able to handle receiving the same span twice.
Hopefully the consistency implications don't get too tricky.

That seems to me to be forcing the existing model. A span by definition
should have a start and an end time. IMHO the creation of the span
should be one event, and its closure a different one.

>
> 2) HTrace has a concept of spans having one or more parents. This
> allows, for example, capturing the fact that a process makes an RPC call
> to another. However, there is no information about when within the span
> the caller calls the callee. A caller span may have two child spans,
> representing the fact that it made two RPC calls, but the order in which
> those were made is lost in the model (using the timestamps associated
> with the beginning of the callee spans is not feasible, as there may be
> different RPC latencies, or the clocks may simply not be aligned). Also,
> the only relation captured by the API is between blocks.

In your example, is the caller span making the two RPCs in parallel? If
so, it might be appropriate to say that the spans don't have a
well-defined ordering. Certainly we don't have any guarantees about
which one will be processed first. Which one was initiated first doesn't
seem very interesting, unless I'm missing something.

Actually, in my example the two RPC calls were made consecutively by the
same thread, i.e. they were sequential. I would expect concurrent calls
to originate from separate spans, one per thread. However, even in this
case there is a difference between the potential order of the calls and
the actual order. A well-written program should behave properly whatever
the order is. But finding out that the program misbehaves when the calls
happen in a certain order may be invaluable.
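
To make the "creation is one event, closure is another" idea above more concrete, here is a minimal sketch. It is not the HTrace API: `Tracer`, `ListReceiver`, and the event dictionaries are hypothetical names invented for illustration, and a real receiver would forward events to a proxy or database rather than keep them in a list.

```python
import itertools
import time
from contextlib import contextmanager


class ListReceiver:
    """Hypothetical receiver that simply records every event it is handed."""

    def __init__(self):
        self.events = []

    def receive(self, event):
        self.events.append(event)


class Tracer:
    """Hypothetical tracer that reports span open and span close separately."""

    def __init__(self, receiver):
        self.receiver = receiver
        self._ids = itertools.count(1)
        self._stack = []  # ids of the currently open spans, innermost last

    @contextmanager
    def span(self, description):
        span_id = next(self._ids)
        parent = self._stack[-1] if self._stack else None
        # The "open" event is delivered immediately, before the body runs.
        self.receiver.receive({"type": "open", "span": span_id,
                               "parent": parent, "description": description,
                               "time": time.time()})
        self._stack.append(span_id)
        try:
            yield span_id
        finally:
            self._stack.pop()
            # The "close" event only carries what was unknown at open time.
            self.receiver.receive({"type": "close", "span": span_id,
                                   "time": time.time()})


receiver = ListReceiver()
tracer = Tracer(receiver)
with tracer.span("outer"):
    with tracer.span("inner"):
        pass
```

With this shape the receiver sees the outer span's open event before the inner one, and if the process died inside the inner span, both open events would already have been handed over. The cost is the one raised earlier in the thread: two messages per span instead of one, which is where buffering or a local proxy would come in.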
>
> I propose a more general API with a concept of spans and points
> (timestamped sets of annotations), and cause-effect relationships among
> points. An RPC call can be represented as a point in the caller span
> marked as cause, and a (begin) point in the callee span marked as
> effect. This is very flexible and allows capturing all sorts of
> relationships, not just parent-child. For example, a DMA operation may
> be initiated in a block and captured as a point, and its completion
> captured as a point in a distinct block in the same entity (an
> abstraction for a unit of concurrency).

We've talked about tracking "points" in addition to "spans" before.
This mainly came up in the context of tracing "point" events like
application launches, MapReduce jobs being initiated, etc. The biggest
objection is that points carry almost as much data as spans (the main
difference is points don't have an "end"), so creating a whole separate
code pathway and storage pathway might be overkill. We have to think
about this more.

It's interesting to think about adding some kind of "comes-after"
dependency to htrace spans, besides the parent/child dependency. That
has kind of a vector clock flavor. I do wonder how often this is really
a problem in practice, though...

> 3) There doesn't seem to be any provision in the HTrace API for
> considering clock domains. In a distributed system, there may be
> processes running on the same host, processes running in the same
> cluster, and processes running in different clusters. Different domains
> may have different degrees of clock misalignment. Providing indications
> of this information in the API allows the backend or UI trace building
> to make more accurate inferences on how concurrent entities line up.
Clock skew is a very difficult problem. Even determining how much clock
skew exists is a difficult problem, since all your messages from one
node to another will have some latency. There are estimation heuristics
out there, but it's complex. Even systems like AugmentedTime don't
attempt to precisely quantify clock skew, but just to keep it below some
threshold required for correctness.

I agree. What I was thinking of is not a mechanism to estimate clock
skew, but rather a mechanism where a user can configure a maximum
expected clock skew. This information can be integrated with the
causality dependencies imposed by "edges" (span dependencies and
possibly new causality dependencies) to constrain topological sorting of
the model graph.

In general, admins run NTP on their servers. YARN even requires this
(or so I'm told... there is a JIRA out there I could find). From a
practical point of view, I'm not sure what admins would do with clock
skew data (but perhaps there's something I haven't thought of here).

One thing that might be interesting is some kind of way of warning
admins if the clocks are seriously misaligned (indicating that NTP was
down, or there was a clock adjustment mishap, or something like that).
Traditionally, that's the job of the cluster management system, but it
would be interesting if we could surface that information in some way.

> 4) Does the API provide a mechanism for creating "delegated traces"?
> What I mean by this is that in some circumstances some thread may need
> to create traces on behalf of some other element which may not have such
> capability. For example, a mobile device may have some custom tracing
> mechanism, and attach the information to a request for the server.
> The server would then need to create the HTrace trace from the existing
> data passed in the request (including timestamps).

Sure. In this case, the server can just create a span from JSON the
client sent using the MilliSpanDeserializer. If you don't want to use
JSON for some reason, you can construct an arbitrary span object using
MilliSpan#Builder.

> Let me know if there is interest in discussing changes at this level.
> Thanks,
>                    Roberto

Sure. I have to warn you that we have a strong bias towards compatible
changes, though. It is difficult to get all the downstream projects to
change how they use the API, even when there is a strong reason to
change. Almost as hard as getting Hadoop to do a new release :)

I understand that. To be honest I have a clean-room implementation of an
API based on my previous experiences with tracing, but I'm trying to see
whether this could be captured by extensions to the existing HTrace API.

I'm curious if you have a project you are thinking about instrumenting
with HTrace. We would love to hear more about how people are using
HTrace or plan to use it, so we can build what people want.

I don't have a specific project right now. I've been working on tracing
at Cisco and Facebook in the last few years, and I'm in between gigs
right now, so I'm interested in crystallizing my experience into an open
source framework.

cheers,
Colin
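
As a rough illustration of the point made under item 3 (a configured maximum expected clock skew combined with cause-effect edges), here is a small sketch. Everything in it is an assumption made for the example: `Point`, `build_order_edges`, `definitely_before`, the integer millisecond timestamps, and the single skew bound applied between any two different clock domains. None of it comes from HTrace.

```python
from collections import namedtuple

# A point is a timestamped event recorded in some clock domain (e.g. a host).
Point = namedtuple("Point", "name domain timestamp_ms")


def build_order_edges(points, causal_edges, max_skew_ms):
    """Return {name: set of names known to come later}.

    An edge a -> b is added when (a, b) is an explicit cause-effect pair,
    or when b's timestamp exceeds a's by more than the allowed skew
    (zero inside one domain, max_skew_ms across domains).
    """
    later = {p.name: set() for p in points}
    for cause, effect in causal_edges:
        later[cause].add(effect)
    for a in points:
        for b in points:
            bound = 0 if a.domain == b.domain else max_skew_ms
            if b.timestamp_ms - a.timestamp_ms > bound:
                later[a.name].add(b.name)
    return later


def definitely_before(later, first, second):
    """Depth-first search: is `second` reachable from `first`?"""
    stack, seen = [first], set()
    while stack:
        node = stack.pop()
        for nxt in later[node]:
            if nxt == second:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


# One caller thread on host A makes two sequential RPCs to hosts B and C.
points = [
    Point("call1", "A", 10000),
    Point("call2", "A", 10002),
    Point("begin1", "B", 10050),
    Point("begin2", "C", 10049),
]
causal = [("call1", "begin1"), ("call2", "begin2")]
later = build_order_edges(points, causal, max_skew_ms=100)
```

In this example `call1` is known to precede `begin2` only through the chain `call1 -> call2 -> begin2`, since the raw timestamps across hosts differ by less than the skew bound. The relative order of `begin1` and `begin2` stays undetermined, which is the honest answer given a 100 ms bound. The sketch assumes the configured bound actually holds; if it does not, the edge set could contain cycles.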