.. Parsl wide event observability prototype report documentation master file, created by
   sphinx-quickstart on Sun Oct 26 10:13:01 2025. You can adapt this file completely
   to your liking, but it should at least contain the root `toctree` directive.

Parsl wide event observability prototype report
===============================================

.. toctree::
   :caption: Contents:

Introduction
------------

a motivating use case
---------------------

::

    === About task 358 ===
    2025-10-26 10:29:46.467298 MainThread@117098 Task 358: will be sent to executor htex_Local (parsl.log)
    2025-10-26 10:29:46.467412 MainThread@117098 Task 358: Adding output dependencies (parsl.log)
    2025-10-26 10:29:46.467484 MainThread@117098 Task 358: Added output dependencies (parsl.log)
    2025-10-26 10:29:46.467550 MainThread@117098 Task 358: Gathering dependencies: start (parsl.log)
    2025-10-26 10:29:46.467620 MainThread@117098 Task 358: Gathering dependencies: end (parsl.log)
    2025-10-26 10:29:46.467685 MainThread@117098 Task 358: submitted for App random_uuid, not waiting on any dependency (parsl.log)
    2025-10-26 10:29:46.467752 MainThread@117098 Task 358: has AppFuture: (parsl.log)
    2025-10-26 10:29:46.467818 MainThread@117098 Task 358: initializing state to pending (parsl.log)
    2025-10-26 10:29:46.469992 Task-Launch_0@117098 Task 358: changing state from pending to launched (parsl.log)
    2025-10-26 10:29:46.470113 Task-Launch_0@117098 Task 358: try 0 launched on executor htex_Local with executor id 340 (parsl.log)
    2025-10-26 10:29:46.470240 Task-Launch_0@117098 Task 358: Standard out will not be redirected. (parsl.log)
    2025-10-26 10:29:46.470310 Task-Launch_0@117098 Task 358: Standard error will not be redirected. (parsl.log)
    2025-10-26 10:29:46.470336 MainThread@117129 HTEX task 340: putting onto pending_task_queue (interchange log)
    2025-10-26 10:29:46.470404 MainThread@117129 HTEX task 340: fetched task (interchange log)
    2025-10-26 10:29:46.470815 Interchange-Communicator@117144 Putting HTEX task 340 into scheduler (Pool manager log)
    2025-10-26 10:29:46.471166 MainThread@117162 HTEX task 340: received executor task (Pool worker log)
    2025-10-26 10:29:46.492449 MainThread@117162 HTEX task 340: Completed task (Pool worker log)
    2025-10-26 10:29:46.492742 MainThread@117162 HTEX task 340: All processing finished for task (Pool worker log)
    2025-10-26 10:29:46.493508 MainThread@117129 HTEX task 340: Manager b'4f65802901c6': Removing task from manager (interchange log)
    2025-10-26 10:29:46.493948 HTEX-Result-Queue-Thread@117098 Task 358: changing state from launched to exec_done (parsl.log)
    2025-10-26 10:29:46.494729 HTEX-Result-Queue-Thread@117098 Task 358: Standard out will not be redirected. (parsl.log)
    2025-10-26 10:29:46.494905 HTEX-Result-Queue-Thread@117098 Task 358: Standard error will not be redirected. (parsl.log)

This integrates four log files and two task identifier systems into a single sequence of events.

case study 2
------------

Debugging monitoring performance as part of developing this prototype.

The findcommon tool finds the common sequence of events across templated log lines and outputs that sequence.

First run parsl-perf like this:

::

    parsl-perf --config parsl/tests/configs/htex_local.py
    [...]
    ==== Iteration 3 ====
    Will run 58179 tasks to target 120 seconds runtime
    Submitting tasks / invoking apps
    All 58179 tasks submitted ... waiting for completion
    Submission took 103.880 seconds = 560.059 tasks/second
    Runtime: actual 137.225s vs target 120s
    Tasks per second: 423.967
    Tests complete - leaving DFK block
which executes a total of around 60000 tasks.

First, note that on my laptop this prototype benchmarks significantly slower than the contemporaneous master branch, which achieves around 900 tasks per second on a 120 second benchmark.

That's perhaps unsurprising: this benchmark is incredibly log sensitive, as my previous posts have noted (TODO: link to blog post and to R-performance report), and this prototype adds a lot of log output. Part of the path to productionisation would be understanding and constraining this.

From the output above, it is clear that the submission loop is taking a long time: 100 seconds, with about 35 seconds of execution happening afterwards. The Parsl core should be able to process task submissions much faster than 560 tasks per second. So what's taking up time there?

Run findcommon (a could-be-modular-but-isn't helper from this observability prototype) on the result:

::

    0.0: Task %s: will be sent to executor htex_local
    0.00023320618468031343: Task %s: Adding output dependencies
    0.0004515730863634116: Task %s: Added output dependencies
    0.000672943356177761: Task %s: Gathering dependencies: start
    0.0008952160973877195: Task %s: Gathering dependencies: end
    0.0011054732824941516: Task %s: submitted for App app, not waiting on any dependency
    0.001316777690507145: Task %s: has AppFuture: %s
    0.0015680651123983979: Task %s: initializing state to pending
    23.684763520758917: HTEX task %s: putting onto pending_task_queue
    23.68483662049256: HTEX task %s: fetched task
    23.684863335335613: Task %s: changing state from pending to launched
    23.6850573607536: Task %s: try %s launched on executor %s with executor id %s
    23.685248910492184: Task %s: Standard out will not be redirected.
    23.685424046734745: Task %s: Standard error will not be redirected.
    23.686276226995773: Putting HTEX task %s into scheduler
    23.686777094898495: HTEX task %s: received executor task
    23.687025900194147: HTEX task %s: Completed task
    23.687268549254735: HTEX task %s: All processing finished for task
    23.687837933843614: HTEX task %s: Manager %r: Removing task from manager
    23.688483699079185: Task %s: changing state from launched to exec_done
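For concreteness, the core of what findcommon appears to compute - the mean offset of each templated log line from the first event of its task - can be sketched like this. This is my rough, hypothetical reimplementation over already-parsed events, not the prototype's actual code:

```python
from collections import defaultdict
from statistics import mean

def find_common(events):
    """events: (task_id, template, timestamp) tuples for templated log lines.
    Returns (mean offset from the task's first event, template) pairs,
    sorted by offset - the same shape as the findcommon output above."""
    by_task = defaultdict(list)
    for task_id, template, ts in events:
        by_task[task_id].append((ts, template))

    offsets = defaultdict(list)
    for records in by_task.values():
        records.sort()
        t0 = records[0][0]  # first event for this task
        for ts, template in records:
            offsets[template].append(ts - t0)

    return sorted((mean(v), template) for template, v in offsets.items())

# toy trace: two tasks sharing the same templated event sequence
events = [
    (1, "Task %s: initializing state to pending", 100.0),
    (1, "Task %s: changing state from pending to launched", 123.0),
    (2, "Task %s: initializing state to pending", 200.0),
    (2, "Task %s: changing state from pending to launched", 225.0),
]
for offset, template in find_common(events):
    print(f"{offset}: {template}")
```

This sketch assumes every task emits each template exactly once; the real tool presumably also handles extracting templates from the raw log text.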
In this stylised synthetic task trace, a task takes an average of 23 seconds to go from the first event (choosing an executor) to the final marking as done. That's fairly consistent with the parsl-perf output - I would expect the average here to be around half of parsl-perf's submission-to-completion time (30 seconds).

What's useful about findcommon's output is that it shows the inside of Parsl's workings in more depth: 20 states instead of parsl-perf's start, submitted and end. And the potential exists to calculate other statistics on these events.

So in this average case, there's something slow happening between setting the task to pending, and the task "simultaneously" being marked as launched on the submit side and the interchange receiving it and placing it in the pending task queue.

That's a bit surprising - tasks are meant to accumulate in the interchange, not before the interchange.

So let's perform some deeper investigations -- observability is for Serious Investigators and so it is fine to be hacking on the Parsl source code to understand this more. (By hacking, I mean making temporary changes for the investigation that will likely be thrown away rather than integrated into master.)
Let's flesh out the whole submission process with some more log lines. On the DFK side, that's pretty straightforward: the observability prototype has a per-task logger which, if you have the task record, will attach log messages to the task.

For example, here are the changes to add a log around the first call to launch_if_ready, which is probably the call that is launching the task:

::

    + task_logger.debug("TMP: dependencies added, calling launch_if_ready")
      self.launch_if_ready(task_record)
    + task_logger.debug("TMP: launch_if_ready returned")

My suspicion is that this is around the htex submission queues, with a secondary submission around the launch executor, so to start with I'm going to add more logging around that.
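As an aside, the per-task logger mentioned above could plausibly be built on the standard library's ``logging.LoggerAdapter``, which rewrites each message and attaches extra fields. This is a hedged sketch of that pattern, not the prototype's actual implementation; the ``task_id`` field name is my assumption:

```python
import logging

class TaskLoggerAdapter(logging.LoggerAdapter):
    """Prefix messages with the task id and attach it as a structured field,
    so a JSON handler could later emit it as its own key."""

    def process(self, msg, kwargs):
        extra = kwargs.setdefault("extra", {})
        extra["task_id"] = self.extra["task_id"]
        return f"Task {self.extra['task_id']}: {msg}", kwargs

logging.basicConfig(level=logging.DEBUG, format="%(message)s")
task_logger = TaskLoggerAdapter(logging.getLogger("parsl"), {"task_id": 358})

# logs: Task 358: TMP: dependencies added, calling launch_if_ready
task_logger.debug("TMP: dependencies added, calling launch_if_ready")
```

The appeal of the adapter approach is that call sites stay plain `logger.debug(...)` calls while every record carries the task identity.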
Then rerun parsl-perf and findcommon, without modifying either, and it turns out to be that secondary submission, the launch executor:

::

    0.0020453477688227: Task %s: TMP: submitted into launch pool executor
    0.002256870306434224: Task %s: TMP: launch_if_ready returned
    14.073021359217009: Task %s: TMP: before submitter lock
    [...]
    14.078550367412324: Task %s: changing state from launched to exec_done

Don't worry too much about the final time (14s) changing from 23s in the earlier run -- that's a characteristic of parsl-perf batch sizes that I'm working on in another branch.

If that's the case, I'd expect the thread pool executor, previously much faster than htex, to show similar characteristics:

Surprisingly, although the throughput is not much higher, the trace looks very different time-wise. The bulk of the time still happens at the same place, but there isn't so much waiting there in absolute terms - less than a second on average. That's possibly because the executor can get through tasks much faster, so the queue doesn't build up so much?

::

    ==== Iteration 2 ====
    Will run 68976 tasks to target 120 seconds runtime
    Submitting tasks / invoking apps
    All 68976 tasks submitted ... waiting for completion
    Submission took 117.915 seconds = 584.965 tasks/second
    Runtime: actual 118.417s vs target 120s
    Tasks per second: 582.485
::

    0.0: Task %s: will be sent to executor threads
    0.00014157412110423425: Task %s: Adding output dependencies
    0.0002898652725047201: Task %s: Added output dependencies
    0.000425118042214259: Task %s: Gathering dependencies: start
    0.0005696294991521399: Task %s: Gathering dependencies: end
    0.0006999648174108608: Task %s: submitted for App app, not waiting on any dependency
    0.0008433702196425292: Task %s: has AppFuture: %s
    0.0010710284919573986: Task %s: initializing state to pending
    0.0011652027385929428: Task %s: TMP: dependencies added, calling launch_if_ready
    0.0012973675719411494: Task %s: submitting into launch pool executor
    0.0014397921284467212: Task %s: submitted into launch pool executor
    0.0015767665501452072: Task %s: TMP: launch_if_ready returned
    0.3143575128217656: Task %s: before submitter lock
    0.31448896150771743: Task %s: after submitter lock, before executor.submit
    0.3146383380777917: Task %s: after before executor.submit
    0.3147926810507091: Task %s: changing state from pending to launched
    0.3149239369413048: Task %s: try 0 launched on executor threads
    0.31504996538376506: Task %s: Standard out will not be redirected.
    0.31504996538376506: Task %s: Standard out will not be redirected.
    0.3151759985402679: Task %s: Standard error will not be redirected.
    0.3151759985402679: Task %s: Standard error will not be redirected.
    0.315319734920821: Task %s: changing state from launched to exec_done

So maybe I can do some graphing of events to give more insight than these averages are showing. A favourite of mine from previous monitoring work is how many tasks are in each state at each moment in time. I'll have to implement that for this observability prototype, because it's not done already, but once done it should be reusable, and it should share most infrastructure with `findcommon`. Especially relevant is discovering where the bottlenecks are: it looks like this is a parsl-affecting performance regression that might be keeping workers idle. For example, we could ask: does the interchange have "enough" tasks at all times to keep dispatching? With 8 cores on my laptop, I'd like it to have at least 8 tasks or so inside htex at any one time, but this looks like it might not be true. Hopefully graphing will reveal more. It's also important to note that this findcommon output shows latency, not throughput -- though high latency at particular points is an indication of throughput problems.
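The "tasks in each state over time" view can be derived entirely from the state-change events that already appear in the logs. A minimal sketch of that derivation (my own, assuming transitions have already been parsed out of the log text):

```python
from collections import Counter

def states_over_time(transitions):
    """transitions: (timestamp, task_id, new_state) tuples, e.g. parsed from
    'Task %s: changing state from X to Y' log lines.
    Yields (timestamp, snapshot of how many tasks are in each state)."""
    current = {}        # task_id -> current state
    counts = Counter()  # state -> number of tasks currently in it
    for ts, task_id, new_state in sorted(transitions):
        old_state = current.get(task_id)
        if old_state is not None:
            counts[old_state] -= 1
        current[task_id] = new_state
        counts[new_state] += 1
        yield ts, dict(counts)

transitions = [
    (1.0, "t1", "pending"),
    (2.0, "t2", "pending"),
    (3.0, "t1", "launched"),
    (4.0, "t1", "exec_done"),
]
for ts, snapshot in states_over_time(transitions):
    print(ts, snapshot)
```

Each snapshot could then be plotted with matplotlib as stacked per-state counts against time.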
Or, I can look at how many tasks are in the interchange over time: there either is, or straightforwardly can be, a log line for that. That fits a different model to the above log lines, which are per-task. Instead it is a metric on the state of one thing only: the interchange, of which there is only one, at least for the purposes of this investigation.

Add a new log line like this into the interchange at a suitable point (after task queueing, for example):

::

    + ql = len(self.pending_task_queue)
    + logger.info(f"TMP: there are {ql} tasks in the pending task queue",
    +             extra={"metric": "pending_task_queue_length", "queued_tasks": ql})

Now we can either look through the logs by hand to see the value, or extract it programmatically and plot it with matplotlib in an ad-hoc script:

.. code-block:: python

    import matplotlib.pyplot as plt

    from parsl.observability.getlogs import getlogs

    logs = getlogs()

    # looking for the records logged with:
    #   extra={"metric": "pending_task_queue_length", "queued_tasks": ql}
    metrics = [(float(l['created']), int(l['queued_tasks']))
               for l in logs
               if 'metric' in l and l['metric'] == "pending_task_queue_length"]

    plt.scatter(x=[m[0] for m in metrics], y=[m[1] for m in metrics])
    plt.show()

And indeed that shows that the interchange queue length almost never goes above length 1, and never above length 10.

That's enough for now, but it's a use case that shows partially understanding throughput: we can see from this observability data that the conceptual 50000-task queue that begins in parsl-perf as a `for`-loop doesn't progress fast enough into the interchange's internal queue, so performance effort should probably be focused on understanding and improving the code path around launch and getting into the interchange queue. With an almost empty interchange queue, anything happening on the worker side is probably not too relevant, at least for that parsl-perf use case.
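As a back-of-envelope cross-check (my own addition, assuming roughly steady-state behaviour during submission): Little's law relates the numbers measured above, and suggests that thousands of tasks must be sitting in queues somewhere before the interchange - consistent with the interchange's own queue being nearly empty.

```python
# Little's law: L = lambda * W
# (tasks in the system = arrival rate x average time in the system)
arrival_rate = 560.0  # tasks/second, from the htex parsl-perf run above
latency = 23.0        # seconds, average first-event-to-exec_done from findcommon
in_flight = arrival_rate * latency
print(f"~{in_flight:.0f} tasks in flight on average")  # ~12880
```

That many tasks cannot fit in an interchange queue that rarely exceeds length 1, so the backlog must be upstream of it.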
With an almost empty interchange queue, anything happening on the worker side is probably not too relevant, at least for that parsl-perf use case.h](hThat’s enough for now, but it’s a usecase that shows partially understanding throughput: we can see from this observability data that the conceptual 50000 task queue that begins in parsl-perf as a }(hj h&hh'Nh)Nubj)}(h`for`h]hfor}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%jhj ubhX{-loop doesn’t progress fast enough to the interchange internal queue, and so probably performance effort should probably be focused on understanding and improving the code path around launch and getting into the interchange queue. With an almost empty interchange queue, anything happening on the worker side is probably not too relevant, at least for that parsl-perf use case.}(hj h&hh'Nh)Nubeh}(h]h]h]h]h!]uh%hwh'h(h)Mhjh&hubhx)}(hThis "understand the queue lengths (or implicit queue lengths) towards execution" investigation style has been useful in understanding Parsl performance limitations in the past.h]hThis “understand the queue lengths (or implicit queue lengths) towards execution” investigation style has been useful in understanding Parsl performance limitations in the past.}(hj)h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhjh&hubeh}(h] case-study-2ah]h] case study 2ah]h!]uh%h*hh,h&hh'h(h)KWubh+)}(hhh](h0)}(hConfigurabilityh]hConfigurability}(hjBh&hh'Nh)Nubah}(h]h]h]h]h!]uh%h/hj?h&hh'h(h)Mubhx)}(hA soft start is to let people opt into observability style logs - with most performance hit coming from turning on json output, i think, it doesn't matter performance-wise too much about adding in the extra stuff on log calls.h]hA soft start is to let people opt into observability style logs - with most performance hit coming from turning on json output, i think, it doesn’t matter performance-wise too much about adding in the extra stuff on log calls.}(hjPh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M hj?h&hubhx)}(hXThe current parsl stuff is not set up for arbitrary 
log configuration outside of the submit-side process: for example, the worker helpers don't do any log config at all and rely on their enclosing per-executor environments to do it, which i think some do not.h]hXThe current parsl stuff is not set up for arbitrary log configuration outside of the submit-side process: for example, the worker helpers don’t do any log config at all and rely on their enclosing per-executor environments to do it, which i think some do not.}(hj^h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M hj?h&hubhx)}(hYhtex interchange and worker logs have a hardcoded log config with a single debug boolean.h]hYhtex interchange and worker logs have a hardcoded log config with a single debug boolean.}(hjlh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhj?h&hubhx)}(hI'd like to do something a bit more flexible than adding more parameters, that reflect that in the future people might want to configure their handlers differently rather than using the JSONHandler.h]hI’d like to do something a bit more flexible than adding more parameters, that reflect that in the future people might want to configure their handlers differently rather than using the JSONHandler.}(hjzh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhj?h&hubeh}(h]configurabilityah]h]configurabilityah]h!]uh%h*hh,h&hh'h(h)Mubh+)}(hhh](h0)}(h*Relation to monitoring and logging systemsh]h*Relation to monitoring and logging systems}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%h/hjh&hh'h(h)Mubhx)}(hParsl uses python's logging module. Parsl has its own monitoring system. This work builds on both of those, but changes how thye are used significantly.h]hParsl uses python’s logging module. Parsl has its own monitoring system. This work builds on both of those, but changes how thye are used significantly.}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhjh&hubeh}(h]*relation-to-monitoring-and-logging-systemsah]h]*relation to monitoring and logging systemsah]h!]uh%h*hh,h&hh'h(h)Mubh+)}(hhh](h0)}(h Why observability? vs monitoringh]h Why observability? 
vs monitoring}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%h/hjh&hh'h(h)Mubhx)}(hXhOriginal monitoring prototype was focused on what is happening with Parsl user level concepts: tasks, blocks for example as they move through simple states. Anything deeper is part of the idea of "Parsl makes it so you don't have to think about anything happening inside". Which is not how things are in reality: neither for code reliabilty or for performance.h]hXnOriginal monitoring prototype was focused on what is happening with Parsl user level concepts: tasks, blocks for example as they move through simple states. Anything deeper is part of the idea of “Parsl makes it so you don’t have to think about anything happening inside”. Which is not how things are in reality: neither for code reliabilty or for performance.}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M!hjh&hubhx)}(h3Parsl Monitoring is too strict in a couple of ways:h]h3Parsl Monitoring is too strict in a couple of ways:}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M(hjh&hubhx)}(hThe data model is fairly hard-coded into the architecture: specific SQL schema, specific message formats and specific places where those messages are sent.h]hThe data model is fairly hard-coded into the architecture: specific SQL schema, specific message formats and specific places where those messages are sent.}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M*hjh&hubhx)}(hThe transmission model is real-time. Even with recent radio plugins, the assumption is still that messages will arrive soon after being sent.h]hThe transmission model is real-time. 
Even with recent radio plugins, the assumption is still that messages will arrive soon after being sent.}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M.hjh&hubhx)}(hXThe almost-real-time data transmission model is especially awkward when combined with SQL: distributed system events will arrive at different times or in the original UDP model perhaps not at all, and the "first" message that creates a task (for the purposes of the database) might arrive after some secondary data that requires that primary key to exist. Yes, it's nice for the SQL database to follow foreign key rules, especially when looking at the data "afterwards" but that's not realistic for distributed unreliable events.h]hXThe almost-real-time data transmission model is especially awkward when combined with SQL: distributed system events will arrive at different times or in the original UDP model perhaps not at all, and the “first” message that creates a task (for the purposes of the database) might arrive after some secondary data that requires that primary key to exist. Yes, it’s nice for the SQL database to follow foreign key rules, especially when looking at the data “afterwards” but that’s not realistic for distributed unreliable events.}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M2hjh&hubhx)}(hContrast this to:h]hContrast this to:}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M<hjh&hubh)}(hhh](h)}(hxarbitrary logs that may be different for different kinds of execution - for example, different executor implementations h]hx)}(hwarbitrary logs that may be different for different kinds of execution - for example, different executor implementationsh]hwarbitrary logs that may be different for different kinds of execution - for example, different executor implementations}(hj#h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M>hjubah}(h]h]h]h]h!]uh%hh'h(h)M>hjh&hubh)}(hX poring over these logs "later" - there's no need for those logs to accumulate in real time in one place for post-facto analysis. 
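A minimal sketch of the alternative to foreign-key-ordered insertion, with invented event dicts and field names: accumulate events keyed by task, so arrival order never matters and no primary "task created" row needs to exist before secondary data arrives.

```python
from collections import defaultdict

# Accept events in any arrival order: group by task id instead of
# requiring a primary "task created" row to exist before secondary rows.
events_by_task = defaultdict(list)

def ingest(event):
    # "task_id" and "t" are invented field names for this sketch.
    events_by_task[event["task_id"]].append(event)

# A result event arriving before its launch event is not an error here.
ingest({"task_id": 358, "state": "exec_done", "t": 2.0})
ingest({"task_id": 358, "state": "launched", "t": 1.0})

# Ordering can be recovered afterwards from timestamps, when present.
ordered = sorted(events_by_task[358], key=lambda e: e["t"])
states = [e["state"] for e in ordered]
```

The point of the sketch is that consistency is something the analysis imposes afterwards, not something the ingest path enforces at arrival time.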
And in practice, when doing log analysis rather than monitoring analysis, "send me a tarball of your runinfo" is a standard technique. h]hx)}(hXporing over these logs "later" - there's no need for those logs to accumulate in real time in one place for post-facto analysis. And in practice, when doing log analysis rather than monitoring analysis, "send me a tarball of your runinfo" is a standard technique.h]hXporing over these logs “later” - there’s no need for those logs to accumulate in real time in one place for post-facto analysis. And in practice, when doing log analysis rather than monitoring analysis, “send me a tarball of your runinfo” is a standard technique.}(hj;h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)MAhj7ubah}(h]h]h]h]h!]uh%hh'h(h)MAhjh&hubeh}(h]h]h]h]h!]jjuh%hh'h(h)M>hjh&hubhx)}(hXxParsl Monitoring is not well suited to adding in new ad-hoc events, perhaps just for one-off debugging cases that will be thrown away: schema modifications in both SQL and in message tasks, and rearranging code to accommodate that is a serious business. Contrast logging: there's always a logger in any part of Parsl, and you can ``logger.debug("hello")`` pretty much anywhere.h](hXKParsl Monitoring is not well suited to adding in new ad-hoc events, perhaps just for one-off debugging cases that will be thrown away: schema modifications in both SQL and in message tasks, and rearranging code to accommodate that is a serious business. Contrast logging: there’s always a logger in any part of Parsl, and you can }(hjUh&hh'Nh)Nubh literal)}(h``logger.debug("hello")``h]hlogger.debug("hello")}(hj_h&hh'Nh)Nubah}(h]h]h]h]h!]uh%j]hjUubh pretty much anywhere.}(hjUh&hh'Nh)Nubeh}(h]h]h]h]h!]uh%hwh'h(h)MGhjh&hubhx)}(hParsl Monitoring was also implemented with a fixed queries / dashboard mindset: one set of views that is expected to be sufficient. 
As time has shown, people like to make other outputs from this data.h]hParsl Monitoring was also implemented with a fixed queries / dashboard mindset: one set of views that is expected to be sufficient. As time has shown, people like to make other outputs from this data.}(hjwh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)MOhjh&hubhx)}(h\want to debug/profile what's happening *inside parsl* rather than *inside the user workflow*.h](h&want to debug/profile what's happening }(hjh&hh'Nh)Nubh emphasis)}(h*inside parsl*h]h inside parsl}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%jhjubh rather than }(hjh&hh'Nh)Nubj)}(h*inside the user workflow*h]hinside the user workflow}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%jhjubh.}(hjh&hh'Nh)Nubeh}(h]h]h]h]h!]uh%hwh'h(h)MThjh&hubeh}(h]why-observability-vs-monitoringah]h] why observability? vs monitoringah]h!]uh%h*hh,h&hh'h(h)Mubh+)}(hhh](h0)}(hOther componentsh]hOther components}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%h/hjh&hh'h(h)MXubhx)}(hSome components aren't Parsl-aware: for example Work Queue has no notion of a Parsl task ID. and it runs its own logging system, that is not Python, and so not amenable to Python monitoring radios.h]hSome components aren’t Parsl-aware: for example Work Queue has no notion of a Parsl task ID. 
and it runs its own logging system, that is not Python, and so not amenable to Python monitoring radios.}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)MZhjh&hubhx)}(hAnother example: swap out interchange impl for a different one with a different internal model: a schema of events for task progress through the original interchange doesn't necessarily work for some other implementation.h]hAnother example: swap out interchange impl for a different one with a different internal model: a schema of events for task progress through the original interchange doesn’t necessarily work for some other implementation.}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M_hjh&hubhx)}(hlZMQ generates log messages which have been useful sometimes and these could be gatewayed into observability.h]hlZMQ generates log messages which have been useful sometimes and these could be gatewayed into observability.}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mdhjh&hubhx)}(hXLots of different identifier spaces, loosely structured, not necessarily hierarchical: for example, an htex task is not necessarily "inside" a Parsl task, as htex can be used outside of a Parsl DFK (which is where the notion of Parsl task lives). An htex task often runs in a unix process but that process also runs many htex tasks, and an htex task also has extent outside of that worker process: there's no containment relationship either way.h]hXLots of different identifier spaces, loosely structured, not necessarily hierarchical: for example, an htex task is not necessarily “inside” a Parsl task, as htex can be used outside of a Parsl DFK (which is where the notion of Parsl task lives). 
An htex task often runs in a unix process but that process also runs many htex tasks, and an htex task also has extent outside of that worker process: there’s no containment relationship either way.}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mghjh&hubeh}(h]other-componentsah]h]other componentsah]h!]uh%h*hh,h&hh'h(h)MXubh+)}(hhh](h0)}(h Optional pieces of observabilityh]h Optional pieces of observability}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%h/hjh&hh'h(h)Mqubhx)}(hlog levels - INFO vs DEBUGh]hlog levels - INFO vs DEBUG}(hj#h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mshjh&hubhx)}(hOmissing log files - eg. start with parsl.log, add in more files for more detailh]hOmissing log files - eg. start with parsl.log, add in more files for more detail}(hj1h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Muhjh&hubhx)}(hsecurity - not so much the Parsl core use case, but eg GC executor vs GC endpoint logs vs GC central services have different security properties.h]hsecurity - not so much the Parsl core use case, but eg GC executor vs GC endpoint logs vs GC central services have different security properties.}(hj?h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mxhjh&hubhx)}(hThe observability approach needs to accommodate that, for any/all reasons, some events won't be there. There can't be a "complete set of events" to complain about being incomplete.h]hThe observability approach needs to accommodate that, for any/all reasons, some events won’t be there. 
There can’t be a “complete set of events” to complain about being incomplete.}(hjMh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M|hjh&hubhx)}(huWith less data, the reports in whatever form are less informative, to the extent that the lack of data makes them so.h]huWith less data, the reports in whatever form are less informative, to the extent that the lack of data makes them so.}(hj[h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhjh&hubeh}(h] optional-pieces-of-observabilityah]h] optional pieces of observabilityah]h!]uh%h*hh,h&hh'h(h)Mqubh+)}(hhh](h0)}(hScope for other hackingh]hScope for other hacking}(hjth&hh'Nh)Nubah}(h]h]h]h]h!]uh%h/hjqh&hh'h(h)Mubhx)}(hshould be easy to add other events - the core observability model shouldn't be prescriptive about what events exist, or what they look like.h]hshould be easy to add other events - the core observability model shouldn’t be prescriptive about what events exist, or what they look like.}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhjqh&hubhx)}(h&should be easy to use them in analysish]h&should be easy to use them in analysis}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhjqh&hubhx)}(hDshould be easy to import some other event stream, in whatever formath]hDshould be easy to import some other event stream, in whatever format}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhjqh&hubhx)}(hXnotions of moving logs around to a place of analysis should not be baked into the architecture. Realtime options, file-based options, ... - that is an area for experimentation (see Chronolog) and this work should facilitate that rather than being prescriptive.h]hXnotions of moving logs around to a place of analysis should not be baked into the architecture. Realtime options, file-based options, … - that is an area for experimentation (see Chronolog) and this work should facilitate that rather than being prescriptive.}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhjqh&hubhx)}(hstorage and query of logs is also an area for experimentation. 
there are lots of hosted commercial services. lots of small scale stuff:h]hstorage and query of logs is also an area for experimentation. there are lots of hosted commercial services. lots of small scale stuff:}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhjqh&hubhx)}(hX2eg. at small enough scale, parsing logs into a python session and using e.g. set and list comprehensions is a legitimate way to analyse things (rather than something awkwardly shameful that will be replaced by The Real Thing later) - especially given Parsl users general exposure to data science in Python.h]hX2eg. at small enough scale, parsing logs into a python session and using e.g. set and list comprehensions is a legitimate way to analyse things (rather than something awkwardly shameful that will be replaced by The Real Thing later) - especially given Parsl users general exposure to data science in Python.}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhjqh&hubhx)}(hXIgnore Parsl Monitoring per-task resource monitoring and do something else that generates similar observability records. This was always some disappointment with getting WQ resource monitoring into the Parsl monitoring database: what exists there that could be imported?h]hXIgnore Parsl Monitoring per-task resource monitoring and do something else that generates similar observability records. This was always some disappointment with getting WQ resource monitoring into the Parsl monitoring database: what exists there that could be imported?}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhjqh&hubhx)}(hInside Python parts of Parsl, this data *is* available in realtime at the point of logging as it goes to whatever LogHandler is running in each python process. 
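A concrete sketch of that small-scale comprehension style, assuming one JSON object per log line as a JSONHandler-like handler might write (the field names here are invented for illustration):

```python
import json

# Three invented wide-event log lines, one JSON object per line.
log_text = """\
{"task_id": 358, "event": "pending"}
{"task_id": 358, "event": "launched"}
{"task_id": 359, "event": "pending"}
"""

# Parse the whole log into a list of dicts...
events = [json.loads(line) for line in log_text.splitlines()]

# ...then use set comprehensions as an ad-hoc query language:
# which tasks launched, and which tasks appear but never launched?
launched = {e["task_id"] for e in events if e["event"] == "launched"}
never_launched = {e["task_id"] for e in events} - launched
```

The whole "query engine" here is the Python session itself, which is exactly the point: no schema migration or dashboard stands between the events and the question.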
that isn't true in general on the "event model" side of things, though.h](h(Inside Python parts of Parsl, this data }(hjh&hh'Nh)Nubj)}(h*is*h]his}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%jhjubh available in realtime at the point of logging as it goes to whatever LogHandler is running in each python process. that isn’t true in general on the “event model” side of things, though.}(hjh&hh'Nh)Nubeh}(h]h]h]h]h!]uh%hwh'h(h)Mhjqh&hubeh}(h]scope-for-other-hackingah]h]scope for other hackingah]h!]uh%h*hh,h&hh'h(h)Mubh+)}(hhh](h0)}(hTarget audienceh]hTarget audience}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%h/hj h&hh'h(h)Mubhx)}(h serious debugger/profiler peopleh]h serious debugger/profiler people}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhj h&hubhx)}(h|not management-dashboard types - although management dashboards absolutely should be creatable with this observability data.h]h|not management-dashboard types - although management dashboards absolutely should be creatable with this observability data.}(hj+h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhj h&hubeh}(h]target-audienceah]h]target audienceah]h!]uh%h*hh,h&hh'h(h)Mubh+)}(hhh](h0)}(hBuild your own stackh]hBuild your own stack}(hjDh&hh'Nh)Nubah}(h]h]h]h]h!]uh%h/hjAh&hh'h(h)Mubhx)}(hX|Lots of observability commentary online talks as if you are building your entire stack, to the extent that you care about observability. Parsl is much more a pile of configurable components stuck together, all with their own different options for observability/logging/monitoring, and without easy ability for someone to add a consistent model throughout the entire stack of code.h]hX|Lots of observability commentary online talks as if you are building your entire stack, to the extent that you care about observability. 
Parsl is much more a pile of configurable components stuck together, all with their own different options for observability/logging/monitoring, and without easy ability for someone to add a consistent model throughout the entire stack of code.}(hjRh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)MhjAh&hubeh}(h]build-your-own-stackah]h]build your own stackah]h!]uh%h*hh,h&hh'h(h)Mubh+)}(hhh](h0)}(hAlgebra of merging logsh]hAlgebra of merging logs}(hjkh&hh'Nh)Nubah}(h]h]h]h]h!]uh%h/hjhh&hh'h(h)Mubhx)}(hwideningh]hwidening}(hjyh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhjhh&hubhx)}(hjoiningh]hjoining}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhjhh&hubhx)}(h%post-facto relationship establishmenth]h%post-facto relationship establishment}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhjhh&hubhx)}(h relabellingh]h relabelling}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhjhh&hubhx)}(h4look at relational algebra for phrasing and conceptsh]h4look at relational algebra for phrasing and concepts}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhjhh&hubhx)}(hXnotion of identity and key-sequences: eg. parsl_dfk/parsl_task_id is a globally unique identifier for a parsl task across time and space, and so is parsl_dfk/executor_label/block_number or parsl_dfk/executor_label/manager_id/worker_number -- although manager ID is also (in short form) globally unique. this is distinct from the hierarchical relations between entities - although hierarchical identity keys will often line up with execution hierarchy.h]hXnotion of identity and key-sequences: eg. parsl_dfk/parsl_task_id is a globally unique identifier for a parsl task across time and space, and so is parsl_dfk/executor_label/block_number or parsl_dfk/executor_label/manager_id/worker_number – although manager ID is also (in short form) globally unique. 
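One possible way to phrase such key sequences, sketched with invented identifier values: a key is a plain tuple, and identity-prefix relationships can be tested directly, without implying any containment in the execution hierarchy.

```python
# Identity as a key sequence rather than a single scalar ID.
# The identifier values below are invented for illustration.
block_key = ("dfk-2025-10-26", "htex_Local", 0)             # dfk/executor_label/block_number
worker_key = ("dfk-2025-10-26", "htex_Local", "4f6580", 3)  # dfk/executor_label/manager_id/worker_number

def shares_prefix(a, b):
    """True when the shorter key sequence is a prefix of the longer one."""
    n = min(len(a), len(b))
    return a[:n] == b[:n]
```

Both keys share the dfk/executor prefix, but neither is a prefix of the other - matching the point that identity key sequences need not nest even when the entities are related.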
this is distinct from the hierarchical relations between entities - although hierarchical identity keys will often line up with execution hierarchy.}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhjhh&hubhx)}(hePeter Buneman's XML keys work did nested sequences of keys for identifying XML fragments, c. year 2000h]hePeter Buneman's XML keys work did nested sequences of keys for identifying XML fragments, c. year 2000}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhjhh&hubeh}(h]algebra-of-merging-logsah]h]algebra of merging logsah]h!]uh%h*hh,h&hh'h(h)Mubh+)}(hhh](h0)}(hWho wants this?h]hWho wants this?}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%h/hjh&hh'h(h)Mubhx)}(h At least me.h]h At least me.}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhjh&hubhx)}(hMany users don't come explicitly asking for monitoring-style information but do ask how to understand what's going on inside. And then they are excited to use monitoring when it exists.h]hMany users don’t come explicitly asking for monitoring-style information but do ask how to understand what’s going on inside. And then they are excited to use monitoring when it exists.}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhjh&hubeh}(h]who-wants-thisah]h]who wants this?ah]h!]uh%h*hh,h&hh'h(h)Mubh+)}(hhh](h0)}(h Browser UIh]h Browser UI}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%h/hjh&hh'h(h)Mubhx)}(h+what might a browser UI look like for this?h]h+what might a browser UI look like for this?}(hj)h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhjh&hubhx)}(hCompare parsl-visualize. Compare scrolling through logs, but with some more interactivity (eg. click / choose "show me logs from same dfk/task_id")h]hCompare parsl-visualize. Compare scrolling through logs, but with some more interactivity (eg. 
click / choose “show me logs from same dfk/task_id”)}(hj7h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhjh&hubeh}(h] browser-uiah]h] browser uiah]h!]uh%h*hh,h&hh'h(h)Mubh+)}(hhh](h0)}(h4Performance measurement of patch stack on 2025-10-27h]h4Performance measurement of patch stack on 2025-10-27}(hjPh&hh'Nh)Nubah}(h]h]h]h]h!]uh%h/hjMh&hh'h(h)Mubj)}(h_pip install -e . && parsl-perf --config parsl/tests/configs/htex_local.py --iterate=1,1,1,10000h]h_pip install -e . && parsl-perf --config parsl/tests/configs/htex_local.py --iterate=1,1,1,10000}hj^sbah}(h]h]h]h]h!]forcehighlight_args}h#h$jdefaultuh%jh'h(h)MhjMh&hubhx)}(hSRunning parsl-perf with constant block sizes (to avoid queue length speed changes):h]hSRunning parsl-perf with constant block sizes (to avoid queue length speed changes):}(hjph&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)MhjMh&hubhx)}(hAmaster branch (165fdc5bf663ab7fd0d3ea7c2d8d177b02d731c5) 1139 tpsh]hAmaster branch (165fdc5bf663ab7fd0d3ea7c2d8d177b02d731c5) 1139 tps}(hj~h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)MhjMh&hubhx)}(hmore-task-tied-logs: 1024h]hmore-task-tied-logs: 1024}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)MhjMh&hubh definition_list)}(hhh]h definition_list_item)}(hLjson-wide-log-records: 537 - but without initializing the JSONHandler: 1122 h](h term)}(hjson-wide-log-records: 537h]hjson-wide-log-records: 537}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%jh'h(h)Mhjubh definition)}(hhh]h)}(hhh]h)}(h/but without initializing the JSONHandler: 1122 h]hx)}(h.but without initializing the JSONHandler: 1122h]h.but without initializing the JSONHandler: 1122}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhjubah}(h]h]h]h]h!]uh%hh'h(h)Mhjubah}(h]h]h]h]h!]j-uh%hh'h(h)Mhjubah}(h]h]h]h]h!]uh%jh'h(h)Mhjubeh}(h]h]h]h]h!]uN8h%jh'h(h)Mhjubah}(h]h]h]h]h!]uh%jh'h(h)MhjMh&hubhx)}(h-end of branch with all changes up to now: 385h]h-end of branch with all changes up to now: 385}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)MhjMh&hubeh}(h]4performance-measurement-of-patch-stack-on-2025-10-27ah]h]4performance 
measurement of patch stack on 2025-10-27ah]h!]uh%h*hh,h&hh'h(h)Mubh+)}(hhh](h0)}(h"Applying this approach for academyh]h"Applying this approach for academy}(hj h&hh'Nh)Nubah}(h]h]h]h]h!]uh%h/hj h&hh'h(h)Mubhx)}(hXAs an extreme "data might not be there" -- perhaps Parsl isn't there at all. What does this code and these techniques look like applied to a similar but very different codebase, Academy, which doesn't have any distributed monitoring at all at the moment. There are ~100 log lines in the academy codebase right now. How much can this be converted in a few hours, and then analysed in similar ways?h]hXAs an extreme “data might not be there” – perhaps Parsl isn’t there at all. What does this code and these techniques look like applied to a similar but very different codebase, Academy, which doesn’t have any distributed monitoring at all at the moment. There are ~100 log lines in the academy codebase right now. How much can this be converted in a few hours, and then analysed in similar ways?}(hj h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M hj h&hubhx)}(h~The point here being both considering this as a real logging direction for academy, and as a proof-of-generality beyond Parsl.h]h~The point here being both considering this as a real logging direction for academy, and as a proof-of-generality beyond Parsl.}(hj# h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhj h&hubhx)}(h thoughts:h]h thoughts:}(hj1 h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)Mhj h&hubhx)}(hXrepr-of-ID-object might not be the correct format for logging: I want stuff that is nice strings for values, but repr (although it is a string) is more designed to look like a python code fragment rather than the core value of an object. Maybe ``str`` is better, and maybe some other way of representing the ID is better? 
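A sketch of that contrast, using an invented TaskId class: ``repr`` reads like a Python expression, while ``str`` yields a flat value that joins and groups well in database-style aggregate analysis.

```python
class TaskId:
    """Invented ID object, to contrast repr and str as log field formats."""

    def __init__(self, dfk, n):
        self.dfk = dfk
        self.n = n

    def __repr__(self):
        # Looks like a code fragment: fine at a debugger prompt, noisy as data.
        return f"TaskId(dfk={self.dfk!r}, n={self.n!r})"

    def __str__(self):
        # A flat, joinable value: better as a field in aggregate analysis.
        return f"{self.dfk}/{self.n}"

tid = TaskId("dfk-2025-10-26", 358)
```

Neither form is wrong; they serve different readers, and the observability point is that the logged field format should be chosen for the aggregate reader, not the human one.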
The point is to have values that work well in aggregate, database style analysis, not easy on the human eye.h](hrepr-of-ID-object might not be the correct format for logging: I want stuff that is nice strings for values, but repr (although it is a string) is more designed to look like a python code fragment rather than the core value of an object. Maybe }(hj? h&hh'Nh)Nubj^)}(h``str``h]hstr}(hjG h&hh'Nh)Nubah}(h]h]h]h]h!]uh%j]hj? ubh is better, and maybe some other way of representing the ID is better? The point is to have values that work well in aggregate, database style analysis, not easy on the human eye.}(hj? h&hh'Nh)Nubeh}(h]h]h]h]h!]uh%hwh'h(h)Mhj h&hubhx)}(hacademy logging so far focused on looking pretty on the console: eg ANSI colour - that's at the opposite end of the spectrum to what this observability project is trying to log.h]hacademy logging so far focused on looking pretty on the console: eg ANSI colour - that’s at the opposite end of the spectrum to what this observability project is trying to log.}(hj_ h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M hj h&hubhx)}(hxrule of thumb for initial conversion: whatever is substituted into the human message should be added as an extras field.h]hxrule of thumb for initial conversion: whatever is substituted into the human message should be added as an extras field.}(hjm h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M$hj h&hubhx)}(huother components this might be applied to: Globus Compute. Various related workflow systems that sit on top of Parsl.h]huother components this might be applied to: Globus Compute. Various related workflow systems that sit on top of Parsl.}(hj{ h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M'hj h&hubhx)}(hParsl contains two "mini-workflow-systems" on top of core Parsl: parsl-perf and pytest tests. It could be interesting to illustrate how those fit in without being a core part of Parsl observability.h]hParsl contains two “mini-workflow-systems” on top of core Parsl: parsl-perf and pytest tests. 
It could be interesting to illustrate how those fit in without being a core part of Parsl observability.}(hj h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M*hj h&hubeh}(h]"applying-this-approach-for-academyah]h]"applying this approach for academyah]h!]uh%h*hh,h&hh'h(h)Mubh+)}(hhh](h0)}(hSee alsoh]hSee also}(hj h&hh'Nh)Nubah}(h]h]h]h]h!]uh%h/hj h&hh'h(h)M0ubhx)}(h netloggerh]h netlogger}(hj h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M2hj h&hubhx)}(hmy dnpc work, an earlier iteration of this. more focused on human log parsing and so very fragile in the face of improving log messages, and not enough context in the human component.h]hmy dnpc work, an earlier iteration of this. more focused on human log parsing and so very fragile in the face of improving log messages, and not enough context in the human component.}(hj h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M4hj h&hubhx)}(h:chronolog https://grc.iit.edu/research/projects/chronolog/h](h chronolog }(hj h&hh'Nh)Nubj6)}(h0https://grc.iit.edu/research/projects/chronolog/h]h0https://grc.iit.edu/research/projects/chronolog/}(hj h&hh'Nh)Nubah}(h]h]h]h]h!]refurij uh%j5hj ubeh}(h]h]h]h]h!]uh%hwh'h(h)M6hj h&hubhx)}(hsyslogh]hsyslog}(hj h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M8hj h&hubeh}(h]see-alsoah]h]see alsoah]h!]uh%h*hh,h&hh'h(h)M0ubh+)}(hhh](h0)}(hIn-parsl Python codingh]hIn-parsl Python coding}(hj h&hh'Nh)Nubah}(h]h]h]h]h!]uh%h/hj h&hh'h(h)M;ubhx)}(h[Acknowledging observability as a first-order feature means we can make big changes to code.h]h[Acknowledging observability as a first-order feature means we can make big changes to code.}(hj h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M=hj h&hubhx)}(hXEvery log message needs to be visited to add context. 
In many places a bunch of that context can be added by helpers: for example, in my prototype, some module-level loggers are replaced by object-level loggers: there is a per-task logger (actually LoggerAdapter) in the TaskRecord, and logging to that automatically adds on relevant DFK and task metadata: at most log sites, the change to add that metadata is to switch from invoking methods on the module-level logger object to invoking them on the new task-level logger instead.h]hXEvery log message needs to be visited to add context. In many places a bunch of that context can be added by helpers: for example, in my prototype, some module-level loggers are replaced by object-level loggers: there is a per-task logger (actually LoggerAdapter) in the TaskRecord, and logging to that automatically adds on relevant DFK and task metadata: at most log sites, the change to add that metadata is to switch from invoking methods on the module-level logger object to invoking them on the new task-level logger instead.}(hj h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M@hj h&hubhx)}(hSome log lines bracket an operation, and to help with that, my prototype introduces a LexicalSpan context manager which can be used as part of a `with` block to identify the span of work starting and ending.h](hSome log lines bracket an operation, and to help with that, my prototype introduces a LexicalSpan context manager which can be used as part of a }(hj, h&hh'Nh)Nubj)}(h`with`h]hwith}(hj4 h&hh'Nh)Nubah}(h]h]h]h]h!]uh%jhj, ubh8 block to identify the span of work starting and ending.}(hj, h&hh'Nh)Nubeh}(h]h]h]h]h!]uh%hwh'h(h)MJhj h&hubhx)}(hMove away from forming ad-hoc string templates and make log calls look more machine-readable. 
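A minimal sketch of that per-task-logger pattern using only the standard library (names like ``make_task_logger`` are invented; the prototype's actual helpers may differ): a ``logging.LoggerAdapter`` carries DFK and task metadata, so call sites no longer substitute the task id into each message by hand.

```python
import logging

records = []  # captured LogRecords, standing in for a JSON-writing handler

class CaptureHandler(logging.Handler):
    def emit(self, record):
        records.append(record)

base = logging.getLogger("sketch.dataflow")
base.addHandler(CaptureHandler())
base.setLevel(logging.DEBUG)

def make_task_logger(logger, dfk_id, task_id):
    # LoggerAdapter merges this dict into every record it emits, so the
    # DFK/task context rides along with each message automatically.
    return logging.LoggerAdapter(logger, {"dfk_id": dfk_id, "task_id": task_id})

task_logger = make_task_logger(base, "dfk-2025-10-26", 358)
task_logger.info("initializing state to pending")
```

The call site stays a plain human-readable message, while the adapter attaches ``dfk_id`` and ``task_id`` as record attributes that a structured handler can emit as fields.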
This is somewhat stylistic: with task ID automatically logged, there is no need to substitute in task ID in some arbitrary subset of task-related logs.}(hjL h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)MOhj h&hubeh}(h]in-parsl-python-codingah]h]in-parsl python codingah]h!]uh%h*hh,h&hh'h(h)M;ubh+)}(hhh](h0)}(hAcknowledgementsh]hAcknowledgements}(hje h&hh'Nh)Nubah}(h]h]h]h]h!]uh%h/hjb h&hh'h(h)MVubhx)}(hchronolog: nishchay, innah]hchronolog: nishchay, inna}(hjs h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)MXhjb h&hubhx)}(h/desc: esp david adams, tom glanzman, jim chiangh]h/desc: esp david adams, tom glanzman, jim chiang}(hj h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)MZhjb h&hubhx)}(h uiuc: vedh]h uiuc: ved}(hj h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M\hjb h&hubhx)}(h gc: kevinh]h gc: kevin}(hj h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M^hjb h&hubhx)}(h academy: alokh]h academy: alok}(hj h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hwh'h(h)M`hjb h&hubeh}(h]acknowledgementsah]h]acknowledgementsah]h!]uh%h*hh,h&hh'h(h)MVubeh}(h]/parsl-wide-event-observability-prototype-reportah]h]/parsl wide event observability prototype reportah]h!]uh%h*hhh&hh'h(h)Kubh+)}(hhh](h0)}(hIndices and tablesh]hIndices and tables}(hj h&hh'Nh)Nubah}(h]h]h]h]h!]uh%h/hj h&hh'h(h)Mcubh)}(hhh](h)}(h:ref:`genindex`h]hx)}(hj h]h pending_xref)}(hj h]h inline)}(hj h]hgenindex}(hj h&hh'Nh)Nubah}(h]h](xrefstdstd-refeh]h]h!]uh%j hj ubah}(h]h]h]h]h!]refdochO refdomainj reftyperef refexplicitrefwarn reftargetgenindexuh%j h'h(h)Mehj ubah}(h]h]h]h]h!]uh%hwh'h(h)Mehj ubah}(h]h]h]h]h!]uh%hh'h(h)Mehj h&hubh)}(h:ref:`modindex`h]hx)}(hj h]j )}(hj h]j )}(hj h]hmodindex}(hj h&hh'Nh)Nubah}(h]h](j stdstd-refeh]h]h!]uh%j hj ubah}(h]h]h]h]h!]refdochO refdomainj) reftyperef refexplicitrefwarnj modindexuh%j h'h(h)Mfhj ubah}(h]h]h]h]h!]uh%hwh'h(h)Mfhj ubah}(h]h]h]h]h!]uh%hh'h(h)Mfhj h&hubh)}(h :ref:`search`h]hx)}(hjI h]j )}(hjI h]j )}(hjI h]hsearch}(hjQ h&hh'Nh)Nubah}(h]h](j stdstd-refeh]h]h!]uh%j hjN ubah}(h]h]h]h]h!]refdochO refdomainj[ reftyperef refexplicitrefwarnj 
searchuh%j h'h(h)MghjK ubah}(h]h]h]h]h!]uh%hwh'h(h)MghjG ubah}(h]h]h]h]h!]uh%hh'h(h)Mghj h&hubeh}(h]h]h]h]h!]jjuh%hh'h(h)Mehj h&hubeh}(h]indices-and-tablesah]h]indices and tablesah]h!]uh%h*hhh&hh'h(h)Mcubeh}(h]h]h]h]h!]sourceh(translation_progress}(totalK translatedKuuh%hcurrent_sourceN current_lineNsettingsdocutils.frontendValues)}(outputNh/N generatorN datestampN root_prefix/ source_linkN source_urlN toc_backlinksentryfootnote_backlinks sectnum_xformstrip_commentsNstrip_elements_with_classesN strip_classesN report_levelK halt_levelKexit_status_levelKdebugNwarning_streamN tracebackinput_encoding utf-8-siginput_encoding_error_handlerstrictoutput_encodingutf-8output_encoding_error_handlerj error_encodingutf-8error_encoding_error_handlerbackslashreplace language_codeenrecord_dependenciesNconfigN id_prefixhauto_id_prefixid dump_settingsNdump_internalsNdump_transformsNdump_pseudo_xmlNexpose_internalsNstrict_visitorN_disable_configN_sourceh( _destinationN _config_files]file_insertion_enabled raw_enabledKline_length_limitM'pep_referencesN pep_base_urlhttps://peps.python.org/pep_file_url_templatepep-%04drfc_referencesN rfc_base_url&https://datatracker.ietf.org/doc/html/ tab_widthKtrim_footnote_reference_spacesyntax_highlightlong smart_quotessmartquotes_locales]character_level_inline_markupdoctitle_xform docinfo_xformsectsubtitle_xform image_loadinglinkembed_stylesheetcloak_email_addressessection_self_linkenvNubreporterNindirect_targets]substitution_defs}substitution_names}refnames}refids}nameids}(j j hhhhjjjj|jjj<j9jjjjjjjj jnjkj jj>j;jejbjjjjjJjGj jj j j j j_ j\ j j j j u nametypes}(j hhjjjj<jjjjjnj j>jejjjJj j j j_ j j uh}(j h,hhfhhjhj|jjjj9jjj?jjjjj jjkjjjqj;j jbjAjjhjjjGjjjMj j j j j\ j j jb j j u footnote_refs} citation_refs} autofootnotes]autofootnote_refs]symbol_footnotes]symbol_footnote_refs] footnotes] citations]autofootnote_startKsymbol_footnote_startK id_counter collectionsCounter}Rparse_messages]transform_messages] transformerN include_log] decorationNh&hub.