.. Parsl wide event observability prototype report documentation master file,
   created by sphinx-quickstart on Sun Oct 26 10:13:01 2025. You can adapt
   this file completely to your liking, but it should at least contain the
   root `toctree` directive.

Parsl wide event observability prototype report
===============================================

.. toctree::
   :maxdepth: 2
   :caption: Contents:

Introduction
------------

These are notes about my current iteration of observability.

The Question
------------

Read this and give feedback on whether you think this should be the direction
Parsl goes.

Structure
---------

This report is in two related parts:

* emitting wide records from Parsl Python code
* ingesting and analysing records, including wide ones, from many sources

What is observability, in the Parsl context
-------------------------------------------

"Observability is a measure of how well internal states of a system can be
inferred from knowledge of its external outputs."
(https://en.wikipedia.org/wiki/Observability)

How can we tell what's going on inside a Parsl run from what we can see
externally? How can we expose more "stuff"?

That is an abstract notion, but in Parsl it is about logging and Parsl
monitoring - and this prototype is about doing both of those "better".

A motivating use case
---------------------

Here's a use case that is hard with what exists in master-branch Parsl right
now. I want to know, for a particular arbitrary task, the timings of the task
as it is submitted by the user workflow, flows through the DFK, into the htex
interchange and worker pool, executes on an htex worker, and flows back to
the user, with the timing of each step.

What exists in master Parsl right now is some information in monitoring, and
some information in log files. The monitoring information is focused on the
high-level task model, not on what is happening inside Parsl to run that
high-level model.
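The kind of per-task timeline this use case asks for can be sketched as a small merge-and-filter over parsed events. This is an illustration only: the event tuples and their fields here are invented for the sketch, not the prototype's actual data format.

```python
from datetime import datetime

# Illustrative parsed events: (timestamp, source file, task id, message).
# In the prototype these would be parsed out of several Parsl log files.
events = [
    (datetime(2025, 10, 26, 10, 29, 46, 467298), "parsl.log", 358,
     "will be sent to executor htex_Local"),
    (datetime(2025, 10, 26, 10, 29, 46, 470336), "interchange.log", 358,
     "putting onto pending_task_queue"),
    (datetime(2025, 10, 26, 10, 29, 46, 493948), "parsl.log", 358,
     "changing state from launched to exec_done"),
]

def timeline(events, task_id):
    """One task's events in time order, each with the delta from the previous step."""
    selected = sorted(e for e in events if e[2] == task_id)
    rows, prev = [], None
    for ts, source, _, message in selected:
        delta = (ts - prev).total_seconds() if prev else 0.0
        rows.append((ts, delta, source, message))
        prev = ts
    return rows

for ts, delta, source, message in timeline(events, 358):
    print(f"{ts} +{delta:.6f}s {message} ({source})")
```

The point is that once events from different files share timestamps and a task identifier, the "timing of each step" question becomes a trivial sort-and-diff.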
Logs as they exist now are extremely ad-hoc, spread around in at least four
different places, and poorly integrated: for example, log messages sometimes
do not contain context about which task they refer to, do not represent that
context uniformly (e.g. in a greppable way), and are ambiguous about context
(e.g. some places refer to task 1 meaning the DFK-level task 1, and some
places refer to task 1 meaning the HTEX-level task 1, which could be
something completely different).

As a contrast, an example output of this prototype (as of 2025-10-26) is:

.. code:: none

    === About task 358 ===
    2025-10-26 10:29:46.467298 MainThread@117098 Task 358: will be sent to executor htex_Local (parsl.log)
    2025-10-26 10:29:46.467412 MainThread@117098 Task 358: Adding output dependencies (parsl.log)
    2025-10-26 10:29:46.467484 MainThread@117098 Task 358: Added output dependencies (parsl.log)
    2025-10-26 10:29:46.467550 MainThread@117098 Task 358: Gathering dependencies: start (parsl.log)
    2025-10-26 10:29:46.467620 MainThread@117098 Task 358: Gathering dependencies: end (parsl.log)
    2025-10-26 10:29:46.467685 MainThread@117098 Task 358: submitted for App random_uuid, not waiting on any dependency (parsl.log)
    2025-10-26 10:29:46.467752 MainThread@117098 Task 358: has AppFuture: (parsl.log)
    2025-10-26 10:29:46.467818 MainThread@117098 Task 358: initializing state to pending (parsl.log)
    2025-10-26 10:29:46.469992 Task-Launch_0@117098 Task 358: changing state from pending to launched (parsl.log)
    2025-10-26 10:29:46.470113 Task-Launch_0@117098 Task 358: try 0 launched on executor htex_Local with executor id 340 (parsl.log)
    2025-10-26 10:29:46.470240 Task-Launch_0@117098 Task 358: Standard out will not be redirected. (parsl.log)
    2025-10-26 10:29:46.470310 Task-Launch_0@117098 Task 358: Standard error will not be redirected. (parsl.log)
    2025-10-26 10:29:46.470336 MainThread@117129 HTEX task 340: putting onto pending_task_queue (interchange log)
    2025-10-26 10:29:46.470404 MainThread@117129 HTEX task 340: fetched task (interchange log)
    2025-10-26 10:29:46.470815 Interchange-Communicator@117144 Putting HTEX task 340 into scheduler (Pool manager log)
    2025-10-26 10:29:46.471166 MainThread@117162 HTEX task 340: received executor task (Pool worker log)
    2025-10-26 10:29:46.492449 MainThread@117162 HTEX task 340: Completed task (Pool worker log)
    2025-10-26 10:29:46.492742 MainThread@117162 HTEX task 340: All processing finished for task (Pool worker log)
    2025-10-26 10:29:46.493508 MainThread@117129 HTEX task 340: Manager b'4f65802901c6': Removing task from manager (interchange log)
    2025-10-26 10:29:46.493948 HTEX-Result-Queue-Thread@117098 Task 358: changing state from launched to exec_done (parsl.log)
    2025-10-26 10:29:46.494729 HTEX-Result-Queue-Thread@117098 Task 358: Standard out will not be redirected. (parsl.log)
    2025-10-26 10:29:46.494905 HTEX-Result-Queue-Thread@117098 Task 358: Standard error will not be redirected. (parsl.log)

This integrates four log files and two task identifier systems into a single
sequence of events.

Case study 2
------------

Debugging monitoring performance as part of developing this prototype.

findcommon is a tool which finds the common task event sequence in templated
logs and outputs that sequence.

First run parsl-perf like this:

.. code:: none

    parsl-perf --config parsl/tests/configs/htex_local.py
    [...]
    ==== Iteration 3 ====
    Will run 58179 tasks to target 120 seconds runtime
    Submitting tasks / invoking apps
    All 58179 tasks submitted ... waiting for completion
    Submission took 103.880 seconds = 560.059 tasks/second
    Runtime: actual 137.225s vs target 120s
    Tasks per second: 423.967
    Tests complete - leaving DFK block

This executes a total of around 60,000 tasks.
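findcommon itself isn't listed in this report; the core of what it computes can be sketched as follows, assuming events have already been parsed into (task id, seconds, message template) tuples. The log parsing and message templating are elided, and this ignores any sequence-mining aspect: it just averages, per template, each template's offset from its task's first event, which is roughly the shape of the findcommon listings below.

```python
from collections import defaultdict

def findcommon(events):
    """Average, over all tasks, each message template's offset from that
    task's first event. Returns (mean offset, template) pairs sorted by
    mean offset.

    events: iterable of (task_id, t_seconds, template) tuples.
    """
    by_task = defaultdict(list)
    for task_id, t, template in events:
        by_task[task_id].append((t, template))

    offsets = defaultdict(list)
    for records in by_task.values():
        t0 = min(t for t, _ in records)  # each task's first event
        for t, template in records:
            offsets[template].append(t - t0)

    return sorted((sum(v) / len(v), tpl) for tpl, v in offsets.items())

# Two toy tasks following the same three-step sequence at different speeds:
events = [
    (1, 0.0, "Task %s: pending"), (1, 0.1, "Task %s: launched"), (1, 5.0, "Task %s: exec_done"),
    (2, 9.0, "Task %s: pending"), (2, 9.3, "Task %s: launched"), (2, 15.0, "Task %s: exec_done"),
]
ranking = findcommon(events)
```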
First, note that this prototype benchmarks on my laptop significantly slower
than the contemporaneous master branch, at . That's perhaps unsurprising:
this benchmark is incredibly log sensitive, as my previous posts have noted
(TODO: link to blog post and to R-performance report) - around 900 tasks per
second on a 120 second benchmark - and this prototype adds a lot of log
output. Part of the path to productionisation would be understanding and
constraining this.

From the output above, it is clear that the submission loop is taking a long
time: 100 seconds, with about 35 seconds of execution happening afterwards.
The Parsl core should be able to process task submissions much faster than
560 tasks per second. So what's taking up time there?

Run findcommon (a could-be-modular-but-isn't helper from this observability
prototype) on the result:

.. code:: none

    0.0: Task %s: will be sent to executor htex_local
    0.00023320618468031343: Task %s: Adding output dependencies
    0.0004515730863634116: Task %s: Added output dependencies
    0.000672943356177761: Task %s: Gathering dependencies: start
    0.0008952160973877195: Task %s: Gathering dependencies: end
    0.0011054732824941516: Task %s: submitted for App app, not waiting on any dependency
    0.001316777690507145: Task %s: has AppFuture: %s
    0.0015680651123983979: Task %s: initializing state to pending
    23.684763520758917: HTEX task %s: putting onto pending_task_queue
    23.68483662049256: HTEX task %s: fetched task
    23.684863335335613: Task %s: changing state from pending to launched
    23.6850573607536: Task %s: try %s launched on executor %s with executor id %s
    23.685248910492184: Task %s: Standard out will not be redirected.
    23.685424046734745: Task %s: Standard error will not be redirected.
    23.686276226995773: Putting HTEX task %s into scheduler
    23.686777094898495: HTEX task %s: received executor task
    23.687025900194147: HTEX task %s: Completed task
    23.687268549254735: HTEX task %s: All processing finished for task
    23.687837933843614: HTEX task %s: Manager %r: Removing task from manager
    23.688483699079185: Task %s: changing state from launched to exec_done

In this stylised synthetic task trace, a task takes an average of 23 seconds
to go from the first event (choosing an executor) to the final mark as done.
That's fairly consistent with the parsl-perf output: I would expect the
average here to be around half of parsl-perf's submission-to-completion time
(30 seconds).

What's useful about findcommon's output is that it shows the insides of
Parsl's workings in more depth: 20 states instead of parsl-perf's start,
submitted, end. And the potential exists to calculate other statistics on
these events.

So in this average case, there's something slow happening between setting
the task to pending, and the task then "simultaneously" being marked as
launched on the submit side and the interchange receiving it and placing it
in the pending task queue. That's a bit surprising - tasks are meant to
accumulate in the interchange, not before the interchange.

So let's perform some deeper investigations -- observability is for Serious
Investigators, and so it is fine to be hacking on the Parsl source code to
understand this more. (By hacking, I mean making temporary changes for the
investigation that will likely be thrown away rather than integrated into
master.)

Let's flesh out the whole submission process with some more log lines. On
the DFK side, that's pretty straightforward: the observability prototype has
a per-task logger which, if you have the task record, will attach log
messages to the task. For example, here are the changes to add a log around
the first call to launch_if_ready, which is probably the call that is
launching the task:

.. code:: none

    + task_logger.debug("TMP: dependencies added, calling launch_if_ready")
      self.launch_if_ready(task_record)
    + task_logger.debug("TMP: launch_if_ready returned")

My suspicion is that this is around the htex submission queues, with a
secondary submission around the launch executor, so to start with I'm going
to add more logging around that. Then rerun parsl-perf and findcommon,
without modifying either, and it turns out to be that secondary submission,
the launch executor:

.. code:: none

    0.0020453477688227: Task %s: TMP: submitted into launch pool executor
    0.002256870306434224: Task %s: TMP: launch_if_ready returned
    14.073021359217009: Task %s: TMP: before submitter lock
    [...]
    14.078550367412324: Task %s: changing state from launched to exec_done

Don't worry too much about the final time (14s) changing from 23s in the
earlier run -- that's a characteristic of parsl-perf batch sizes that I'm
working on in another branch.

If that's the case, I'd expect the thread pool executor, previously much
faster than htex, to show similar characteristics. Surprisingly, although
the throughput is not much higher, the trace looks very different timewise:
the bulk of the time still happens at the same place, but there isn't so
much waiting there - less than a second on average. That's possibly because
the executor can get through tasks much faster, so the queue doesn't build
up so much?

.. code:: none

    ==== Iteration 2 ====
    Will run 68976 tasks to target 120 seconds runtime
    Submitting tasks / invoking apps
    All 68976 tasks submitted ... waiting for completion
    Submission took 117.915 seconds = 584.965 tasks/second
    Runtime: actual 118.417s vs target 120s
    Tasks per second: 582.485

.. code:: none

    0.0: Task %s: will be sent to executor threads
    0.00014157412110423425: Task %s: Adding output dependencies
    0.0002898652725047201: Task %s: Added output dependencies
    0.000425118042214259: Task %s: Gathering dependencies: start
    0.0005696294991521399: Task %s: Gathering dependencies: end
    0.0006999648174108608: Task %s: submitted for App app, not waiting on any dependency
    0.0008433702196425292: Task %s: has AppFuture: %s
    0.0010710284919573986: Task %s: initializing state to pending
    0.0011652027385929428: Task %s: TMP: dependencies added, calling launch_if_ready
    0.0012973675719411494: Task %s: submitting into launch pool executor
    0.0014397921284467212: Task %s: submitted into launch pool executor
    0.0015767665501452072: Task %s: TMP: launch_if_ready returned
    0.3143575128217656: Task %s: before submitter lock
    0.31448896150771743: Task %s: after submitter lock, before executor.submit
    0.3146383380777917: Task %s: after before executor.submit
    0.3147926810507091: Task %s: changing state from pending to launched
    0.3149239369413048: Task %s: try 0 launched on executor threads
    0.31504996538376506: Task %s: Standard out will not be redirected.
    0.31504996538376506: Task %s: Standard out will not be redirected.
    0.3151759985402679: Task %s: Standard error will not be redirected.
    0.3151759985402679: Task %s: Standard error will not be redirected.
    0.315319734920821: Task %s: changing state from launched to exec_done

So maybe I can do some graphing of events to give more insight than these
averages are showing. A favourite of mine from previous monitoring work is
how many tasks are in each state at each moment in time. I'll have to
implement that for this observability prototype, because it's not done
already, but once done it should be reusable, and it should share most
infrastructure with `findcommon`. Especially relevant is discovering where
bottlenecks are: it looks like this is a Parsl-affecting performance
regression that might be keeping workers idle.
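The counting behind that tasks-per-state graph can be sketched from state-transition events. The event format here is invented for illustration; the prototype would source these from its "changing state from X to Y" log lines.

```python
from collections import Counter

def state_counts_over_time(transitions):
    """Given (t, task_id, old_state, new_state) transition events, yield a
    (t, {state: count}) snapshot after each transition, in time order.
    old_state is None when a task first appears."""
    counts = Counter()
    for t, task_id, old, new in sorted(transitions, key=lambda e: e[0]):
        if old is not None:
            counts[old] -= 1
        counts[new] += 1
        yield t, dict(+counts)  # unary + drops states whose count is zero

transitions = [
    (0.0, 1, None, "pending"),
    (0.1, 2, None, "pending"),
    (0.2, 1, "pending", "launched"),
    (0.5, 1, "launched", "exec_done"),
]
snapshots = list(state_counts_over_time(transitions))
# final snapshot: (0.5, {"pending": 1, "exec_done": 1})
```

Feeding the snapshot times and per-state counts into a plotting library gives the tasks-in-each-state-over-time view.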
For example, we could ask: does the interchange have "enough" tasks at all
times to keep dispatching? With 8 cores on my laptop, I'd like it to have at
least 8 or so tasks inside htex at any one time, but this looks like it
might not be true. Hopefully graphing will reveal more. It's also important
to note that this findcommon output shows latency, not throughput -- though
high latency at particular points is an indication of throughput problems.

Or, I can look at how many tasks are in the interchange over time: there
either is, or straightforwardly can be, a log line for that. That will fit a
different model to the above log lines, which are per-task. Instead it's a
metric on the state of one thing only: the interchange, of which there is
only one, at least for the purposes of this investigation.

Add a new log line like this into the interchange at a suitable point (after
task queueing, for example):

.. code:: none

    + ql = len(self.pending_task_queue)
    + logger.info(f"TMP: there are {ql} tasks in the pending task queue",
    +             extra={"metric": "pending_task_queue_length", "queued_tasks": ql})

Now one can either look through the logs by hand to see the value manually,
or extract it programmatically and plot it with matplotlib in an ad-hoc
script:

.. code:: python

    import matplotlib.pyplot as plt

    from parsl.observability.getlogs import getlogs

    logs = getlogs()

    # looking for these logs:
    #   "metric": "pending_task_queue_length", "queued_tasks": ql})
    metrics = [(float(l['created']), int(l['queued_tasks']))
               for l in logs
               if 'metric' in l and l['metric'] == "pending_task_queue_length"]

    plt.scatter(x=[m[0] for m in metrics], y=[m[1] for m in metrics])
    plt.show()

And indeed that shows that the interchange queue length almost never goes
above 1, and never above 10.
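That script relies on log records carrying the ``extra=`` fields in machine-readable form. A minimal JSON-lines formatter built on stock Python logging shows the idea; this is a sketch, not the prototype's JSONHandler, and the logger name is invented.

```python
import io
import json
import logging

class JsonLinesFormatter(logging.Formatter):
    """Emit one JSON object per log record, including any extra= fields."""

    # Attributes every LogRecord has; anything not in here arrived via extra=.
    _STANDARD = set(vars(logging.makeLogRecord({})))

    def format(self, record):
        out = {
            "created": record.created,
            "name": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for key, value in vars(record).items():
            if key not in self._STANDARD:
                out[key] = value  # e.g. "metric", "queued_tasks"
        return json.dumps(out)

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonLinesFormatter())
logger = logging.getLogger("interchange-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The same shape as the interchange log line suggested above:
ql = 3
logger.info("TMP: there are %s tasks in the pending task queue", ql,
            extra={"metric": "pending_task_queue_length", "queued_tasks": ql})

record = json.loads(stream.getvalue())
```

With records in this shape, a getlogs-style helper is little more than reading files and calling ``json.loads`` per line.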
That's enough for now, but it's a use case that shows partially
understanding throughput: we can see from this observability data that the
conceptual 50000-task queue that begins in parsl-perf as a `for`-loop
doesn't progress fast enough into the interchange's internal queue, and so
performance effort should probably be focused on understanding and improving
the code path around launch and getting into the interchange queue. With an
almost empty interchange queue, anything happening on the worker side is
probably not too relevant, at least for that parsl-perf use case. This
"understand the queue lengths (or implicit queue lengths) towards execution"
investigation style has been useful in understanding Parsl performance
limitations in the past.

Configurability
---------------

A soft start is to let people opt into observability-style logs: with most
of the performance hit coming from turning on JSON output, I think, adding
the extra stuff onto log calls doesn't matter too much performance-wise.

The current Parsl code is not set up for arbitrary log configuration outside
of the submit-side process: for example, the worker helpers don't do any log
config at all and rely on their enclosing per-executor environments to do
it, which I think some do not. The htex interchange and worker logs have a
hardcoded log config with a single debug boolean. I'd like to do something a
bit more flexible than adding more parameters, reflecting that in the future
people might want to configure their handlers differently rather than using
the JSONHandler.

Relation to monitoring and logging systems
------------------------------------------

Parsl uses Python's logging module. Parsl has its own monitoring system.
This work builds on both of those, but changes how they are used
significantly.

Why observability? vs monitoring
--------------------------------

The original monitoring prototype was focused on what is happening with
Parsl user-level concepts - tasks and blocks, for example - as they move
through simple states. Anything deeper is part of the idea that "Parsl makes
it so you don't have to think about anything happening inside". Which is not
how things are in reality: neither for code reliability nor for performance.

Parsl Monitoring is too strict in a couple of ways:

The data model is fairly hard-coded into the architecture: a specific SQL
schema, specific message formats, and specific places where those messages
are sent.

The transmission model is real-time. Even with recent radio plugins, the
assumption is still that messages will arrive soon after being sent. The
almost-real-time data transmission model is especially awkward when combined
with SQL: distributed system events will arrive at different times, or in
the original UDP model perhaps not at all, and the "first" message that
creates a task (for the purposes of the database) might arrive after some
secondary data that requires that primary key to exist. Yes, it's nice for
the SQL database to follow foreign key rules, especially when looking at the
data "afterwards", but that's not realistic for distributed, unreliable
events.

Contrast this to:

* arbitrary logs that may be different for different kinds of execution -
  for example, different executor implementations
* poring over these logs "later" - there's no need for those logs to
  accumulate in real time in one place for post-facto analysis. And in
  practice, when doing log analysis rather than monitoring analysis, "send
  me a tarball of your runinfo" is a standard technique.

Parsl Monitoring is not well suited to adding new ad-hoc events, perhaps
just for one-off debugging cases that will be thrown away: schema
modifications in both SQL and in message formats, and rearranging code to
accommodate that, is serious business.
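The contrast with the foreign-key problem can be made concrete: an append-only event store accepts events in any arrival order and reconstructs per-task structure only at query time. A toy sketch, with the event shape invented for illustration:

```python
events = []  # append-only: no schema enforcement, no foreign keys

def ingest(event):
    """Accept any event whenever it arrives - even before the event that
    'creates' the task it refers to."""
    events.append(event)

# Out-of-order arrival: the completion is seen before the creation.
# A foreign-key-constrained SQL insert would reject the first of these.
ingest({"t": 5.0, "task": 7, "what": "exec_done"})
ingest({"t": 0.0, "task": 7, "what": "created"})

def task_history(task_id):
    """Reconstruct one task's ordered history at query time, post facto."""
    return sorted((e for e in events if e.get("task") == task_id),
                  key=lambda e: e["t"])

history = [e["what"] for e in task_history(7)]  # ["created", "exec_done"]
```

Referential integrity, to the extent it exists at all, is established when the data is read rather than enforced when it is written.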
Contrast logging: there's always a logger in any part of Parsl, and you can
``logger.debug("hello")`` pretty much anywhere.

Parsl Monitoring was also implemented with a fixed queries/dashboard
mindset: one set of views that is expected to be sufficient. As time has
shown, people like to make other outputs from this data.

I want to debug/profile what's happening *inside Parsl* rather than *inside
the user workflow*.

Other components
----------------

Some components aren't Parsl-aware: for example, Work Queue has no notion of
a Parsl task ID, and it runs its own logging system, which is not Python and
so not amenable to Python monitoring radios.

Another example: swap out the interchange implementation for a different one
with a different internal model. A schema of events for task progress
through the original interchange doesn't necessarily work for some other
implementation.

ZMQ generates log messages which have been useful sometimes, and these could
be gatewayed into observability.

There are lots of different identifier spaces, loosely structured and not
necessarily hierarchical: for example, an htex task is not necessarily
"inside" a Parsl task, as htex can be used outside of a Parsl DFK (which is
where the notion of a Parsl task lives). An htex task often runs in a unix
process, but that process also runs many htex tasks, and an htex task also
has extent outside of that worker process: there's no containment
relationship either way.

Optional pieces of observability
--------------------------------

* log levels - INFO vs DEBUG
* missing log files - e.g. start with parsl.log, add in more files for more
  detail
* security - not so much the Parsl core use case, but e.g. GC executor vs GC
  endpoint logs vs GC central services have different security properties.

The observability approach needs to accommodate that, for any/all reasons,
some events won't be there. There can't be a "complete set of events" to
complain about being incomplete.
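One consequence is that analysis code has to treat every event as optional rather than failing on gaps. A small sketch of that defensive style, with the event shape invented for illustration:

```python
def span_duration(events, task_id, start_msg, end_msg):
    """Duration between two bracketing events for one task, or None when
    either event is absent - a missing event is expected, not an error."""
    times = {e["msg"]: e["t"] for e in events if e.get("task") == task_id}
    if start_msg in times and end_msg in times:
        return times[end_msg] - times[start_msg]
    return None

events = [
    {"t": 1.0, "task": 5, "msg": "launched"},
    # no exec_done for task 5: perhaps that log file was never collected
    {"t": 2.0, "task": 6, "msg": "launched"},
    {"t": 9.5, "task": 6, "msg": "exec_done"},
]

assert span_duration(events, 6, "launched", "exec_done") == 7.5
assert span_duration(events, 5, "launched", "exec_done") is None
```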
With less data, the reports, in whatever form, are less informative, to the
extent that the lack of data makes them so.

Scope for other hacking
-----------------------

* It should be easy to add other events - the core observability model
  shouldn't be prescriptive about what events exist or what they look like.
* It should be easy to use them in analysis.
* It should be easy to import some other event stream, in whatever format.
* Notions of moving logs around to a place of analysis should not be baked
  into the architecture. Realtime options, file-based options, ... - that is
  an area for experimentation (see Chronolog), and this work should
  facilitate that rather than being prescriptive.
* Storage and query of logs is also an area for experimentation. There are
  lots of hosted commercial services, and lots of small-scale stuff: e.g. at
  small enough scale, parsing logs into a Python session and using set and
  list comprehensions is a legitimate way to analyse things (rather than
  something awkwardly shameful that will be replaced by The Real Thing
  later) - especially given Parsl users' general exposure to data science in
  Python.
* Ignore Parsl Monitoring's per-task resource monitoring and do something
  else that generates similar observability records. There was always some
  disappointment around getting WQ resource monitoring into the Parsl
  monitoring database: what exists there that could be imported?

Inside the Python parts of Parsl, this data *is* available in realtime at
the point of logging, as it goes to whatever LogHandler is running in each
Python process. That isn't true in general on the "event model" side of
things, though.

Target audience
---------------

Serious debugger/profiler people, not management-dashboard types - although
management dashboards absolutely should be creatable with this observability
data.

Build your own stack
--------------------

Lots of observability commentary online talks as if you are building your
entire stack, to the extent that you care about observability.
Parsl is much more a pile of configurable components stuck together, all
with their own different options for observability/logging/monitoring, and
without an easy way for someone to add a consistent model throughout the
entire stack of code.

Algebra of merging logs
-----------------------

* widening
* joining
* post-facto relationship establishment
* relabelling

Look at relational algebra for phrasing and concepts.

There is a notion of identity and key-sequences: e.g.
parsl_dfk/parsl_task_id is a globally unique identifier for a Parsl task
across time and space, and so is parsl_dfk/executor_label/block_number or
parsl_dfk/executor_label/manager_id/worker_number -- although manager ID is
also (in short form) globally unique. This is distinct from the hierarchical
relations between entities - although hierarchical identity keys will often
line up with execution hierarchy. Peter Buneman's XML keys work did nested
sequences of keys for identifying XML fragments, c. year 2000.

Who wants this?
---------------

At least me. Many users don't come explicitly asking for monitoring-style
information, but do ask how to understand what's going on inside - and then
they are excited to use monitoring when it exists.

Browser UI
----------

What might a browser UI look like for this? Compare parsl-visualize. Compare
scrolling through logs, but with some more interactivity (e.g. click /
choose "show me logs from the same dfk/task_id").

Performance measurement of patch stack on 2025-10-27
----------------------------------------------------

.. code:: none

    pip install -e . && parsl-perf --config parsl/tests/configs/htex_local.py --iterate=1,1,1,10000

Running parsl-perf with constant block sizes (to avoid queue length speed
changes):

* master branch (165fdc5bf663ab7fd0d3ea7c2d8d177b02d731c5): 1139 tps
* more-task-tied-logs: 1024
* json-wide-log-records: 537 - but without initializing the JSONHandler: 1122
* end of branch with all changes up to now: 385

Applying this approach for Academy
----------------------------------

As an extreme case of "data might not be there" -- perhaps Parsl isn't there
at all. What do this code and these techniques look like applied to a
similar but very different codebase, Academy, which doesn't have any
distributed monitoring at all at the moment? There are ~100 log lines in the
Academy codebase right now. How much can be converted in a few hours, and
then analysed in similar ways? The point here is both considering this as a
real logging direction for Academy, and treating it as a proof of generality
beyond Parsl.

Thoughts:

repr-of-ID-object might not be the correct format for logging: I want stuff
that is nice strings for values, but repr (although it is a string) is
designed to look like a Python code fragment rather than to be the core
value of an object. Maybe ``str`` is better, and maybe some other way of
representing the ID is better still? The point is to have values that work
well in aggregate, database-style analysis, not values that are easy on the
human eye.

Academy logging so far has focused on looking pretty on the console: e.g.
ANSI colour - that's at the opposite end of the spectrum from what this
observability project is trying to log.

A rule of thumb for initial conversion: whatever is substituted into the
human message should also be added as an extras field.

Other components this might be applied to: Globus Compute, and various
related workflow systems that sit on top of Parsl. Parsl itself contains two
"mini-workflow-systems" on top of core Parsl: parsl-perf and the pytest
tests.
It could be interesting to illustrate how those fit in without being a core
part of Parsl observability.

See also
--------

* netlogger
* my dnpc work, an earlier iteration of this - more focused on human log
  parsing, and so very fragile in the face of improving log messages, with
  not enough context in the human component
* chronolog - https://grc.iit.edu/research/projects/chronolog/
* syslog

In-parsl Python coding
----------------------

Acknowledging observability as a first-order feature means we can make big
changes to code. Every log message needs to be visited to add context. In
many places a bunch of that context can be added by helpers: for example, in
my prototype, some module-level loggers are replaced by object-level
loggers. There is a per-task logger (actually a LoggerAdapter) in the
TaskRecord, and logging to that automatically adds relevant DFK and task
metadata: at most log sites, the change to add that metadata is to switch
from invoking methods on the module-level logger object to invoking them on
the new task-level logger instead.

Some log lines bracket an operation, and to help with that, my prototype
introduces a LexicalSpan context manager which can be used as part of a
`with` block to identify the span of work starting and ending.

Move away from forming ad-hoc string templates and make log calls look more
machine-readable. This is somewhat stylistic: with the task ID automatically
logged, there is no need to substitute the task ID into some arbitrary
subset of task-related logs.

Acknowledgements
----------------

* chronolog: nishchay, inna
* desc: esp david adams, tom glanzman, jim chiang
* uiuc: ved
* gc: kevin
* academy: alok

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`