Parsl wide event observability prototype report

Introduction

These are notes about my current iteration of observability.

The Question

Read this and give your feedback on whether you think this should be the direction Parsl goes.

Structure

This work is in two related parts:

  • emitting wide records from Parsl Python code

  • ingesting and analysing records, including wide ones, from many sources

What is observability, in the Parsl context

“Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.” (https://en.wikipedia.org/wiki/Observability)

How can we tell what's going on inside a Parsl run by what we can see externally? How can we expose more “stuff”?

This is an abstract notion, but in Parsl it is about logging and Parsl monitoring.

This prototype is about doing that “better”.

A motivating use case

Here’s a use case that is hard with what exists in master-branch Parsl right now.

I want to know, for a particular arbitrary task, the timings of the task as it is submitted by the user workflow, flows through the DFK, into the htex interchange, worker pool, executes on an htex worker, and flows back to the user, with the timing of each step.

What exists in master Parsl right now is some information in monitoring, and some information in log files. The monitoring information is focused on the high level task model, not on what is happening inside Parsl to run that high level model. Logs as they exist now are extremely ad-hoc, spread around at least 4 different places, and poorly integrated: for example, log messages sometimes do not contain context about which task they refer to, do not represent that context uniformly (e.g. in a greppable way), and are ambiguous about context (e.g. some places refer to task 1 meaning the DFK-level task 1, while other places refer to task 1 meaning the HTEX-level task 1, which could be something completely different).

As a contrast, an example output of this prototype (as of 2025-10-26) is:

=== About task 358 ===
2025-10-26 10:29:46.467298 MainThread@117098 Task 358: will be sent to executor htex_Local (parsl.log)
2025-10-26 10:29:46.467412 MainThread@117098 Task 358: Adding output dependencies (parsl.log)
2025-10-26 10:29:46.467484 MainThread@117098 Task 358: Added output dependencies (parsl.log)
2025-10-26 10:29:46.467550 MainThread@117098 Task 358: Gathering dependencies: start (parsl.log)
2025-10-26 10:29:46.467620 MainThread@117098 Task 358: Gathering dependencies: end (parsl.log)
2025-10-26 10:29:46.467685 MainThread@117098 Task 358: submitted for App random_uuid, not waiting on any dependency (parsl.log)
2025-10-26 10:29:46.467752 MainThread@117098 Task 358: has AppFuture: <AppFuture at 0x7f8bc1aed730 state=pending> (parsl.log)
2025-10-26 10:29:46.467818 MainThread@117098 Task 358: initializing state to pending (parsl.log)
2025-10-26 10:29:46.469992 Task-Launch_0@117098 Task 358: changing state from pending to launched (parsl.log)
2025-10-26 10:29:46.470113 Task-Launch_0@117098 Task 358: try 0 launched on executor htex_Local with executor id 340 (parsl.log)
2025-10-26 10:29:46.470240 Task-Launch_0@117098 Task 358: Standard out will not be redirected. (parsl.log)
2025-10-26 10:29:46.470310 Task-Launch_0@117098 Task 358: Standard error will not be redirected. (parsl.log)
2025-10-26 10:29:46.470336 MainThread@117129 HTEX task 340: putting onto pending_task_queue (interchange log)
2025-10-26 10:29:46.470404 MainThread@117129 HTEX task 340: fetched task (interchange log)
2025-10-26 10:29:46.470815 Interchange-Communicator@117144 Putting HTEX task 340 into scheduler (Pool manager log)
2025-10-26 10:29:46.471166 MainThread@117162 HTEX task 340: received executor task (Pool worker log)
2025-10-26 10:29:46.492449 MainThread@117162 HTEX task 340: Completed task (Pool worker log)
2025-10-26 10:29:46.492742 MainThread@117162 HTEX task 340: All processing finished for task (Pool worker log)
2025-10-26 10:29:46.493508 MainThread@117129 HTEX task 340: Manager b'4f65802901c6': Removing task from manager (interchange log)
2025-10-26 10:29:46.493948 HTEX-Result-Queue-Thread@117098 Task 358: changing state from launched to exec_done (parsl.log)
2025-10-26 10:29:46.494729 HTEX-Result-Queue-Thread@117098 Task 358: Standard out will not be redirected. (parsl.log)
2025-10-26 10:29:46.494905 HTEX-Result-Queue-Thread@117098 Task 358: Standard error will not be redirected. (parsl.log)

This integrates four log files and two task identifier systems into a single sequence of events.
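The merge itself can be sketched in a few lines of Python: gather events for one task from each identifier system and sort the combined list by timestamp. The record fields here (created, task_id, executor_id, msg) are invented for illustration and are not the prototype's real schema.

```python
# Hypothetical sketch of merging per-source event lists into one
# timeline for a single task. Field names are invented, not the
# prototype's actual log record schema.

def merge_task_timeline(parsl_events, htex_events, task_id, executor_id):
    """Collect events for one task across two identifier systems
    (DFK task id and HTEX executor task id), sorted by timestamp."""
    combined = [e for e in parsl_events if e.get("task_id") == task_id]
    combined += [e for e in htex_events if e.get("executor_id") == executor_id]
    return sorted(combined, key=lambda e: e["created"])

parsl_events = [
    {"created": 1.0, "task_id": 358, "msg": "initializing state to pending"},
    {"created": 4.0, "task_id": 358, "msg": "changing state to exec_done"},
]
htex_events = [
    {"created": 2.0, "executor_id": 340, "msg": "putting onto pending_task_queue"},
    {"created": 3.0, "executor_id": 340, "msg": "Completed task"},
]

timeline = merge_task_timeline(parsl_events, htex_events, 358, 340)
print([e["msg"] for e in timeline])
```

The real prototype additionally needs the mapping between DFK task id 358 and HTEX task id 340, which comes from the “launched on executor ... with executor id ...” log line.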

Case study 2

Debugging monitoring performance as part of developing this prototype.

The findcommon tool finds the common sequence of templated log messages across tasks and outputs that sequence, like this:

First run parsl-perf like this:

parsl-perf --config parsl/tests/configs/htex_local.py

[...]

==== Iteration 3 ====
Will run 58179 tasks to target 120 seconds runtime
Submitting tasks / invoking apps
All 58179 tasks submitted ... waiting for completion
Submission took 103.880 seconds = 560.059 tasks/second
Runtime: actual 137.225s vs target 120s
Tasks per second: 423.967
Tests complete - leaving DFK block

which executes a total of around 60,000 tasks.

First, note that this prototype benchmarks significantly slower on my laptop than the contemporaneous master branch.

That’s perhaps unsurprising: this benchmark is incredibly log-sensitive, as my previous posts have noted (TODO: link to blog post and to R-performance report) - around 900 tasks per second on a 120 second benchmark. And this prototype adds a lot of log output. Part of the path to productionisation would be understanding and constraining this.

From that output above, it is clear that the submission loop is taking a long time: around 100 seconds, with about 35 seconds of execution happening afterwards. The Parsl core should be able to process task submissions much faster than 560 tasks per second. So what’s taking up the time there?

Run findcommon (a could-be-modular-but-isn’t helper from this observability prototype) on the result:

0.0: Task %s: will be sent to executor htex_local
0.00023320618468031343: Task %s: Adding output dependencies
0.0004515730863634116: Task %s: Added output dependencies
0.000672943356177761: Task %s: Gathering dependencies: start
0.0008952160973877195: Task %s: Gathering dependencies: end
0.0011054732824941516: Task %s: submitted for App app, not waiting on any dependency
0.001316777690507145: Task %s: has AppFuture: %s
0.0015680651123983979: Task %s: initializing state to pending
23.684763520758917: HTEX task %s: putting onto pending_task_queue
23.68483662049256: HTEX task %s: fetched task
23.684863335335613: Task %s: changing state from pending to launched
23.6850573607536: Task %s: try %s launched on executor %s with executor id %s
23.685248910492184: Task %s: Standard out will not be redirected.
23.685424046734745: Task %s: Standard error will not be redirected.
23.686276226995773: Putting HTEX task %s into scheduler
23.686777094898495: HTEX task %s: received executor task
23.687025900194147: HTEX task %s: Completed task
23.687268549254735: HTEX task %s: All processing finished for task
23.687837933843614: HTEX task %s: Manager %r: Removing task from manager
23.688483699079185: Task %s: changing state from launched to exec_done

In this stylised synthetic task trace, a task takes an average of 23 seconds to go from the first event (choosing an executor) to the final mark as done. That’s fairly consistent with the parsl-perf output: I would expect the average here to be around half of the time from parsl-perf’s submission to completion (30 seconds).

What’s useful about findcommon’s output is that it shows the inside of Parsl’s workings in more depth: 20 states instead of parsl-perf’s start, submitted and end. And the potential exists to calculate other statistics on these events.
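The core of a findcommon-style computation can be sketched like this: group events by task, take each event's offset from its task's first event, and average those offsets per message template. This is a guess at the shape of the computation, not findcommon's actual code.

```python
from collections import defaultdict
from statistics import mean

def mean_offsets(events):
    """events: iterable of (task_id, timestamp, template).
    Returns the mean offset of each template from its task's
    first event, mirroring findcommon's per-template averages."""
    by_task = defaultdict(list)
    for task_id, ts, template in events:
        by_task[task_id].append((ts, template))

    offsets = defaultdict(list)
    for recs in by_task.values():
        t0 = min(ts for ts, _ in recs)  # this task's first event
        for ts, template in recs:
            offsets[template].append(ts - t0)

    return {tpl: mean(vals) for tpl, vals in offsets.items()}

events = [
    (1, 0.0, "Task %s: pending"), (1, 10.0, "Task %s: launched"),
    (2, 5.0, "Task %s: pending"), (2, 25.0, "Task %s: launched"),
]
print(mean_offsets(events))  # "launched" averages (10 + 20) / 2 = 15.0
```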

So in this average case, there’s something slow happening between setting the task to pending, and then the task “simultaneously” being marked as launched on the submit side and the interchange receiving it and placing it in the pending task queue.

That’s a bit surprising - tasks are meant to accumulate in the interchange, not before the interchange.

So let’s perform some deeper investigations – observability is for Serious Investigators and so it is fine to be hacking on the Parsl source code to understand this more. (by hacking, I mean making temporary changes for the investigation that likely will be thrown away rather than integrated into master).

Let’s flesh out the whole submission process with some more log lines. On the DFK side, that’s pretty straightforward: the observability prototype has a per-task logger which, if you have the task record, will attach log messages to the task.

For example, here are the changes to add a log around the first call to launch_if_ready, which is probably the call that is launching the task.

+  task_logger.debug("TMP: dependencies added, calling launch_if_ready")
   self.launch_if_ready(task_record)
+  task_logger.debug("TMP: launch_if_ready returned")

My suspicion is that this is around the htex submission queues, with a secondary submission around the launch executor, so to start with I’m going to add more logging around that.

Then rerun parsl-perf and findcommon, without modifying either, and it turns out to be that secondary submission, the launch executor:

0.0020453477688227: Task %s: TMP: submitted into launch pool executor
0.002256870306434224: Task %s: TMP: launch_if_ready returned
14.073021359217009: Task %s: TMP: before submitter lock
[...]
14.078550367412324: Task %s: changing state from launched to exec_done

Don’t worry too much about the final time (14s) changing from 23s in the earlier run – that’s a characteristic of parsl-perf batch sizes that I’m working on in another branch.

If that’s the case, I’d expect the thread pool executor, previously much faster than htex, to show similar characteristics:

Surprisingly, although the throughput is not much higher, the trace looks very different timewise: the bulk of the time still happens at the same place, but there isn’t so much waiting there - less than a second on average. That’s possibly because the executor can get through tasks much faster, so the queue doesn’t build up so much?

==== Iteration 2 ====
Will run 68976 tasks to target 120 seconds runtime
Submitting tasks / invoking apps
All 68976 tasks submitted ... waiting for completion
Submission took 117.915 seconds = 584.965 tasks/second
Runtime: actual 118.417s vs target 120s
Tasks per second: 582.485
0.0: Task %s: will be sent to executor threads
0.00014157412110423425: Task %s: Adding output dependencies
0.0002898652725047201: Task %s: Added output dependencies
0.000425118042214259: Task %s: Gathering dependencies: start
0.0005696294991521399: Task %s: Gathering dependencies: end
0.0006999648174108608: Task %s: submitted for App app, not waiting on any dependency
0.0008433702196425292: Task %s: has AppFuture: %s
0.0010710284919573986: Task %s: initializing state to pending
0.0011652027385929428: Task %s: TMP: dependencies added, calling launch_if_ready
0.0012973675719411494: Task %s: submitting into launch pool executor
0.0014397921284467212: Task %s: submitted into launch pool executor
0.0015767665501452072: Task %s: TMP: launch_if_ready returned
0.3143575128217656: Task %s: before submitter lock
0.31448896150771743: Task %s: after submitter lock, before executor.submit
0.3146383380777917: Task %s: after before executor.submit
0.3147926810507091: Task %s: changing state from pending to launched
0.3149239369413048: Task %s: try 0 launched on executor threads
0.31504996538376506: Task %s: Standard out will not be redirected.
0.31504996538376506: Task %s: Standard out will not be redirected.
0.3151759985402679: Task %s: Standard error will not be redirected.
0.3151759985402679: Task %s: Standard error will not be redirected.
0.315319734920821: Task %s: changing state from launched to exec_done

So maybe I can do some graphing of events to give more insight than these averages are showing. A favourite of mine from previous monitoring work is how many tasks are in each state at each moment in time. I’ll have to implement that for this observability prototype, because it isn’t done already, but once done it should be reusable, and it should share most of its infrastructure with findcommon. Especially relevant is discovering where bottlenecks are: this looks like a Parsl-affecting performance regression that might be keeping workers idle. For example, we could ask: does the interchange have “enough” tasks at all times to keep dispatching? With 8 cores on my laptop, I’d like there to be at least 8 or so tasks inside htex at any one time, but this looks like it might not be true. Hopefully graphing will reveal more. It’s also important to note that this findcommon output shows latency, not throughput - though high latency at particular points is an indication of throughput problems.
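A minimal version of that tasks-per-state-over-time computation, working from state-change events alone (this is a sketch of what such a reusable helper might compute, not existing prototype code):

```python
from collections import Counter

def state_occupancy(changes):
    """changes: iterable of (timestamp, task_id, new_state).
    Returns a (timestamp, {state: task_count}) snapshot after
    each state change, suitable for plotting over time."""
    current = {}  # task_id -> latest known state
    snapshots = []
    for ts, task_id, new_state in sorted(changes):
        current[task_id] = new_state
        snapshots.append((ts, dict(Counter(current.values()))))
    return snapshots

changes = [
    (0.0, 1, "pending"), (0.1, 2, "pending"),
    (1.0, 1, "launched"), (2.0, 1, "exec_done"),
]
for ts, counts in state_occupancy(changes):
    print(ts, counts)
```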

Or, I can look at how many tasks are in the interchange over time: there either is, or straightforwardly can be, a log line for that. That fits a different model to the above per-task log lines: instead, it is a metric on the state of one thing only, the interchange, of which there is only one, at least for the purposes of this investigation.

Add a new log line like this into the interchange at a suitable point (after task queueing, for example):

+  ql = len(self.pending_task_queue)
+  logger.info(f"TMP: there are {ql} tasks in the pending task queue", extra={"metric": "pending_task_queue_length", "queued_tasks": ql})

Now I can either look through the logs by hand to see the value, or extract it programmatically and plot it with matplotlib in an ad-hoc script:

import matplotlib.pyplot as plt
from parsl.observability.getlogs import getlogs

logs = getlogs()

# looking for log records containing:
#   "metric": "pending_task_queue_length", "queued_tasks": ql

metrics = [(float(l['created']), int(l['queued_tasks']))
           for l in logs
           if  'metric' in l
           and l['metric'] == "pending_task_queue_length"
          ]


plt.scatter(x=[m[0] for m in metrics],
            y=[m[1] for m in metrics])

plt.show()

and indeed that shows that the interchange queue length almost never goes above length 1, and never above length 10.

That’s enough for now, but it’s a use case that shows partially understanding throughput: we can see from this observability data that the conceptual 50000-task queue that begins in parsl-perf as a for-loop doesn’t progress fast enough to the interchange’s internal queue, and so performance effort should probably be focused on understanding and improving the code path around launch and getting into the interchange queue. With an almost-empty interchange queue, anything happening on the worker side is probably not too relevant, at least for this parsl-perf use case.

This “understand the queue lengths (or implicit queue lengths) towards execution” investigation style has been useful in understanding Parsl performance limitations in the past.

Configurability

A soft start is to let people opt into observability-style logs. With most of the performance hit coming from turning on JSON output, I think, it doesn’t matter too much performance-wise to add the extra structured fields to log calls.

The current Parsl code is not set up for arbitrary log configuration outside of the submit-side process: for example, the worker helpers don’t do any log config at all and rely on their enclosing per-executor environments to do it, which I think some do not.

The htex interchange and worker logs have a hardcoded log config with a single debug boolean.

I’d like to do something a bit more flexible than adding more parameters, reflecting that in the future people might want to configure their handlers differently rather than using the JSONHandler.
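One hedged sketch of that flexibility: instead of a debug boolean, a component could accept a handler factory, so users plug in their own handler (JSON or otherwise) without Parsl growing a parameter per formatting choice. The parameter name log_handler_factory and the class here are invented for illustration, not real Parsl API.

```python
import logging

# Invented sketch (not a real Parsl parameter): a component takes a
# handler factory instead of a hardcoded debug boolean, so users can
# plug in a JSON handler, or anything else.

def default_handler_factory():
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(message)s"))
    return handler

class InterchangeSketch:
    def __init__(self, log_handler_factory=default_handler_factory):
        self.logger = logging.getLogger("interchange_sketch")
        self.logger.setLevel(logging.DEBUG)
        self.logger.addHandler(log_handler_factory())

ic = InterchangeSketch()
ic.logger.info("interchange sketch started")
```

A user wanting JSON output would pass a factory returning their JSON-formatting handler, with no other parameter changes.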

Relation to monitoring and logging systems

Parsl uses Python’s logging module. Parsl has its own monitoring system. This work builds on both of those, but significantly changes how they are used.

Why observability vs monitoring?

The original monitoring prototype was focused on what is happening with Parsl user-level concepts: tasks and blocks, for example, as they move through simple states. Anything deeper is part of the idea that “Parsl makes it so you don’t have to think about anything happening inside”. That is not how things are in reality: neither for code reliability nor for performance.

Parsl Monitoring is too strict in a couple of ways:

The data model is fairly hard-coded into the architecture: a specific SQL schema, specific message formats and specific places where those messages are sent.

The transmission model is real-time. Even with recent radio plugins, the assumption is still that messages will arrive soon after being sent.

The almost-real-time data transmission model is especially awkward when combined with SQL: distributed system events will arrive at different times, or in the original UDP model perhaps not at all, and the “first” message that creates a task (for the purposes of the database) might arrive after some secondary data that requires that primary key to exist. Yes, it’s nice for the SQL database to follow foreign key rules, especially when looking at the data “afterwards”, but that’s not realistic for distributed, unreliable events.

Contrast this to:

  • arbitrary logs that may be different for different kinds of execution - for example, different executor implementations

  • poring over these logs “later” - there’s no need for those logs to accumulate in real time in one place for post-facto analysis. And in practice, when doing log analysis rather than monitoring analysis, “send me a tarball of your runinfo” is a standard technique.

Parsl Monitoring is not well suited to adding new ad-hoc events, perhaps just for one-off debugging cases that will be thrown away: schema modifications in both the SQL and the message formats, and rearranging code to accommodate that, is a serious business. Contrast logging: there’s always a logger in any part of Parsl, and you can logger.debug("hello") pretty much anywhere.

Parsl Monitoring was also implemented with a fixed queries / dashboard mindset: one set of views that is expected to be sufficient. As time has shown, people like to make other outputs from this data.

We want to debug/profile what’s happening inside Parsl, rather than inside the user workflow.

Other components

Some components aren’t Parsl-aware: for example, Work Queue has no notion of a Parsl task ID, and it runs its own logging system, which is not Python and so not amenable to Python monitoring radios.

Another example: swapping out the interchange implementation for a different one with a different internal model - a schema of events for task progress through the original interchange doesn’t necessarily work for some other implementation.

ZMQ generates log messages which have been useful sometimes and these could be gatewayed into observability.

Lots of different identifier spaces, loosely structured, not necessarily hierarchical: for example, an htex task is not necessarily “inside” a Parsl task, as htex can be used outside of a Parsl DFK (which is where the notion of Parsl task lives). An htex task often runs in a unix process but that process also runs many htex tasks, and an htex task also has extent outside of that worker process: there’s no containment relationship either way.

Optional pieces of observability

log levels - INFO vs DEBUG

missing log files - eg. start with parsl.log, add in more files for more detail

security - not so much the Parsl core use case, but eg GC executor vs GC endpoint logs vs GC central services have different security properties.

The observability approach needs to accommodate that, for any or all of these reasons, some events won’t be there. There can’t be a “complete set of events” to measure incompleteness against.

With less data, the reports (in whatever form) are less informative, to the extent that the lack of data makes them so.

Scope for other hacking

should be easy to add other events - the core observability model shouldn’t be prescriptive about what events exist or what they look like

should be easy to use them in analysis

should be easy to import some other event stream, in whatever format

notions of moving logs around to a place of analysis should not be baked into the architecture: realtime options, file-based options, … - that is an area for experimentation (see Chronolog) and this work should facilitate that rather than being prescriptive

Storage and query of logs is also an area for experimentation: there are lots of hosted commercial services, and lots of small-scale options:

e.g. at a small enough scale, parsing logs into a Python session and using set and list comprehensions is a legitimate way to analyse things (rather than something awkwardly shameful that will be replaced by The Real Thing later) - especially given Parsl users’ general exposure to data science in Python.
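For example, such a session-level analysis with comprehensions might look like this (the record fields are assumptions about a parsed JSON-lines log format, not the prototype's exact schema):

```python
# A small-scale analysis session over parsed log records, using set
# comprehensions. Field names are assumptions, not the real schema.
logs = [
    {"task_id": 1, "msg": "changing state from pending to launched", "created": 1.0},
    {"task_id": 1, "msg": "changing state from launched to exec_done", "created": 3.5},
    {"task_id": 2, "msg": "changing state from pending to launched", "created": 2.0},
]

launched = {l["task_id"] for l in logs if "to launched" in l["msg"]}
done = {l["task_id"] for l in logs if "to exec_done" in l["msg"]}

# tasks that launched but never finished:
stuck = launched - done
print(stuck)  # {2}
```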

Ignore Parsl Monitoring’s per-task resource monitoring and do something else that generates similar observability records. Getting WQ resource monitoring into the Parsl monitoring database was always somewhat of a disappointment: what exists there that could be imported?

Inside the Python parts of Parsl, this data is available in realtime at the point of logging, as it goes to whatever LogHandler is running in each Python process. That isn’t true in general on the “event model” side of things, though.

Target audience

serious debugger/profiler people

not management-dashboard types - although management dashboards absolutely should be creatable with this observability data.

Build your own stack

Lots of observability commentary online talks as if you are building your entire stack, to the extent that you care about observability. Parsl is much more a pile of configurable components stuck together, each with its own different options for observability/logging/monitoring, and without an easy way for someone to add a consistent model throughout the entire stack of code.

Algebra of merging logs

widening

joining

post-facto relationship establishment

relabelling

look at relational algebra for phrasing and concepts

notion of identity and key-sequences: e.g. parsl_dfk/parsl_task_id is a globally unique identifier for a Parsl task across time and space, and so is parsl_dfk/executor_label/block_number or parsl_dfk/executor_label/manager_id/worker_number – although manager ID is also (in short form) globally unique. This is distinct from the hierarchical relations between entities - although hierarchical identity keys will often line up with execution hierarchy.
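Key-sequences fall out naturally as plain Python tuples: each component is scoped by its parents, so the tuple as a whole is globally unique even when the final component (a task or worker number) alone is not. The identifier values below are invented for illustration:

```python
# Key-sequences as tuples: parent components scope the final one, so
# the whole tuple is globally unique. Values here are invented.
task_key = ("dfk-2025-10-26", "task", 358)
block_key = ("dfk-2025-10-26", "htex_Local", "block", 0)
worker_key = ("dfk-2025-10-26", "htex_Local", "manager-4f65", "worker", 3)

# tuples are hashable, so they can key dicts and sets directly:
state = {task_key: "exec_done"}
print(state[("dfk-2025-10-26", "task", 358)])  # exec_done
```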

Peter Buneman’s XML keys work did nested sequences of keys for identifying XML fragments, c. 2000.

Who wants this?

At least me.

Many users don’t come explicitly asking for monitoring-style information, but do ask how to understand what’s going on inside. And then they are excited to use monitoring when it exists.

Browser UI

What might a browser UI look like for this?

Compare parsl-visualize. Compare scrolling through logs, but with some more interactivity (e.g. click / choose “show me logs from the same dfk/task_id”).

Performance measurement of patch stack on 2025-10-27

pip install -e . && parsl-perf --config parsl/tests/configs/htex_local.py --iterate=1,1,1,10000

Running parsl-perf with constant block sizes (to avoid queue length speed changes):

master branch (165fdc5bf663ab7fd0d3ea7c2d8d177b02d731c5): 1139 tps

more-task-tied-logs: 1024 tps

json-wide-log-records: 537 tps
  • but without initializing the JSONHandler: 1122 tps

end of branch with all changes up to now: 385 tps

Applying this approach for academy

As an extreme “data might not be there” case - perhaps Parsl isn’t there at all. What do this code and these techniques look like applied to a similar but very different codebase, Academy, which doesn’t have any distributed monitoring at all at the moment? There are ~100 log lines in the Academy codebase right now. How much can be converted in a few hours, and then analysed in similar ways?

The point here is both to consider this as a real logging direction for Academy, and as a proof of generality beyond Parsl.

thoughts:

repr-of-ID-object might not be the correct format for logging: I want values that are nice strings, but repr (although it produces a string) is designed to look like a Python code fragment rather than to give the core value of an object. Maybe str is better, and maybe some other way of representing the ID is better? The point is to have values that work well in aggregate, database-style analysis, not values that are easy on the human eye.

Academy logging so far has focused on looking pretty on the console, e.g. ANSI colour - that’s at the opposite end of the spectrum from what this observability project is trying to log.

Rule of thumb for initial conversion: whatever is substituted into the human message should also be added as an extras field.
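Applied to a hypothetical Academy-style log line, that rule looks like this (the field name agent_id is an invented example, not Academy's real schema):

```python
import logging

# The rule of thumb applied to a hypothetical log line; the field
# name "agent_id" is an invented example.
logger = logging.getLogger("academy_sketch")
agent_id = "agent-7"

# before: the value only appears interpolated into the human message
logger.info("agent %s started", agent_id)

# after: the same value also lands in an extras field, so a
# structured handler can index and filter on it
logger.info("agent %s started", agent_id, extra={"agent_id": agent_id})
```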

other components this might be applied to: Globus Compute. Various related workflow systems that sit on top of Parsl.

Parsl contains two “mini-workflow-systems” on top of core Parsl: parsl-perf and pytest tests. It could be interesting to illustrate how those fit in without being a core part of Parsl observability.

See also

netlogger

my dnpc work, an earlier iteration of this: more focused on parsing human-oriented log messages, and so very fragile in the face of improving log messages, with not enough context in the human-oriented component.

chronolog https://grc.iit.edu/research/projects/chronolog/

syslog

In-parsl Python coding

Acknowledging observability as a first-order feature means we can make big changes to code.

Every log message needs to be visited to add context. In many places much of that context can be added by helpers: for example, in my prototype, some module-level loggers are replaced by object-level loggers. There is a per-task logger (actually a LoggerAdapter) in the TaskRecord, and logging to it automatically adds the relevant DFK and task metadata: at most log sites, the change is to switch from invoking methods on the module-level logger object to invoking them on the new task-level logger instead.
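A minimal sketch of such a per-task logger, using the standard library's LoggerAdapter; this illustrates the shape of the change, not the prototype's exact code, and the names here are invented:

```python
import logging

# Sketch of a per-task logger via LoggerAdapter: context is attached
# once, and every later log call carries it automatically.
base = logging.getLogger("dataflow_sketch")

def make_task_logger(task_id, dfk_id):
    # every record logged through the adapter carries these fields
    return logging.LoggerAdapter(base, {"task_id": task_id, "dfk_id": dfk_id})

task_logger = make_task_logger(358, "dfk-1")

# no need to interpolate the task id by hand at each call site:
task_logger.info("initializing state to pending")
```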

Some log lines bracket an operation; to help with that, my prototype introduces a LexicalSpan context manager which can be used in a with block to mark the start and end of a span of work.
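A minimal sketch of what such a span context manager might look like (a guess at the behaviour, not the prototype's real LexicalSpan code):

```python
import logging
from contextlib import contextmanager

# Guess at a LexicalSpan-style helper: brackets an operation with
# matching start/end log events, emitting "end" even on exceptions.

@contextmanager
def lexical_span(logger, name):
    logger.info("%s: start", name)
    try:
        yield
    finally:
        logger.info("%s: end", name)

span_logger = logging.getLogger("span_sketch")
with lexical_span(span_logger, "Gathering dependencies"):
    pass  # the bracketed work goes here
```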

Move away from forming ad-hoc string templates and make log calls look more machine-readable. This is somewhat stylistic: with the task ID automatically logged, there is no need to substitute the task ID into some arbitrary subset of task-related log messages.

Acknowledgements

chronolog: nishchay, inna

desc: esp david adams, tom glanzman, jim chiang

uiuc: ved

gc: kevin

academy: alok
