.. Parsl wide event observability prototype report documentation master file,
   created by sphinx-quickstart on Sun Oct 26 10:13:01 2025. You can adapt
   this file completely to your liking, but it should at least contain the
   root `toctree` directive.

Parsl wide event observability prototype report
===============================================

.. toctree::
   :maxdepth: 2
   :caption: Contents:

Introduction
------------

These are notes about my current iteration of observability.

The Question
------------

Read this and give feedback on whether you think this should be the direction
Parsl goes.

Structure
---------

This report is in two related parts:

* emitting wide records from Parsl Python code
* ingesting and analysing records, including wide ones, from many sources

What is observability, in the Parsl context
-------------------------------------------

"Observability is a measure of how well internal states of a system can be
inferred from knowledge of its external outputs."
(https://en.wikipedia.org/wiki/Observability)

How can we tell what's going on inside a Parsl run from what we can see
externally? How can we expose more "stuff"?

That is an abstract notion, but in Parsl it is about logging and Parsl
monitoring - and this prototype is about doing both of those "better".

A motivating use case
---------------------

Here's a use case that is hard with what exists in master-branch Parsl right
now. I want to know, for a particular arbitrary task, the timings of the task
as it is submitted by the user workflow, flows through the DFK, into the htex
interchange and worker pool, executes on an htex worker, and flows back to
the user, with the timing of each step.

What exists in master Parsl right now is some information in monitoring, and
some information in log files. The monitoring information is focused on the
high-level task model, not on what is happening inside Parsl to run that
high-level model.
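The kind of per-task timeline this use case asks for can be sketched as a small merge-and-filter over parsed events. This is an illustration only: the event tuples and their fields here are invented for the sketch, not the prototype's actual data format.

```python
from datetime import datetime

# Illustrative parsed events: (timestamp, source file, task id, message).
# In the prototype these would be parsed out of several Parsl log files.
events = [
    (datetime(2025, 10, 26, 10, 29, 46, 467298), "parsl.log", 358,
     "will be sent to executor htex_Local"),
    (datetime(2025, 10, 26, 10, 29, 46, 470336), "interchange.log", 358,
     "putting onto pending_task_queue"),
    (datetime(2025, 10, 26, 10, 29, 46, 493948), "parsl.log", 358,
     "changing state from launched to exec_done"),
]

def timeline(events, task_id):
    """One task's events in time order, each with the delta from the previous step."""
    selected = sorted(e for e in events if e[2] == task_id)
    rows, prev = [], None
    for ts, source, _, message in selected:
        delta = (ts - prev).total_seconds() if prev else 0.0
        rows.append((ts, delta, source, message))
        prev = ts
    return rows

for ts, delta, source, message in timeline(events, 358):
    print(f"{ts} +{delta:.6f}s {message} ({source})")
```

The point is that once events from different files share timestamps and a task identifier, the "timing of each step" question becomes a trivial sort-and-diff.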
Logs as they exist now are extremely ad-hoc, spread around in at least four
different places, and poorly integrated: for example, log messages sometimes
do not contain context about which task they refer to, do not represent that
context uniformly (e.g. in a greppable way), and are ambiguous about context
(e.g. some places refer to task 1 meaning the DFK-level task 1, and some
places refer to task 1 meaning the HTEX-level task 1, which could be
something completely different).

As a contrast, an example output of this prototype (as of 2025-10-26) is:

.. code:: none

    === About task 358 ===
    2025-10-26 10:29:46.467298 MainThread@117098 Task 358: will be sent to executor htex_Local (parsl.log)
    2025-10-26 10:29:46.467412 MainThread@117098 Task 358: Adding output dependencies (parsl.log)
    2025-10-26 10:29:46.467484 MainThread@117098 Task 358: Added output dependencies (parsl.log)
    2025-10-26 10:29:46.467550 MainThread@117098 Task 358: Gathering dependencies: start (parsl.log)
    2025-10-26 10:29:46.467620 MainThread@117098 Task 358: Gathering dependencies: end (parsl.log)
    2025-10-26 10:29:46.467685 MainThread@117098 Task 358: submitted for App random_uuid, not waiting on any dependency (parsl.log)
    2025-10-26 10:29:46.467752 MainThread@117098 Task 358: has AppFuture: (parsl.log)
    2025-10-26 10:29:46.467818 MainThread@117098 Task 358: initializing state to pending (parsl.log)
    2025-10-26 10:29:46.469992 Task-Launch_0@117098 Task 358: changing state from pending to launched (parsl.log)
    2025-10-26 10:29:46.470113 Task-Launch_0@117098 Task 358: try 0 launched on executor htex_Local with executor id 340 (parsl.log)
    2025-10-26 10:29:46.470240 Task-Launch_0@117098 Task 358: Standard out will not be redirected. (parsl.log)
    2025-10-26 10:29:46.470310 Task-Launch_0@117098 Task 358: Standard error will not be redirected. (parsl.log)
    2025-10-26 10:29:46.470336 MainThread@117129 HTEX task 340: putting onto pending_task_queue (interchange log)
    2025-10-26 10:29:46.470404 MainThread@117129 HTEX task 340: fetched task (interchange log)
    2025-10-26 10:29:46.470815 Interchange-Communicator@117144 Putting HTEX task 340 into scheduler (Pool manager log)
    2025-10-26 10:29:46.471166 MainThread@117162 HTEX task 340: received executor task (Pool worker log)
    2025-10-26 10:29:46.492449 MainThread@117162 HTEX task 340: Completed task (Pool worker log)
    2025-10-26 10:29:46.492742 MainThread@117162 HTEX task 340: All processing finished for task (Pool worker log)
    2025-10-26 10:29:46.493508 MainThread@117129 HTEX task 340: Manager b'4f65802901c6': Removing task from manager (interchange log)
    2025-10-26 10:29:46.493948 HTEX-Result-Queue-Thread@117098 Task 358: changing state from launched to exec_done (parsl.log)
    2025-10-26 10:29:46.494729 HTEX-Result-Queue-Thread@117098 Task 358: Standard out will not be redirected. (parsl.log)
    2025-10-26 10:29:46.494905 HTEX-Result-Queue-Thread@117098 Task 358: Standard error will not be redirected. (parsl.log)

This integrates four log files and two task identifier systems into a single
sequence of events.

Case study 2
------------

Debugging monitoring performance as part of developing this prototype.

findcommon is a tool which finds the common task event sequence in templated
logs and outputs that sequence.

First run parsl-perf like this:

.. code:: none

    parsl-perf --config parsl/tests/configs/htex_local.py
    [...]
    ==== Iteration 3 ====
    Will run 58179 tasks to target 120 seconds runtime
    Submitting tasks / invoking apps
    All 58179 tasks submitted ... waiting for completion
    Submission took 103.880 seconds = 560.059 tasks/second
    Runtime: actual 137.225s vs target 120s
    Tasks per second: 423.967
    Tests complete - leaving DFK block

This executes a total of around 60,000 tasks.
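findcommon itself isn't listed in this report; the core of what it computes can be sketched as follows, assuming events have already been parsed into (task id, seconds, message template) tuples. The log parsing and message templating are elided, and this ignores any sequence-mining aspect: it just averages, per template, each template's offset from its task's first event, which is roughly the shape of the findcommon listings below.

```python
from collections import defaultdict

def findcommon(events):
    """Average, over all tasks, each message template's offset from that
    task's first event. Returns (mean offset, template) pairs sorted by
    mean offset.

    events: iterable of (task_id, t_seconds, template) tuples.
    """
    by_task = defaultdict(list)
    for task_id, t, template in events:
        by_task[task_id].append((t, template))

    offsets = defaultdict(list)
    for records in by_task.values():
        t0 = min(t for t, _ in records)  # each task's first event
        for t, template in records:
            offsets[template].append(t - t0)

    return sorted((sum(v) / len(v), tpl) for tpl, v in offsets.items())

# Two toy tasks following the same three-step sequence at different speeds:
events = [
    (1, 0.0, "Task %s: pending"), (1, 0.1, "Task %s: launched"), (1, 5.0, "Task %s: exec_done"),
    (2, 9.0, "Task %s: pending"), (2, 9.3, "Task %s: launched"), (2, 15.0, "Task %s: exec_done"),
]
ranking = findcommon(events)
```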
First, note that this prototype benchmarks on my laptop significantly slower
than the contemporaneous master branch, at . That's perhaps unsurprising:
this benchmark is incredibly log sensitive, as my previous posts have noted
(TODO: link to blog post and to R-performance report) - around 900 tasks per
second on a 120 second benchmark - and this prototype adds a lot of log
output. Part of the path to productionisation would be understanding and
constraining this.

From the output above, it is clear that the submission loop is taking a long
time: 100 seconds, with about 35 seconds of execution happening afterwards.
The Parsl core should be able to process task submissions much faster than
560 tasks per second. So what's taking up time there?

Run findcommon (a could-be-modular-but-isn't helper from this observability
prototype) on the result:

.. code:: none

    0.0: Task %s: will be sent to executor htex_local
    0.00023320618468031343: Task %s: Adding output dependencies
    0.0004515730863634116: Task %s: Added output dependencies
    0.000672943356177761: Task %s: Gathering dependencies: start
    0.0008952160973877195: Task %s: Gathering dependencies: end
    0.0011054732824941516: Task %s: submitted for App app, not waiting on any dependency
    0.001316777690507145: Task %s: has AppFuture: %s
    0.0015680651123983979: Task %s: initializing state to pending
    23.684763520758917: HTEX task %s: putting onto pending_task_queue
    23.68483662049256: HTEX task %s: fetched task
    23.684863335335613: Task %s: changing state from pending to launched
    23.6850573607536: Task %s: try %s launched on executor %s with executor id %s
    23.685248910492184: Task %s: Standard out will not be redirected.
    23.685424046734745: Task %s: Standard error will not be redirected.
    23.686276226995773: Putting HTEX task %s into scheduler
    23.686777094898495: HTEX task %s: received executor task
    23.687025900194147: HTEX task %s: Completed task
    23.687268549254735: HTEX task %s: All processing finished for task
    23.687837933843614: HTEX task %s: Manager %r: Removing task from manager
    23.688483699079185: Task %s: changing state from launched to exec_done

In this stylised synthetic task trace, a task takes an average of 23 seconds
to go from the first event (choosing an executor) to the final mark as done.
That's fairly consistent with the parsl-perf output: I would expect the
average here to be around half of parsl-perf's submission-to-completion time
(30 seconds).

What's useful about findcommon's output is that it shows the insides of
Parsl's workings in more depth: 20 states instead of parsl-perf's start,
submitted, end. And the potential exists to calculate other statistics on
these events.

So in this average case, there's something slow happening between setting
the task to pending, and the task then "simultaneously" being marked as
launched on the submit side and the interchange receiving it and placing it
in the pending task queue. That's a bit surprising - tasks are meant to
accumulate in the interchange, not before the interchange.

So let's perform some deeper investigations -- observability is for Serious
Investigators, and so it is fine to be hacking on the Parsl source code to
understand this more. (By hacking, I mean making temporary changes for the
investigation that will likely be thrown away rather than integrated into
master.)

Let's flesh out the whole submission process with some more log lines. On
the DFK side, that's pretty straightforward: the observability prototype has
a per-task logger which, if you have the task record, will attach log
messages to the task. For example, here are the changes to add a log around
the first call to launch_if_ready, which is probably the call that is
launching the task:

.. code:: none

    + task_logger.debug("TMP: dependencies added, calling launch_if_ready")
      self.launch_if_ready(task_record)
    + task_logger.debug("TMP: launch_if_ready returned")

My suspicion is that this is around the htex submission queues, with a
secondary submission around the launch executor, so to start with I'm going
to add more logging around that. Then rerun parsl-perf and findcommon,
without modifying either, and it turns out to be that secondary submission,
the launch executor:

.. code:: none

    0.0020453477688227: Task %s: TMP: submitted into launch pool executor
    0.002256870306434224: Task %s: TMP: launch_if_ready returned
    14.073021359217009: Task %s: TMP: before submitter lock
    [...]
    14.078550367412324: Task %s: changing state from launched to exec_done

Don't worry too much about the final time (14s) changing from 23s in the
earlier run -- that's a characteristic of parsl-perf batch sizes that I'm
working on in another branch.

If that's the case, I'd expect the thread pool executor, previously much
faster than htex, to show similar characteristics. Surprisingly, although
the throughput is not much higher, the trace looks very different timewise:
the bulk of the time still happens at the same place, but there isn't so
much waiting there - less than a second on average. That's possibly because
the executor can get through tasks much faster, so the queue doesn't build
up so much?

.. code:: none

    ==== Iteration 2 ====
    Will run 68976 tasks to target 120 seconds runtime
    Submitting tasks / invoking apps
    All 68976 tasks submitted ... waiting for completion
    Submission took 117.915 seconds = 584.965 tasks/second
    Runtime: actual 118.417s vs target 120s
    Tasks per second: 582.485

.. code:: none

    0.0: Task %s: will be sent to executor threads
    0.00014157412110423425: Task %s: Adding output dependencies
    0.0002898652725047201: Task %s: Added output dependencies
    0.000425118042214259: Task %s: Gathering dependencies: start
    0.0005696294991521399: Task %s: Gathering dependencies: end
    0.0006999648174108608: Task %s: submitted for App app, not waiting on any dependency
    0.0008433702196425292: Task %s: has AppFuture: %s
    0.0010710284919573986: Task %s: initializing state to pending
    0.0011652027385929428: Task %s: TMP: dependencies added, calling launch_if_ready
    0.0012973675719411494: Task %s: submitting into launch pool executor
    0.0014397921284467212: Task %s: submitted into launch pool executor
    0.0015767665501452072: Task %s: TMP: launch_if_ready returned
    0.3143575128217656: Task %s: before submitter lock
    0.31448896150771743: Task %s: after submitter lock, before executor.submit
    0.3146383380777917: Task %s: after before executor.submit
    0.3147926810507091: Task %s: changing state from pending to launched
    0.3149239369413048: Task %s: try 0 launched on executor threads
    0.31504996538376506: Task %s: Standard out will not be redirected.
    0.31504996538376506: Task %s: Standard out will not be redirected.
    0.3151759985402679: Task %s: Standard error will not be redirected.
    0.3151759985402679: Task %s: Standard error will not be redirected.
    0.315319734920821: Task %s: changing state from launched to exec_done

So maybe I can do some graphing of events to give more insight than these
averages are showing. A favourite of mine from previous monitoring work is
how many tasks are in each state at each moment in time. I'll have to
implement that for this observability prototype, because it's not done
already, but once done it should be reusable, and it should share most
infrastructure with `findcommon`. Especially relevant is discovering where
bottlenecks are: it looks like this is a Parsl-affecting performance
regression that might be keeping workers idle.
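The counting behind that tasks-per-state graph can be sketched from state-transition events. The event format here is invented for illustration; the prototype would source these from its "changing state from X to Y" log lines.

```python
from collections import Counter

def state_counts_over_time(transitions):
    """Given (t, task_id, old_state, new_state) transition events, yield a
    (t, {state: count}) snapshot after each transition, in time order.
    old_state is None when a task first appears."""
    counts = Counter()
    for t, task_id, old, new in sorted(transitions, key=lambda e: e[0]):
        if old is not None:
            counts[old] -= 1
        counts[new] += 1
        yield t, dict(+counts)  # unary + drops states whose count is zero

transitions = [
    (0.0, 1, None, "pending"),
    (0.1, 2, None, "pending"),
    (0.2, 1, "pending", "launched"),
    (0.5, 1, "launched", "exec_done"),
]
snapshots = list(state_counts_over_time(transitions))
# final snapshot: (0.5, {"pending": 1, "exec_done": 1})
```

Feeding the snapshot times and per-state counts into a plotting library gives the tasks-in-each-state-over-time view.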
For example, we could ask: does the interchange have "enough" tasks at all
times to keep dispatching? With 8 cores on my laptop, I'd like it to have at
least 8 or so tasks inside htex at any one time, but this looks like it
might not be true. Hopefully graphing will reveal more. It's also important
to note that this findcommon output shows latency, not throughput -- though
high latency at particular points is an indication of throughput problems.

Or, I can look at how many tasks are in the interchange over time: there
either is, or straightforwardly can be, a log line for that. That will fit a
different model to the above log lines, which are per-task. Instead it's a
metric on the state of one thing only: the interchange, of which there is
only one, at least for the purposes of this investigation.

Add a new log line like this into the interchange at a suitable point (after
task queueing, for example):

.. code:: none

    + ql = len(self.pending_task_queue)
    + logger.info(f"TMP: there are {ql} tasks in the pending task queue",
    +             extra={"metric": "pending_task_queue_length", "queued_tasks": ql})

Now one can either look through the logs by hand to see the value manually,
or extract it programmatically and plot it with matplotlib in an ad-hoc
script:

.. code:: python

    import matplotlib.pyplot as plt

    from parsl.observability.getlogs import getlogs

    logs = getlogs()

    # looking for these logs:
    #   "metric": "pending_task_queue_length", "queued_tasks": ql})
    metrics = [(float(l['created']), int(l['queued_tasks']))
               for l in logs
               if 'metric' in l and l['metric'] == "pending_task_queue_length"]

    plt.scatter(x=[m[0] for m in metrics], y=[m[1] for m in metrics])
    plt.show()

And indeed that shows that the interchange queue length almost never goes
above 1, and never above 10.
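That script relies on log records carrying the ``extra=`` fields in machine-readable form. A minimal JSON-lines formatter built on stock Python logging shows the idea; this is a sketch, not the prototype's JSONHandler, and the logger name is invented.

```python
import io
import json
import logging

class JsonLinesFormatter(logging.Formatter):
    """Emit one JSON object per log record, including any extra= fields."""

    # Attributes every LogRecord has; anything not in here arrived via extra=.
    _STANDARD = set(vars(logging.makeLogRecord({})))

    def format(self, record):
        out = {
            "created": record.created,
            "name": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for key, value in vars(record).items():
            if key not in self._STANDARD:
                out[key] = value  # e.g. "metric", "queued_tasks"
        return json.dumps(out)

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonLinesFormatter())
logger = logging.getLogger("interchange-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The same shape as the interchange log line suggested above:
ql = 3
logger.info("TMP: there are %s tasks in the pending task queue", ql,
            extra={"metric": "pending_task_queue_length", "queued_tasks": ql})

record = json.loads(stream.getvalue())
```

With records in this shape, a getlogs-style helper is little more than reading files and calling ``json.loads`` per line.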
That's enough for now, but it's a use case that shows partially
understanding throughput: we can see from this observability data that the
conceptual 50000-task queue that begins in parsl-perf as a `for`-loop
doesn't progress fast enough into the interchange's internal queue, and so
performance effort should probably be focused on understanding and improving
the code path around launch and getting into the interchange queue. With an
almost empty interchange queue, anything happening on the worker side is
probably not too relevant, at least for that parsl-perf use case. This
"understand the queue lengths (or implicit queue lengths) towards execution"
investigation style has been useful in understanding Parsl performance
limitations in the past.

Configurability
---------------

A soft start is to let people opt into observability-style logs: with most
of the performance hit coming from turning on JSON output, I think, adding
the extra stuff onto log calls doesn't matter too much performance-wise.

The current Parsl code is not set up for arbitrary log configuration outside
of the submit-side process: for example, the worker helpers don't do any log
config at all and rely on their enclosing per-executor environments to do
it, which I think some do not. The htex interchange and worker logs have a
hardcoded log config with a single debug boolean. I'd like to do something a
bit more flexible than adding more parameters, reflecting that in the future
people might want to configure their handlers differently rather than using
the JSONHandler.

Relation to monitoring and logging systems
------------------------------------------

Parsl uses Python's logging module. Parsl has its own monitoring system.
This work builds on both of those, but changes how they are used
significantly.

Why observability? vs monitoring
--------------------------------

The original monitoring prototype was focused on what is happening with
Parsl user-level concepts - tasks and blocks, for example - as they move
through simple states. Anything deeper is part of the idea that "Parsl makes
it so you don't have to think about anything happening inside". Which is not
how things are in reality: neither for code reliability nor for performance.

Parsl Monitoring is too strict in a couple of ways:

The data model is fairly hard-coded into the architecture: a specific SQL
schema, specific message formats, and specific places where those messages
are sent.

The transmission model is real-time. Even with recent radio plugins, the
assumption is still that messages will arrive soon after being sent. The
almost-real-time data transmission model is especially awkward when combined
with SQL: distributed system events will arrive at different times, or in
the original UDP model perhaps not at all, and the "first" message that
creates a task (for the purposes of the database) might arrive after some
secondary data that requires that primary key to exist. Yes, it's nice for
the SQL database to follow foreign key rules, especially when looking at the
data "afterwards", but that's not realistic for distributed, unreliable
events.

Contrast this to:

* arbitrary logs that may be different for different kinds of execution -
  for example, different executor implementations
* poring over these logs "later" - there's no need for those logs to
  accumulate in real time in one place for post-facto analysis. And in
  practice, when doing log analysis rather than monitoring analysis, "send
  me a tarball of your runinfo" is a standard technique.

Parsl Monitoring is not well suited to adding new ad-hoc events, perhaps
just for one-off debugging cases that will be thrown away: schema
modifications in both SQL and in message formats, and rearranging code to
accommodate that, is serious business.
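The contrast with the foreign-key problem can be made concrete: an append-only event store accepts events in any arrival order and reconstructs per-task structure only at query time. A toy sketch, with the event shape invented for illustration:

```python
events = []  # append-only: no schema enforcement, no foreign keys

def ingest(event):
    """Accept any event whenever it arrives - even before the event that
    'creates' the task it refers to."""
    events.append(event)

# Out-of-order arrival: the completion is seen before the creation.
# A foreign-key-constrained SQL insert would reject the first of these.
ingest({"t": 5.0, "task": 7, "what": "exec_done"})
ingest({"t": 0.0, "task": 7, "what": "created"})

def task_history(task_id):
    """Reconstruct one task's ordered history at query time, post facto."""
    return sorted((e for e in events if e.get("task") == task_id),
                  key=lambda e: e["t"])

history = [e["what"] for e in task_history(7)]  # ["created", "exec_done"]
```

Referential integrity, to the extent it exists at all, is established when the data is read rather than enforced when it is written.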
Contrast logging: there's always a logger in any part of Parsl, and you can
``logger.debug("hello")`` pretty much anywhere.

Parsl Monitoring was also implemented with a fixed queries/dashboard
mindset: one set of views that is expected to be sufficient. As time has
shown, people like to make other outputs from this data.

I want to debug/profile what's happening *inside Parsl* rather than *inside
the user workflow*.

Other components
----------------

Some components aren't Parsl-aware: for example, Work Queue has no notion of
a Parsl task ID, and it runs its own logging system, which is not Python and
so not amenable to Python monitoring radios.

Another example: swap out the interchange implementation for a different one
with a different internal model. A schema of events for task progress
through the original interchange doesn't necessarily work for some other
implementation.

ZMQ generates log messages which have been useful sometimes, and these could
be gatewayed into observability.

There are lots of different identifier spaces, loosely structured and not
necessarily hierarchical: for example, an htex task is not necessarily
"inside" a Parsl task, as htex can be used outside of a Parsl DFK (which is
where the notion of a Parsl task lives). An htex task often runs in a unix
process, but that process also runs many htex tasks, and an htex task also
has extent outside of that worker process: there's no containment
relationship either way.

Optional pieces of observability
--------------------------------

* log levels - INFO vs DEBUG
* missing log files - e.g. start with parsl.log, add in more files for more
  detail
* security - not so much the Parsl core use case, but e.g. GC executor vs GC
  endpoint logs vs GC central services have different security properties.

The observability approach needs to accommodate that, for any/all reasons,
some events won't be there. There can't be a "complete set of events" to
complain about being incomplete.
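One consequence is that analysis code has to treat every event as optional rather than failing on gaps. A small sketch of that defensive style, with the event shape invented for illustration:

```python
def span_duration(events, task_id, start_msg, end_msg):
    """Duration between two bracketing events for one task, or None when
    either event is absent - a missing event is expected, not an error."""
    times = {e["msg"]: e["t"] for e in events if e.get("task") == task_id}
    if start_msg in times and end_msg in times:
        return times[end_msg] - times[start_msg]
    return None

events = [
    {"t": 1.0, "task": 5, "msg": "launched"},
    # no exec_done for task 5: perhaps that log file was never collected
    {"t": 2.0, "task": 6, "msg": "launched"},
    {"t": 9.5, "task": 6, "msg": "exec_done"},
]

assert span_duration(events, 6, "launched", "exec_done") == 7.5
assert span_duration(events, 5, "launched", "exec_done") is None
```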
With less data, the reports, in whatever form, are less informative, to the
extent that the lack of data makes them so.

Scope for other hacking
-----------------------

* It should be easy to add other events - the core observability model
  shouldn't be prescriptive about what events exist or what they look like.
* It should be easy to use them in analysis.
* It should be easy to import some other event stream, in whatever format.
* Notions of moving logs around to a place of analysis should not be baked
  into the architecture. Realtime options, file-based options, ... - that is
  an area for experimentation (see Chronolog), and this work should
  facilitate that rather than being prescriptive.
* Storage and query of logs is also an area for experimentation. There are
  lots of hosted commercial services, and lots of small-scale stuff: e.g. at
  small enough scale, parsing logs into a Python session and using set and
  list comprehensions is a legitimate way to analyse things (rather than
  something awkwardly shameful that will be replaced by The Real Thing
  later) - especially given Parsl users' general exposure to data science in
  Python.
* Ignore Parsl Monitoring's per-task resource monitoring and do something
  else that generates similar observability records. There was always some
  disappointment around getting WQ resource monitoring into the Parsl
  monitoring database: what exists there that could be imported?

Inside the Python parts of Parsl, this data *is* available in realtime at
the point of logging, as it goes to whatever LogHandler is running in each
Python process. That isn't true in general on the "event model" side of
things, though.

Target audience
---------------

Serious debugger/profiler people, not management-dashboard types - although
management dashboards absolutely should be creatable with this observability
data.

Build your own stack
--------------------

Lots of observability commentary online talks as if you are building your
entire stack, to the extent that you care about observability.
Parsl is much more a pile of configurable components stuck together, all
with their own different options for observability/logging/monitoring, and
without an easy way for someone to add a consistent model throughout the
entire stack of code.

Algebra of merging logs
-----------------------

* widening
* joining
* post-facto relationship establishment
* relabelling

Look at relational algebra for phrasing and concepts.

There is a notion of identity and key-sequences: e.g.
parsl_dfk/parsl_task_id is a globally unique identifier for a Parsl task
across time and space, and so is parsl_dfk/executor_label/block_number or
parsl_dfk/executor_label/manager_id/worker_number -- although manager ID is
also (in short form) globally unique. This is distinct from the hierarchical
relations between entities - although hierarchical identity keys will often
line up with execution hierarchy. Peter Buneman's XML keys work did nested
sequences of keys for identifying XML fragments, c. year 2000.

Who wants this?
---------------

At least me. Many users don't come explicitly asking for monitoring-style
information, but do ask how to understand what's going on inside - and then
they are excited to use monitoring when it exists.

Browser UI
----------

What might a browser UI look like for this? Compare parsl-visualize. Compare
scrolling through logs, but with some more interactivity (e.g. click /
choose "show me logs from the same dfk/task_id").

Performance measurement of patch stack on 2025-10-27
----------------------------------------------------

.. code:: none

    pip install -e . && parsl-perf --config parsl/tests/configs/htex_local.py --iterate=1,1,1,10000

Running parsl-perf with constant block sizes (to avoid queue length speed
changes):

* master branch (165fdc5bf663ab7fd0d3ea7c2d8d177b02d731c5): 1139 tps
* more-task-tied-logs: 1024
* json-wide-log-records: 537 - but without initializing the JSONHandler: 1122
* end of branch with all changes up to now: 385

Applying this approach for Academy
----------------------------------

As an extreme case of "data might not be there" -- perhaps Parsl isn't there
at all. What do this code and these techniques look like applied to a
similar but very different codebase, Academy, which doesn't have any
distributed monitoring at all at the moment? There are ~100 log lines in the
Academy codebase right now. How much can be converted in a few hours, and
then analysed in similar ways? The point here is both considering this as a
real logging direction for Academy, and treating it as a proof of generality
beyond Parsl.

Thoughts:

repr-of-ID-object might not be the correct format for logging: I want stuff
that is nice strings for values, but repr (although it is a string) is
designed to look like a Python code fragment rather than to be the core
value of an object. Maybe ``str`` is better, and maybe some other way of
representing the ID is better still? The point is to have values that work
well in aggregate, database-style analysis, not values that are easy on the
human eye.

Academy logging so far has focused on looking pretty on the console: e.g.
ANSI colour - that's at the opposite end of the spectrum from what this
observability project is trying to log.

A rule of thumb for initial conversion: whatever is substituted into the
human message should also be added as an extras field.

Other components this might be applied to: Globus Compute, and various
related workflow systems that sit on top of Parsl. Parsl itself contains two
"mini-workflow-systems" on top of core Parsl: parsl-perf and the pytest
tests.
It could be interesting to illustrate how those fit in without being a core
part of Parsl observability.

See also
--------

* netlogger
* my dnpc work, an earlier iteration of this - more focused on human log
  parsing, and so very fragile in the face of improving log messages, with
  not enough context in the human component
* chronolog - https://grc.iit.edu/research/projects/chronolog/
* syslog

In-parsl Python coding
----------------------

Acknowledging observability as a first-order feature means we can make big
changes to code. Every log message needs to be visited to add context. In
many places a bunch of that context can be added by helpers: for example, in
my prototype, some module-level loggers are replaced by object-level
loggers. There is a per-task logger (actually a LoggerAdapter) in the
TaskRecord, and logging to that automatically adds relevant DFK and task
metadata: at most log sites, the change to add that metadata is to switch
from invoking methods on the module-level logger object to invoking them on
the new task-level logger instead.

Some log lines bracket an operation, and to help with that, my prototype
introduces a LexicalSpan context manager which can be used as part of a
`with` block to identify the span of work starting and ending.

Move away from forming ad-hoc string templates and make log calls look more
machine-readable. This is somewhat stylistic: with the task ID automatically
logged, there is no need to substitute the task ID into some arbitrary
subset of task-related logs.

Acknowledgements
----------------

* chronolog: nishchay, inna
* desc: esp david adams, tom glanzman, jim chiang
* uiuc: ved
* gc: kevin
* academy: alok

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`