.. lshpa documentation master file, created by
   sphinx-quickstart on Fri Nov 15 13:15:31 2024.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Plotting Parsl Performance Interactions with R
==============================================

.. toctree::
   :maxdepth: 2
   :caption: Contents:

Introduction
------------

I wanted to learn some R, specifically `ggplot2 `_ which is R's
implementation of the grammar of graphics. At the same time I wanted to
explore the interactions between different ways of improving `Parsl `_
performance. This post is about me doing those two things together.

Although performance was a focus for Parsl around the 2018 era (for example,
see the `Parsl HPDC 2018 paper `_), most of my work on Parsl has been in other
areas: reliability, usability and maintainability, and the project as a whole
has not done much thinking about performance. Most performance work in that
period has been for specific applications, in specific environments, and has
not had a lot of methodology - usually it is a user complaining that some run
is not what they imagine the theoretical performance should be, and fiddling
around with that.

Recently I became interested in exploring things a bit more holistically:
different performance options don't always make things faster (otherwise we'd
set them by default) and different performance options interact in different
ways that make some of them a waste of time.

The measurement system
----------------------

I made a test (based around a hacked ``parsl-perf``) that tries a single-node
workload that is intensive on the Parsl infrastructure, by running as many
empty tasks as possible. (Other load tests are available...) This is a black
box test that returns a time-per-task score as its only output, given a
particular set of treatment factors as input.

By treatment factor, I mean in a very broad sense something that might make
things faster or slower: changing Parsl configuration options; patching the
Parsl source code; changing the runtime environment; changing the test
application.

This black box approach is a contrast to a lot of performance work I've done
before, where I have instrumented individual pieces of Parsl code and tried
to understand what was taking time and how to make those individual pieces
run faster.

I made a driver script which ran many different treatments and left it to run
on a server for a few months. This picks configurations in a few different
ways - for example, some random configurations; some using `z3 `_ to satisfy
some constraints about balancing the experiment; and some explicitly
specified by me because I wanted to explore a particular piece of the
treatment space.

I ended up with a dataset of around 15000 Parsl runs, each giving a time cost
per task, labelled with the treatment factors that were applied to that run.

Here's the average speedup caused by each factor, across all of those runs:

.. image:: 1_means.png

The units of measurement are microseconds per task, across a batch run of
dummy jobs. For some perspective: the base case with no treatments takes
around 1000 microseconds per task (around 1000 tasks per second), and the
best I have observed is around 60 microseconds per task (around 15000 tasks
per second).

The dataset
-----------

The raw output of this is something like a data frame, in CSV format: the
selection of treatment factors applied, the time the run happened, and the
time cost per task.

From this I generate a list of deltas for each treatment factor: a Python
script finds every pair of runs X and X+{t}, and calculates the cost (or
saving) of applying treatment factor t. The output of this stage is another
data frame in CSV: the treatment factor name, the per-task cost before and
after, the delta (denormalised, because it's a function of the previous two
columns), and a series of boolean columns indicating whether each other
treatment factor had already been applied (i.e. the set membership of X is
separated out into membership booleans). Most of the graphs and analysis that
I've done use this deltas data frame.

I chose CSV (actually semicolon-separated values, Spanish style) because it
works well with the shape and types of the data I'm expressing, and it is
easy to read and write from different languages, which is something I care
about for this project. You could load this straightforwardly into Excel if
your preferred data platform is Excel, or get it into sqlite3 and make
summaries using SQL - but I've only used Python and R.

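For concreteness, here's a small sketch in R of that pairing idea. This is
not the real generator (that is a Python script); the file name and the
``timestamp`` / ``us_per_task`` column names are made up for illustration,
and the treatment columns are assumed to hold 0/1 values.

.. code-block:: r

    # Sketch of the pairing step described above: for every pair of runs
    # whose treatment sets differ by exactly one factor t, record the cost
    # before and after applying t.
    runs <- read.csv2("runs.csv")   # read.csv2 reads semicolon-separated files
    factor_cols <- setdiff(names(runs), c("timestamp", "us_per_task"))

    # one key string per run, identifying exactly which treatments were applied
    run_key <- apply(runs[factor_cols], 1, paste, collapse = "")
    by_key <- split(seq_len(nrow(runs)), run_key)

    deltas <- list()
    for (i in seq_len(nrow(runs))) {
      applied <- factor_cols[runs[i, factor_cols] == 1]
      for (t in applied) {
        # run i is X+{t}; look up every run X with the same set minus t
        base <- runs[i, factor_cols]
        base[[t]] <- 0
        for (j in by_key[[paste(base, collapse = "")]]) {
          deltas[[length(deltas) + 1]] <- cbind(
            data.frame(treatment = t,
                       before = runs$us_per_task[j],
                       after  = runs$us_per_task[i],
                       delta  = runs$us_per_task[i] - runs$us_per_task[j]),
            base)   # membership columns: which other factors were already applied
        }
      }
    }
    deltas <- do.call(rbind, deltas)

Each row of the resulting frame is one before/after pair for one treatment
factor, which is the shape most of the plots below are drawn from.
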
Treatments for logging
----------------------

From that earlier plot, the two strongest treatment factors by far are to do
with logging:

* ``cfg_no_init_logging`` reconfigures Parsl to not write to the default
  ``parsl.log`` file.

* ``src_submit_biglogs_remove`` deletes the noisiest log statements from the
  Parsl source code.

The graph of mean savings doesn't tell the whole story, though. From that
graph it might look like there are two steps to speeding things up: disable
``parsl.log`` (``cfg_no_init_logging``) for some big improvement and then
speed things up even more by deleting log lines from the Parsl source code
(``src_submit_biglogs_remove``).

That's not really the case, though. It turns out that these two treatments
interact with each other in a pretty straightforward way: whichever treatment
is applied first makes things run a lot faster. Applying the other treatment
after that doesn't give much further speedup.

Here's a scatter plot that shows every delta measuring the effect of
``cfg_no_init_logging``. A lot of the points show a huge improvement
(downwards from the diagonal no-effect line), but a lot of the points show no
improvement at all - they're right on the "no change" line. With a bit of
colouring, it's straightforward to see that there is usually no change if
``src_submit_biglogs_remove`` has already been applied, and there's a big
change if that hasn't already been applied.

.. image:: 4_before_after_scatter.png

I think this is a pretty understandable interaction: there's a cost to
writing logs to ``parsl.log``, and it doesn't really matter if that cost is
saved by turning off log file writing, or if that same cost is saved by
deleting the logger invocations: the same cost is removed either way.

Here's a similar plot, but with the two treatments flipped: each point is a
delta for applying ``src_submit_biglogs_remove``, coloured by whether
``cfg_no_init_logging`` was applied before or after. It shows basically the
same relationship between the two treatments.

.. image:: 4_before_after_scatter_flip.png

What can we learn about Parsl here?

* In the performance space, logging isn't some background distraction to be
  ignored: in this example, it is the primary cause of slowdown/speedup!
  (I wonder, even, if our HPDC 2018 paper results could have been
  substantially improved with this knowledge?)

* It looks like most of the cost of logging comes from writing to disk: there
  isn't much to be gained from removing log statements from the source code
  for the sake of it, but there might be something valuable in considering
  *what* / *how much* is written to disk, beyond Parsl's current two modes of
  "everything ``DEBUG`` and higher" or "nothing" (because nothing is a bit
  useless if you're trying to understand what happened in a run):

  - how users might configure that more accurately (but without re-inventing
    Python's log configuration DSL) - the ``cfg_*`` side of things;

  - how we might pay more attention to what is logged at each level (or
    logged at all) - the ``src_*`` side of things.

* We might consider *how* log lines are written to the filesystem: right now
  (I think) there's a flush after every log line output - which, especially
  on a shared/network filesystem, is probably incredibly expensive. We might
  target other behaviours (in code or in explanation) - for example, don't
  flush on ``DEBUG`` lines.

* We might consider *where* logs are written. The ``cfg_rundir_ramfs``
  treatment puts logs in an in-memory filesystem, and captures some of that
  cost-of-file-writing speedup compared to writing to the default persistent
  ``/home`` filesystem.

* It's suspicious to me that removing the log lines seems to average so much
  lower benefit than not configuring the logging - but I think that comes
  from a slightly different feature interaction with lazy logging that I will
  talk about elsewhere.

What if we want logging?
------------------------

Getting rid of logging might seem like an easy win...

... but it's not necessarily easy: you now lose out on a bunch of
debuggability.

... and it might not be a win at all - if you can't understand what's
breaking, you can't fix things and maybe can't complete your run at all.

So let's remove all the readings with either of the two big log reduction
factors and plot that first diagram again (a sketch of this filtering step
appears at the end of this section). This isn't the same as just cropping off
the first two bars of the first diagram, because features interact with each
other.

.. image:: 1_means_keep_logs.png

Some of the factors have got much stronger, some have got a bit weaker - to
compare the changes a bit more directly:

.. image:: 1_means_both.png

A lot of the most effective treatments are now more effective on average: I
think that means with more logging in place, there is more overhead to be
saved through these other means. Prefetching is less effective, though.

There are a few surprises in this plot, though - none of which I deeply
understand.

* ``cfg_encrypted`` has gone from an expected negative effect to somehow
  being an improvement.

* In worker-side logging, ``src_pool_logs_remove`` used to have little
  average effect, but with submit-side logging forced on, that treatment now
  has a noticeable negative effect - I'll come to that in a later section
  (the puzzle of removing logs). Conversely, ``cfg_worker_debug`` used to
  have a strong negative effect and now has almost no average effect.

* ``src_danger_no_dict_proxy``, which removes some error handling code, went
  from being on average a small improvement to being on average quite a
  disimprovement.

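For reference, the filtering mentioned above can be sketched like this,
reusing the hypothetical ``deltas`` frame from earlier (the membership
columns are assumed to be named after their treatment factors; this is not my
exact plotting code):

.. code-block:: r

    # Filtered version of the first bar chart: drop every delta that measures
    # a log treatment, or that was measured on top of one.
    library(ggplot2)

    keep <- deltas$treatment != "cfg_no_init_logging" &
            deltas$treatment != "src_submit_biglogs_remove" &
            !(deltas$cfg_no_init_logging | deltas$src_submit_biglogs_remove)

    means <- aggregate(delta ~ treatment, data = deltas[keep, ], FUN = mean)

    ggplot(means, aes(x = reorder(treatment, delta), y = delta)) +
      geom_col() +
      coord_flip() +
      labs(x = NULL, y = "mean change in microseconds per task")
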
Metrics and scales
------------------

In the previous plots, I've used a metric of (micro)seconds per task.

In the Parsl HPDC 2018 paper, there are a few different related metrics:

* time per task (in the section that looks like it is measuring the full
  round trip time for a single task - not what is happening in my
  measurements here, because I allow all the tasks to happen concurrently);

* related to that, the total time to execute some variable number of tasks
  (in the section measuring how long it takes to execute a large batch of
  tasks - much closer to what I'm measuring here);

* tasks/sec - this was used for measuring throughput in section 5.3 of the
  HPDC paper.

This last measurement is "the same" as the time per task: there is an
isomorphism (a = 1/b, both ways) between the two representations. But that
reciprocal has some serious non-linearity that can seriously change how you
visualize and imagine the effect of any particular factor.

Here's an example starting at the untreated setup (around 1182 microseconds
per task), adding treatments one by one until reaching one of the best setups
(around 60 microseconds per task):

.. image:: 6_shap.png

That shows that treating the setup with ``cfg_no_init_logging`` - turning off
submit side logging in the Parsl configuration - is by far the main way to
get from 1182 to 61 microseconds: around 80% of the per-task time evaporates
with that configuration option alone. Maybe it's not even worth considering
the other treatments at all, they're all so small!

But let's flip the x-axis round and plot the same data using a tasks/second
axis:

.. image:: 6_shap_by_rate.png

This presents a very different story: setting ``cfg_no_init_logging`` does
have a noticeable speedup - taking the rate from around 850 up to 4166 tasks
per second... but in terms of linear tasks/second, a few of what seemed like
small improvements in the last plot bring quite big speedups when measured as
tasks/second. The last 179 microseconds of saving from all the other
treatments take the rate from 4166 tasks per second all the way up to 16393
tasks per second!

What sort of scales, measurements and visualisations make sense? Here's
another plot from the same sequence of runs, this time representing the
per-task time by area:

.. image:: 6_shap_treemap.png

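Since the two scales are just reciprocals of each other, the non-linearity is
easy to see with a few lines of arithmetic, using the numbers quoted above
(``rate`` here is just a helper for the conversion):

.. code-block:: r

    rate <- function(us_per_task) 1e6 / us_per_task    # tasks per second

    rate(1182)   # untreated setup: ~846 tasks/s
    rate(240)    # roughly where cfg_no_init_logging alone gets you: ~4166 tasks/s
    rate(61)     # one of the best treated setups: ~16393 tasks/s

    # the same-sized saving in microseconds is worth very different amounts
    # of throughput depending on where you start from
    rate(240 - 179) - rate(240)      # ~12200 extra tasks/s at the fast end
    rate(1182 - 179) - rate(1182)    # ~150 extra tasks/s at the slow end
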
Telling different stories with the same slow and fast runs
----------------------------------------------------------

That sequence of applying treatment factors to get from slow to fast is just
one of many possible sequences - one of something like 17-factorial
possibilities. This particular sequence was chosen by iteratively applying
the best factor at every step until all of the optimal factors had been
applied.

Here's a different sequence, chosen to favour the "nicest" factors first:
what savings can be gained by safe changes to the environment and
configuration first, then modifications to Parsl source that do not remove
features, then configuration options that disable features, and then
modifications to Parsl source that do remove features.

The start and end runs are the same: going from 1182 down to 61 microseconds
per task (or 850 tasks per second up to 16000 tasks per second). But the
costs are allocated a bit differently:

.. image:: 6_shap_by_rate_nice.png

.. image:: 6_shap_treemap_nice.png

In this telling of the story, paying attention to the Python version has the
biggest effect, although disabling logs still has quite a big effect.

So in this multitude of paths, we might tell different stories depending on
our preferences: we might want to show that when people complain about
performance, it is their own fault for misconfiguring their environment,
attributing blame as far away from the codebase as possible. Or we might want
to highlight which source code changes are most worth investigating, because
that is what we as Parsl developers control the most.

Python 3.13 nogil jit opt
-------------------------

The ``python313t_nogil_jit`` treatment factor scores highest after the two
log factors. This factor changes the Python runtime from the OS-supplied
Python 3.11 install to a self-compiled Python 3.13 with all these options
enabled:

* expensive optimization (``--enable-optimizations``), some of which is based
  on profiling the target machine;

* no Global Interpreter Lock - this is a new Python feature that promises to
  unlock more concurrency;

* Just In Time compilation - another experimental feature.

At the time I implemented this factor, I didn't realise how big an effect it
would have, so I didn't split these options out into separate factors. But I
think the scale of these changes is big enough that Parsl developers should
get more understanding of what is happening here.

Unlike disabling logs, switching to a "better" Python version doesn't remove
features and is also the "natural path of progress". So this is a very
positive direction.

Affinity
--------

Users at ALCF have been very interested in process affinity to hardware
threads (or cores or CPUs, or whatever that unit is called this year). Even
on my laptop, I've informally seen decent performance improvements.

So I looked at which processes were using CPU time in the base case, and made
a custom affinity scheme for that, pairing up processes to particular cores
(pairs of hardware threads, in the case of this test server). That scheme is
activated by the ``affinity`` factor, and looks like this:

.. list-table::
   :header-rows: 1

   * - Hardware thread
     - Assignment
   * - 6,7
     - workflow submission process
   * - 5
     - interchange
   * - 4
     - process worker pool manager(s)
   * - 0-3
     - one worker per hardware thread

This works really well when applied to slow runs, but runs out of steam and
actually slows things down for faster runs:

.. image:: 4_before_after_affinity.png

``affinity`` doesn't appear in the list of optimal treatments I talked about
before, and that is because (I think) when runs are down in the fast end of
things, this treatment hinders rather than helps. That doesn't mean *all* CPU
affinity work is going to be harmful.

What can Parsl learn? Caring about CPU use/assignment is relevant, probably
even in "small" use cases.

The puzzle of removing logs
---------------------------

At the slow end of things, removing log lines sometimes makes things slower.
This plot shows removing log lines from the worker side of Parsl:

.. image:: 4_before_after_scatter_pool_logs_remove.png

When the task rate is slow, removing log lines from the codebase makes things
slower! I had a hypothesis: that when ``cfg_worker_debug`` is enabled, the
additional filesystem latency introduces more opportunities for other
processes to get a share of the CPU. I plotted that, and it looks to be the
other way round: the bad runs are all with ``cfg_worker_debug`` turned off.

.. image:: 4_before_after_scatter_pool_logs_remove_coloured.png

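All of the before/after scatter plots in this post have roughly the same
shape, and a plot like the one above can be sketched in a few lines of
ggplot2 (not my exact plotting code; it reuses the hypothetical ``deltas``
frame and column names from earlier):

.. code-block:: r

    library(ggplot2)

    d <- subset(deltas, treatment == "src_pool_logs_remove")

    ggplot(d, aes(x = before, y = after, colour = cfg_worker_debug == 1)) +
      geom_point(alpha = 0.3) +
      geom_abline(slope = 1, intercept = 0) +   # the diagonal "no change" line
      labs(x = "microseconds per task before",
           y = "microseconds per task after",
           colour = "cfg_worker_debug already applied?")
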
Other top treatments
--------------------

The other top treatments on that first plot that I haven't mentioned so far
are:

* ``cfg_prefetch_100`` - this maintains a queue of tasks in each worker pool,
  which eliminates latency when a worker completes a task and by default has
  to do a round trip to the interchange to get the next task. Due to a typo
  in my implementation, this actually maintains a queue of 1000 tasks in each
  worker pool. I can imagine this having even more effect when workers are
  connected to the interchange over the network, so that the round trip
  latency is much higher. Why isn't this always on? Because it introduces
  load imbalance between worker pools: with a 1000-task queue, 1000 tasks
  might go to one pool and 0 tasks to all the others. With 1-hour tasks on
  1000 worker nodes, that's a prohibitive imbalance.

* ``cfg_poll_100`` - various pieces of Parsl poll at (by default) 10ms
  intervals. This treatment changes that to 100ms, which pays off in the
  slower runs where processes are fighting for CPU.

* ``src_interchange_longer_poll`` - this decouples the poll loop in the
  interchange from the above configuration.

Proposed pull requests
----------------------

It's rare for Parsl pull requests to be accompanied by any serious
performance analysis. Occasionally I'll run ``parsl-perf`` on my laptop if I
feel like a change is going to affect that, but nothing more formal.

A few of the treatment factors in these runs are potential pull requests to
go into the codebase:

* ``cfg_encrypted``, which turns ZMQ security on. This is off by default but
  should be on by default. The main concern with turning it on by default was
  performance, but this work suggests encryption in this case is not
  particularly expensive on the scale of other treatments.

* log formatting laziness - rather than formatting a log message every time,
  Python logging lets you defer the formatting until the message string is
  needed. In the case that the string is never needed (for example, the log
  line is ``DEBUG`` but logging is set to ``INFO``), the formatting never
  happens. This is a style that we've recently been pushing on in Parsl on
  the assumption that it is a good thing to do, but without any concrete
  numbers.

* interchange exit polling - the interchange process doesn't always exit. It
  is usually meant to be terminated by the main workflow process, but if that
  doesn't happen, then the interchange sits around forever. This change makes
  the interchange check for liveness of its controlling workflow process on
  each poll iteration.

* interchange single threading - the interchange has two threads which
  communicate by polling. That feels awkward, and doing everything with a
  single thread feels nicer to some of the Parsl developers. It is
  interesting to see how badly a naive implementation of that affects
  performance in this use case (although I'd hope it would decrease latency
  and CPU load in other situations).

Per-factor log colouring
------------------------

The colour scale is hard to get right here. I'm trying it out as the base 2
logarithm of the fraction of the ``before_score`` saved or added by this
treatment.

.. image:: 7_colour.png

What's interesting here? You can see ``affinity`` get worse (redder) at the
far end. But ``python313t_nogil_jit`` also gets worse a bit like that - even
though it is in the optimal treatment set!

(There's a mix of blue and red at the fast end, so maybe there's something
else interesting to investigate here.)

``cfg_prefetch_100`` shows the most improvement at the fast end - in contrast
to the log removal (and the Python and affinity treatments) which help you
get to the fast zone to begin with.

Linear Regression
-----------------

A similar but different approach is to fit a linear model to the runs
dataset, ignoring individual before/after deltas. I used R's ``lm`` linear
model function to fit all 15000 runs, with the independent variables being
whether each treatment was applied, and while I was at it, I threw in the
unix timestamp and a couple of cosine-based time-of-day variables.

Unsurprisingly this doesn't fit very well - there's an overall sense of what
is going to be faster or slower, at least. Down at the fast end, there's a
huge range on the y axis for any given real x-axis location. And that
recurring mystery splodge of badness at the slow end shows up again.

.. image:: 4_lm3.png

The linear model has fairly simple-to-understand parameters: each treatment
factor has a microseconds-per-task cost that it saves (or adds) when applied,
fitted across the whole dataset. This is spiritually the same as the first
plot I showed, even though it is computed in a very different way. So here
they are plotted next to each other:

.. image:: 4_lm3_cols.png

Linear Regression with Interactions
-----------------------------------

Next I added interaction variables between all pairs of treatments. This
gives about 400 more parameters to fit for the linear model.

.. image:: 4_lm6.png

This is a better fit, but the many interaction variables are harder to
interpret for explainability. (The aim of all this is not to *predict*, but
to *explain* and *understand*.)

For example, this model has these costs for the two big log treatment
factors - higher than in the purely linear model, and more descriptive of the
real cost saving from applying just one of these treatments:

.. code-block::

    cfg_no_init_logging        -5.771e+02
    src_submit_biglogs_remove  -4.907e+02

with an interaction variable between them - an additional cost that is
applied if both of these treatments are applied:

.. code-block::

    interac.cfg_no_init_logging.src_submit_biglogs_remove  3.580e+02

which captures that if you apply both, then you don't get anywhere near the
simple sum of the two treatments.

Here are the interaction variables for every pair of treatments, with
stronger negative interactions more to the top right. The graph is symmetric
around the diagonal, because the cost of A+B is the same as the cost of B+A.

.. image:: 4_lm6_tiles.png

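For reference, the two fits above are essentially one-liners in R's formula
interface. This is a sketch rather than my exact call: ``runs`` stands for
the raw frame from the earlier sketches, with ``us_per_task`` as the response
and one column per treatment. The ``interac.*`` coefficient names above come
from explicitly constructed interaction columns; the ``.^2`` formula below is
the built-in way to get the same pairwise interactions.

.. code-block:: r

    # Additive model: one microseconds-per-task coefficient per column
    # (column names are the made-up ones from the earlier sketches).
    m_linear <- lm(us_per_task ~ ., data = runs)

    # The same, plus an interaction term for every pair of variables -
    # roughly 400 extra coefficients.
    m_interact <- lm(us_per_task ~ .^2, data = runs)

    summary(m_linear)
    summary(m_interact)
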
Eigenvectors and Principal Components
-------------------------------------

Principal component analysis works with vectors of values, trying to squish
information across all dimensions into a smaller number of dimensions (the
principal components).

What I'm going to do here is make 30-dimensional vectors that have one value
for each treatment. Each vector will be formed by cumulatively applying
treatment factors in some sequence, and measuring the delta for each
treatment. That sequence will be different for each vector. With the current
dataset, that results in about 550 vectors. So the data points in this
analysis will be entire sequences of applying every treatment, rather than
individual treatments.

Here's one of the standard ways of visualizing a PCA:

.. image:: 8_pca_biplot.png

That squishes all of the 30-dimensional vectors down into a 2d space that
captures (in this case) 50% of the variance of the dataset.

The biggest component shows ``cfg_no_init_logging`` and
``src_submit_biglogs_remove`` in opposition to each other: it's like the PCA
is trying to collapse those two dimensions down into a single dimension: do
you want to save your log time by removing the log statements, or by not
logging to a file? Either-or, not both. That aligns with the earlier
explanation of choosing between those two options.

The other axis is a bit more of a surprise to me: quite a bit of variance in
the results comes from sliding between affinity and Python 3.13 - maybe
that's what's being shown in the earlier coloured plot where Python 3.13 gets
worse at fast speeds?

The other standard visualization for PCAs is a scree plot, which shows how
much variance is captured in each successive component - it shows the first
two dimensions capture 50% of the variance, with the other 28 dimensions
capturing the other 50%.

.. image:: 8_pca_screeplot.png

This analysis doesn't really show much beyond what I already knew from
earlier on.

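The PCA itself is only a couple of lines of base R. As a sketch (not my exact
code), with ``seq_deltas`` standing for the roughly 550 x 30 matrix of
per-sequence deltas described above:

.. code-block:: r

    # rows: one treatment-application sequence
    # columns: the delta measured when each treatment was applied in that sequence
    # (whether to centre/scale here is a modelling choice)
    pca <- prcomp(seq_deltas, center = TRUE, scale. = TRUE)

    biplot(pca)       # the 2-d projection shown above
    screeplot(pca)    # variance captured by each successive component
    summary(pca)      # proportion of variance, cumulatively
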
Experimental design
-------------------

We probably don't need anywhere near as many runs as I have made: for basic
summary stats, these averages have been pretty consistent since near the
beginning of my runs. The subsequent compute time might have been more
interestingly used on more focused questions, or not used at all.

I didn't set out with much opinion on how to design the experiment: I just
picked a bunch of different treatments (and added more over time) and set my
script running.

Some treatment factors might make more sense measured and analysed as
continuous variables: for example, ``cfg_prefetch_N``. There are also some
treatment factors that can't both be applied sensibly: changing a log line to
be lazy, and removing that log line. The boolean setup I have so far isn't
really set up for measuring or analysing that.

Other treatment factors
-----------------------

Here are some ideas for different treatment factors:

* different environments: for example, running on a particular HPC system
  rather than my small test server

* different workloads: rather than no-op tasks, have tasks that sleep, make
  intense CPU use, make intense memory use, make intense IO, make workflows
  that have lots of sequential dependencies - probably some difficulty in
  figuring out a decent cost measurement there, perhaps overhead on top of
  theoretical perfect performance?

* deeper investigation of the various Python 3.13 options, because in this
  work I made 4 changes at once.

* When ZMQ encryption was being implemented, there was folklore around
  different ZMQ installs giving very different performance behaviour.
  Investigate that properly.

* Different executors - perhaps using a workload/metric that isn't targeted
  at high throughput?

Project ideas
-------------

* Run similar methodology on different (more realistic) workloads and
  execution environments - build a "zoo" of interesting measurable things?
  (c.f. `TaPS `_ but with scores on a suitable scale)

* Finding an optimised configuration treatment set feels a bit
  genetic-algorithmy to me - "feed in many possible configuration tweaks,
  give a target application to optimise, sit back and watch it go". That is,
  semi-automatically do this kind of exploration on a user's real target
  application in a real environment - as a user tool?

* How to use this to understand how Parsl performance changes over time?
  (both retrospectively and forwards with new/proposed PRs) (e.g. CI-style
  performance regression discovery, and the PR review process)

* How to wonder about whether a particular change would help or hinder, and
  have that magically answered? (see the proposed pull requests section)

* Pay attention to what is logged, how it is logged, at what level it is
  logged, what choices the user has for logging, and how that is explained -
  that's probably not a student project because it needs someone who has lots
  of experience in debugging to be able to evaluate logs.

* ``src_submit_biglogs_remove``, with its huge effect, removes 17 log lines -
  so would removing (or changing the level of) 1/17th of those make things
  1/17th faster? Or are some log lines substantially more expensive? (This
  could also be 17 different treatments... if we wanted to measure that in
  this framework.)

* Someone who knows about stats... how to do these stats properly?

* Figure out how the different pieces of the Python 3.13 factor cost/save
  (3.11 to 3.13; JIT; no GIL; local platform optimisation).

* Investigate more generic affinity schemes, and other Linux process
  scheduling options (batch, realtime, niceness).

Comparison to HPDC 2018 Parsl paper measurements
------------------------------------------------

This is the Parsl HPDC 2018 paper:
https://web.cels.anl.gov/~woz/papers/Parsl_2019.pdf

The principal comparisons in the HPDC paper are to show how Parsl scales
under different task batch and worker counts on real HPC systems. The
measurements I've done here are not on a real HPC system and are not varying
the batch sizes (or at least, not very much - there is one treatment,
``app_big``, which doesn't affect things much) or worker counts (again not
much - there are more noticeable effects in the two treatments here:
``cfg_2_blocks`` and ``cfg_4_workers``).

There is also adversarial presentation of other execution systems aimed at
showing that Parsl is better. In this work, the adversary is other treated
setups of the same Parsl. Implicit in the HPDC paper, I think, is "the best
Parsl setup we could get". Here, I'm focusing more on understanding some of
the different things that you could do to get the best Parsl setup (in a very
focused environment, but hopefully leading to understanding of broader
effects).

The tasks-per-second rate on real machines with Parsl's HTEX-related
executors in the HPDC paper is around 1000 tasks/second. That's roughly the
same as (but better than) the no-treatment rate here - and far below the
fastest runs I've seen in this environment.

End.

Indices and tables
==================

* :ref:`genindex`