.. Parsl wide event observability prototype report documentation master file, created by
   sphinx-quickstart on Sun Oct 26 10:13:01 2025. You can adapt this file completely to
   your liking, but it should at least contain the root `toctree` directive.

Wide event observability prototype report
==========================================

.. toctree::
   :maxdepth: 2
   :caption: Contents:

Introduction
============

.. index:: observability

These are notes about my current iteration of a Parsl and Academy *observability*
prototype. It is intended to help with plugin style integration between those two
components and an open collection of friends including Globus Compute, Diaspora and
Chronolog.

As an abstract concept: "Observability is a measure of how well internal states of a
system can be inferred from knowledge of its external outputs."
(https://en.wikipedia.org/wiki/Observability).
In the context of this project, it means outputting enough information about the system
to understand why bugs happen.

There is plenty to read about observability on the web: google around for more info.

Observability is often neglected as part of the core functionality of a research
prototype, as demonstrations are run in controlled environments with the original
authors both ready to respond to the slightest problem with copious time, and ready to
restart everything from scratch repeatedly until the desired outcome is achieved.

As soon as that prototype is forced into production, those two properties evaporate and
the need for observability manifests: both users and legacy developers need to
understand what is happening in this suddenly wider and more hostile world.

In the Parsl world, observability exists now as two separate systems: log files and
Parsl Monitoring. This report will explore ways in which they can be usefully unified
and extended.

.. index::
   single: DESC
   single: CZI
   single: NSF

This project builds on experiences debugging Parsl within the DESC project, as well as
work sponsored by NSF and CZI understanding pluggability and code maturity as they
affect architectural decisions.
As a concurrent activity, I have used some of this experience to push changes into the
Academy codebase to support its move towards production.

A distant vision is a project-wide or personal-space-wide observability system -- but
it is important to acknowledge that this is a distant and vague vision, and that what I
actually want to happen is work on the scale of weeks to months that is usable on that
timescale, with others left to take up that distant vision if desired.

This report attempts to describe abstract concepts but ground them in practice and in
concrete code-driven examples. It also tries to give open questions and opportunities
that might be interesting for other people to work on.

How to try this out? Because I want you to try this out. It's in my Parsl
``benc-observability`` branch. I will try to label use cases as expected to work or
not, and in what context. There will also be some Academy-related stuff in that
branch, with the intention that it moves elsewhere as it is productionised.

What exists in Parsl now?
-------------------------

Parsl has two observability approaches: file-based logging and Parsl Monitoring.

File-based logging is very loosely structured. Log lines are intended for direct human
consumption, with minimal automated processing: for example, "grepping the logs".
Within Parsl there is a variety of log formats, usually depending on the component
which generated the log. Logs are directed to a filesystem accessible by the
particular component, which in practice makes them especially awkward to work with
without a shared file system.
It is easy to add a new log line, by writing what is effectively a glorified ``print``
statement.

.. index:: Parsl monitoring; database manager

Parsl Monitoring generates monitoring events that are not intended to be seen by
humans. Instead they are conveyed to a Monitoring Database Manager which munges them
into a relational schema. This schema is then typically accessed by users via further
processing for pre-prepared visualization, or via ad-hoc queries designed by
data-science-aware power users. These monitoring events can be conveyed by a pluggable
interface, for example over the network, and in contrast to the logging approach
above, distributed filesystem access (in the broadest sense) is not required. The
strict SQL schema makes the data model extremely hard to extend ad-hoc.

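As a sketch of the contrast (illustrative shapes only - neither the log line nor the
table layout is copied from real Parsl output or the real monitoring schema), the same
underlying fact might appear once as a human-readable line and once as a relational
row::

    # file-based logging: a human-readable line, effectively a glorified print
    2025-10-26 10:13:01 parsl.dataflow.dflow INFO: Task 7 launched

    # Parsl Monitoring: a similar fact munged into a relational schema
    INSERT INTO status (task_id, task_status_name, timestamp)
    VALUES (7, 'launched', '2025-10-26 10:13:01');
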
Parsl Monitoring was also implemented with a fixed queries / dashboard mindset: one
set of views that is expected to be sufficient. As time has shown, people like to make
other outputs from this data.

This report builds on both of these approaches. I'll talk about more details in later
sections.

Diagram
-------

of the components/flow.

to distinguish the pieces of my work, and also to distinguish the pieces of what might
be substituted where.

specific emphasis that these are common techniques, not a single implementation or
protocol standard or single anything.

::

    Python logger API ----> JSON structured logs      \
                                                       |--> log movement       --> Python-based query model --> graphs/reports
                            non-JSON structured logs  /     to one place       --> post facto schema normalisation
                            (eg. WQ, Parsl monitoring)      classically files, --> data-structure based queries
                                                            but eg ryan/kafka
                                                            logan demo/agent polling

Concept: Universal personal logging
-----------------------------------

Imagine *all* of my academy/globus endpoint/parsl/application runs going into a single
log space. permanently. no matter what the project, location, etc.

here's a logging use case/notion: universal personal log. all my GC endpoints, parsl
runs, academy runs, application submits, go into a single log space that has
everything I am running everywhere in the science cloud, by default - eg. identified
by my globus credential ID. no separation whatsoever. no project distinction, etc.

what does that look like to work with on the query side? what does that look like to
query?

Target audience
---------------

This project is mainly aimed at systems integrators and application builders who are
expecting to perform serious debugging and profiling work at a deep technical level.
It should support other use-cases such as management-friendly dashboards.

These users will often have integrated several research-quality projects: for
example, Academy submitting into Globus Compute.
As systems integrators and application builders, they aren't directly interested in
the borders these individual projects have built around themselves, but want to
understand (for example) where their simulation task is inside the whole ad-hoc
stack. This mirrors the microcosm of Parsl existing as a pile of configurable and
pluggable components, each with their own observability options.

Many of the target audience do not, in my experience, come asking directly for
observability as a feature. Instead they come with questions such as "Parsl is slow -
how can I make it faster?". Without understanding the statement "Parsl is slow"
(which as often as not turns out to be "my application code is slow"), it is hard to
make progress on "how can I make it faster?"

Modularity
----------

This report emphasises modularity as a core tenet, to the extent that a single
product codebase is not particularly an end goal.

Modularity as a requirement for a rich research landscape
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A "rich research landscape" means many components, each with competing priorities for
production-worthiness vs research.
TODO: cite the work done as part of NSF/CZI sustainability grants about recognising
the difference between goals rather than ignoring them.

Expecting a single observability system to provide all needs is unlikely to succeed
in such a varied research-style environment: while some users are tolerant of
appalling code quality in exchange for interesting research results, those same users
require production quality from other components in the same stack; and those
tolerances vary with every use case.

Hourglass model with several waists
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

the hourglass model is intended to provide a small number of plugin/integration
points, in the same way that the Internet Protocol does for applications/application
protocols vs networking technologies (for example, implementing HTTP over the mobile
phone network and telnet over ARPANET is then sufficient integration for telnet over
mobile phone and HTTP over ARPANET to run without more work).

The hourglass waists are:

.. index::
   single: Python; logging
   single: C programming language
   single: Programming languages; C

* python ``logging``
  system: any Python code can send log messages to the built-in ``logging`` system
  and any Python code can register to receive any log messages. This can support
  components that live in the Python ecosystem. That includes enough of the current
  ecosystem to consider specially, but not enough to be universal: for example, when
  running tasks through Parsl's Work Queue executor, a substantial piece of execution
  happens in code written in the C programming language.

* JSON records: a second point of modularity is representing observability
  information as JSON objects. This is a flexible data format which complements the
  Python code approach of the previous waist. Often observability information which
  cannot flow through the Python ``logging`` API can flow as JSON records. A sketch
  of how the two waists compose follows this list.

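As a minimal sketch of how the two waists compose (my own illustration, not code from
the prototype branch; the choice of fields is an assumption), any component logs
through the stdlib ``logging`` API and a handler flattens each LogRecord into one
JSON object per line::

    import json
    import logging

    # attribute names present on every LogRecord by default; anything else was
    # attached at a log site via extra= and belongs in the wide event
    _DEFAULT_KEYS = set(logging.makeLogRecord({}).__dict__)

    class JSONLineHandler(logging.Handler):
        """Emit each LogRecord as a flat JSON object on one line."""

        def emit(self, record: logging.LogRecord) -> None:
            event = {
                "created": record.created,      # unix timestamp
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            }
            for key, value in record.__dict__.items():
                if key not in _DEFAULT_KEYS:
                    event[key] = value
            print(json.dumps(event, default=str))

    root = logging.getLogger()
    root.addHandler(JSONLineHandler())
    root.setLevel(logging.INFO)
    logging.getLogger("demo").info("task state change", extra={"task_id": 7})
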
High level structure of this project
------------------------------------

This report breaks observability into four rough parts:

* A data model of wide event records: :ref:`datamodel`

* Creating wide records: :ref:`creating`

* Moving those event records around: :ref:`moving`

* Analysing those records: :ref:`analysing`

.. _datamodel:

The data model
==============

Introduction to wide events
---------------------------

as JSON objects, as Python LogRecords, as roughly isomorphic structures.

wide in the style of denormalised data warehouses, rather than heavily normalised
like the more traditional relational model.

they should be wide and flat: do not create elaborate object graphs. key/value, with
values being simple.

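For example, a wide, flat event for a task state change might look like this (every
field name here is invented for illustration): everything known at the log site is
attached as simple key/value pairs, rather than referenced through an object graph::

    {
      "created": 1761473581.342,
      "message": "task state change",
      "parsl_run_id": "5f0c6a1e-58d4-4f55-9e1a-3c2b7d9e0a11",
      "parsl_task_id": 7,
      "parsl_try_id": 0,
      "task_state": "launched",
      "hostname": "login01",
      "pid": 41172
    }
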
the data is often ad-hoc: people are writing code to run tasks, not building data
models to represent the observable state of their tasks. so don't bake that into the
system too much, and expect to be flexible.

What exists now: Parsl python logs vs Parsl monitoring records
--------------------------------------------------------------

Especially for this chapter: how both of those can be embedded as wide events.

Parsl monitoring records are equally valid examples of existing structured records,
alongside logging's differently structured records, and of equal value.

The original Parsl monitoring prototype was focused on what is happening with Parsl
user-level concepts: tasks and blocks, for example, as they move through simple
states. Anything deeper is part of the idea of "Parsl makes it so you don't have to
think about anything happening inside". Which is not how things are in reality:
neither for code reliability nor for performance.

often want to debug/profile what's happening *inside parsl* rather than *inside the
user workflow* - and the distinction between the two is often unclear.

.. _partialdata:

Optional and missing data in observability
------------------------------------------

log levels - INFO vs DEBUG.

missing log files - eg. start with parsl.log, add in more files for more detail.

Security - not so much the Parsl core use case, but eg. GC executor vs GC endpoint
logs vs GC central services have different security properties. Or in Academy, the
hosted HTTP exchange.

The observability approach needs to accommodate that, for any/all reasons, some
events won't be there. There can't be a "complete set of events" to complain about
being incomplete.

With less data, the reports, in whatever form, are less informative, to the extent
that the lack of data makes them so.

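On the query side, that means treating every field and every event as optional. A
small sketch (assuming events are Python dicts, with invented field names): pair up
start and end events per task, and silently skip tasks where either half is missing,
rather than treating the gap as an error::

    def durations_by_task(events):
        """Compute per-task durations from whichever events happen to exist."""
        starts, ends = {}, {}
        for e in events:
            task = e.get("parsl_task_id")   # this field may be absent entirely
            if task is None:
                continue
            if e.get("event") == "task_start":
                starts[task] = e["created"]
            elif e.get("event") == "task_end":
                ends[task] = e["created"]
        # only tasks observed at both ends produce a duration
        return {t: ends[t] - starts[t] for t in starts.keys() & ends.keys()}
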
This optionality aligns with different components adding their own logs, if they
happen to be there.

Adding (or removing) a log field is a lightweight operation.

Data types
----------

data types don't matter much for human observation. but for machine processing they
do. so this section has some relevance when thinking about the Analysis section
later.

Both the JSON and Python representations can support a range of types. But richer
data types can exist. What's interesting for analysis is primarily relations like
equality and ordering, which are implicitly not string-like in a lot of cases. For
example:

  * string vs int: '0' vs 0 as a Parsl task ID. Even Parsl source code is not
    entirely clear on when a task ID is a string and when it is an int, I think.
    Normalisation example: "03" vs "3". Rundir IDs are usually at least 3 digits
    long. Is rundir "000" the same as rundir 0?

  * UUID as a 128-bit number, vs UUID as a case-sensitive/padding-sensitive
    ASCII/UTF-8 string.
    uuids should be used *more* in this work - they were invented for the purpose of
    this kind of distributed identification:
    https://en.wikipedia.org/wiki/Universally_unique_identifier#History
    https://www.rfc-editor.org/rfc/rfc9562.html
    Normalisation example of string form: case differences.

  * ordinal relations of text-named log levels (WARN, WARNING, INFO, ERROR, ...) in
    various enumerations (although for querying, an overarching schema is maybe
    possible for read-only ordering use)

What's the right canonicalisation attitude here? open question for me.

Cannot expect emitters to conform to some defined canonical form.

perhaps I should use rewrite_by_lambda(logs, keyname, lambda) to change known fields
into a suitable object representation: int for some task IDs, UUIDs, log levels?
then the type system deals with it? some of those could happen in the importers,
some in the query, as it's modular. I already do a rewrite to shift the created time
down to 0 base, in one of my plots. So the notion is there already.

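A sketch of that hypothetical helper (``rewrite_by_lambda`` and the field names here
are assumptions, not code from the branch)::

    import uuid

    def rewrite_by_lambda(events, keyname, fn):
        """Yield events with events[keyname] rewritten by fn, where present."""
        for e in events:
            if keyname in e:
                e = {**e, keyname: fn(e[keyname])}
            yield e

    events = [{"parsl_task_id": "03"},
              {"run_uuid": "5F0C6A1E-58D4-4F55-9E1A-3C2B7D9E0A11"}]

    # normalise string task IDs ("03" vs "3") to ints, and string UUIDs (case and
    # format variants) to uuid.UUID objects, so that equality and ordering behave
    # as intended rather than string-wise:
    events = rewrite_by_lambda(events, "parsl_task_id", int)
    events = list(rewrite_by_lambda(events, "run_uuid", uuid.UUID))
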
Distributed state machines - parsl issue #4021
----------------------------------------------

"distributed" state machines are hard. see parsl-visualize bug #4021 for an example.

don't try to make the data model force this. the events happen when they happen.
handle it on the query/processing side: you make whatever sense of it that you can.
it's not the job of the recording side to force a model that isn't true.

for example, in the context of #4021, we might want to project some external opinion
that running should "override" launched for a task, that isn't reflected in the
emitting code/emitting event model at all, based on an artificial "force to single
thread" concept of task execution. in the same vein, the #4021 suspect tasks have
*negative* time in launched state. which sounds very weird for a non-distributed
state machine model.

or we might only want to visualize the running/end-of-running times and forget
overlaying any other state model onto things: the running and running_ended times
should at least be consistent wrt each other, as they happen to come from a
single-threaded bit of code.

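A query-side sketch of "making whatever sense of it that you can" (event and field
names are assumed): report the interval between two observed states without asserting
any legal ordering, so a negative interval is reported as-is rather than rejected::

    def time_in_state(events, task_id, state, next_state):
        """Interval between two observed states of one task, if both were
        observed. May legitimately be negative: the emitting side is
        distributed, and nothing forces these timestamps to be ordered."""
        times = {e["state"]: e["created"]
                 for e in events
                 if e.get("parsl_task_id") == task_id and "state" in e}
        if state in times and next_state in times:
            return times[next_state] - times[state]
        return None   # partial data: one or both events never arrived
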
see also that the parsl TaskRecord never records a state of running or running_ended,
despite those being valid states. this already only exists elsewhere in the system as
a reconstructed state machine - not a real single-threaded/non-distributed state
machine. so parsl monitoring is already a demonstration of the violation of this
model.

commercial observability vendors
--------------------------------

.. index::
   single: Cloudwatch
   single: Honeycomb

honeycomb, or built into AWS as cloudwatch.

more ad-hoc construction, less buy-in from components, rather than all working
together to build a single platform, which is often how the commercial observability
use cases are described.

.. index:: OpenTelemetry

OpenTelemetry as a standard. How does this relate to that standard?

TODO: maybe opentelemetry is better in :ref:`moving`

The argument for templating log messages
----------------------------------------

previous argument: avoids string interpolation if the message will be discarded.

new argument: we can use the template to find "the same" log message even when its
interpolations vary.

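With stdlib ``logging``, %-style calls already keep the template separate from its
interpolations: ``record.msg`` holds the uninterpolated template, so it can be used
to group "the same" message. A minimal sketch::

    import logging
    from collections import Counter

    class TemplateCounter(logging.Handler):
        """Count occurrences of each log template, ignoring interpolations."""
        def __init__(self):
            super().__init__()
            self.counts = Counter()

        def emit(self, record):
            self.counts[record.msg] += 1   # the template, not the rendered text

    counter = TemplateCounter()
    logger = logging.getLogger("demo")
    logger.addHandler(counter)
    logger.setLevel(logging.INFO)
    logger.propagate = False

    for task, state in [(1, "launched"), (2, "launched"), (1, "running")]:
        logger.info("Task %s changed state to %s", task, state)

    # all three calls share one template:
    assert counter.counts["Task %s changed state to %s"] == 3
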
advanced level question: when a task changes state, should the template be
interpolated with the state names or not? because in my example query, it is relevant
to see those changes *not* templated away.

Objects and spans
-----------------

it is a "thing we want to talk about", as a very weak notion.

a weak notion of "object", but it does exist: for example, logs that are about a
particular Parsl task, or a particular HTEX worker, or a particular batch job.

multi-attribute keys - sometimes hierarchical, but that isn't required.

eg. contrasting parsl task IDs vs parsl checkpoint IDs: in the parsl checkpoint
world, tasks are identified by their hashsum. there might be many tasks that run to
compute that result. when doing cross-DFK checkpointing, the cross-DFK sort-of-task
ID is that hash sum, in the sense of correlating tasks that are elided due to
memoization with their original exec_done task in another run. (so hashsum is one
form of task ID, parsl_dfk/parsl_task_id is another - both legitimate but both
different)

cross-ref the "span" concept from other places in broader Observability.

.. index:: Entity component system

Compare with https://en.wikipedia.org/wiki/Entity_component_system which has
identified entities, where the entity only has identity and no other substance,
along with components as "characterising an entity as having a particular aspect"
and the "system" which deals with all the entities having particular components.
This fits the partial data model fairly well, I think: the notion of identifying
entities, without those entities having any further structure; and then the data you
might expect to find about certain entities being orthogonal to that.

Entity keys
-----------

This is probably the fundamental problem of JOIN here, compared to traditional
observability which passes request IDs around up front.

In a traditional distributed object model system, you would use something like UUIDs
everywhere. However, this observability work is not observing a traditional
distributed object model system.

note that in parsl some IDs are deliberately not known across the system at runtime,
because it would be expensive to correlate them in realtime, and that is not
necessary for the executing-tasks part of Parsl, even though it's necessary for the
understanding-how-that-task-was-executed part.

Other components
----------------

rewrite this to not be parsl-centric, but instead talk about integrating "objects"
from different components even though those components are not strongly aware of
each other.
wq vs parsl task id is a nice example: regularly used, but the log files are
different formats, the identifier space is different, the cardinality of tasks is
different: one parsl task != one wq task.

Some components aren't Parsl-aware: for example, Work Queue has no notion of a Parsl
task ID. and it runs its own logging system, which is not Python, and so not
amenable to Python monitoring radios. a sketch of correlating the two identifier
spaces follows below.

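A sketch of what that correlation could look like on the query side (all event shapes
and field names are assumptions): the key ingredient is a binding event, emitted at
submission time by the one component that knows both identifiers::

    def correlate(parsl_events, wq_events, bindings):
        """Group Parsl-side and Work Queue-side events by Parsl try.

        Each binding is an event like
        {"parsl_try": ("run1", 7, 0), "wq_task_id": 131}
        emitted by the component that knows both identifiers.
        """
        wq_to_try = {b["wq_task_id"]: b["parsl_try"] for b in bindings}
        by_try = {}
        for e in parsl_events:
            by_try.setdefault(e["parsl_try"], []).append(e)
        for e in wq_events:
            t = wq_to_try.get(e.get("wq_task_id"))
            if t is not None:       # unknown WQ tasks are partial data: skip
                by_try.setdefault(t, []).append(e)
        return by_try
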
.. index:: ZMQ

ZMQ generates log messages which have been useful sometimes, and these could be
gatewayed into observability.

inherently chaotic research prototypes can benefit from observability - as part of
building and debugging them, rather than as a post-completion 2nd-generation
feature - but that is impeded by requiring a strict SQL-like data model to exist,
when the research prototype is not ready for that. (see the attitude that monitoring
is something aimed at "users" later on, not something that is aimed at "developers"
understanding the behaviour of what they have created)

"user" applications adding their own events, and expecting those events to be
correlatable with everything else that is happening, is part of the model: just as
we might expect Globus Compute endpoint logs to be correlatable with parsl htex
logs, even though Globus Compute is a "mere" user of Parsl, not a "real part" of
Parsl.

it should be easy to add other events - the core observability model shouldn't be
prescriptive about what events exist or what they look like, even though someone
needs to know what their structure is.

to that end, there is no core schema, either formal or informal.

observability records do not even need to have a timestamp, in the sense of a log
message. for example, see relations imported from a relational database into
observability records, in the parsl monitoring import (crossref use case about
plotting from the monitoring database).

Parsl contains two "mini-workflow-systems" on top of core Parsl: parsl-perf and
pytest tests. It could be interesting to illustrate how those fit in without being a
core part of Parsl observability.

Parsl monitoring visualisation and Parsl logging are both completely unaware of the
application-level structure of the mini-workflows run by parsl-perf and pytest,
beyond what is expressed to the DFK as DAG fragments: there's nothing to separate
out parsl-perf iterations, or pytest tests.

In the context of pytest, see: :ref:`pytest-observes-logs`

.. index:: colmena

Colmena

.. _creating:

Generating wide records
=======================

What exists now
---------------

Parsl
~~~~~

Parsl generates a lot of observability-style events, but spread across many
different formats.

Parsl logs - not well structured: for example, overlapping DFKs are not well
represented, and actions to do with different tasks can be interleaved without being
clearly separated/identified.
Parsl logs - not well structured: for example, overlapping DFKs are not well represented, and actions to do with different tasks can be interleaved without being clearly separated/identified. Globus Compute deliberately makes them even less structured, by jumbling the file-based logs of multiple runs into one directory.

Parsl monitoring - well structured but very hard to modify. It is easy to query for questions that it can answer, and hard to use for anything more. Users are generally interested in using it when they discover it, but it suffers from a history of being built as student demo projects dropped into production. An example of a question it cannot answer: what is the ``parsl_resource_specification`` for a task?

.. index:: Work Queue

Work Queue logs - well structured, but ignored by Parsl monitoring and hard to correlate with monitoring.db: Work Queue uses work queue task IDs, but Parsl monitoring uses Parsl task and try IDs. Correlating those is a motivating use case for this observability project. [TODO: make that correlation an explicit use-case section, explaining what needs to change and how it is done manually now]

Academy
~~~~~~~

As a more in-development project, Academy is much better placed to make observability records from the start, as a first-order production feature.

New Python Code for log generation
----------------------------------

Acknowledging observability as a first-order feature means we can make big changes to code.
Every log message needs to be visited to add context. In many places much of that context can be added by helpers: for example, in my prototype, some module-level loggers are replaced by object-level loggers. There is a per-task logger (actually a LoggerAdapter) in the TaskRecord, and logging to that automatically adds on relevant DFK and task metadata: at most log sites, the change to add that metadata is to switch from invoking methods on the module-level logger object to invoking them on the new task-level logger instead.
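To make the LoggerAdapter idea concrete, here is a minimal sketch - the function name and the metadata keys are illustrative, not the prototype's actual API:

.. code-block:: python3

    import logging

    logger = logging.getLogger(__name__)

    def make_task_logger(dfk_id: str, task_id: int) -> logging.LoggerAdapter:
        # Wrap the module-level logger so that every record logged through
        # the adapter automatically carries DFK and task identifiers.
        return logging.LoggerAdapter(
            logger, {"parsl_dfk": dfk_id, "parsl_task_id": task_id})

    # At a log site, the only change is which logger object is invoked:
    task_logger = make_task_logger("some-dfk-uuid", 72)
    task_logger.info("Task launched")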
Some log lines bracket an operation, and to help with that, my prototype introduces a LexicalSpan context manager which can be used as part of a `with` block to identify the span of work starting and ending.

Move away from forming ad-hoc string templates and make log calls look more machine-readable. This is somewhat stylistic: with the task ID automatically logged, there is no need to substitute the task ID into some arbitrary subset of task-related logs.

TODO: describe the academy style that I tried out in PR #NNN:

.. code-block::

    extra=myobj.log_extra() | { "some": "more" }

Parsl config hook for arbitrary log initialization - actually it can do "anything" at process init, and maybe that's interesting from a different perspective (because it's a callback/plugin), but from the perspective of this report I don't care about non-log uses.

Be aware that there are non-Python bits of code generating various logs. Work Queue (still structured) logs are one example. The output from batch command submit scripts is another, less structured one, which looks much more like a traditional chaotic output file.

Python API on logging side
~~~~~~~~~~~~~~~~~~~~~~~~~~

Use the Python-provided logger interface, with the Python-provided ``extra`` API.

A per-class "log extras" method generates an ``extra`` dict about an object. That pushes (like ``repr``) towards it being the responsibility of the object to describe itself, rather than being someone else's responsibility.
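A minimal sketch of that convention, with a hypothetical class - only the ``log_extra`` name and the ``extra`` merge style come from this report:

.. code-block:: python3

    import logging

    logger = logging.getLogger(__name__)

    class BlockRecord:
        # Hypothetical object for illustration.
        def __init__(self, executor_label: str, block_id: str):
            self.executor_label = executor_label
            self.block_id = block_id

        def log_extra(self) -> dict:
            # Like __repr__, the object describes itself, but as
            # structured key/value pairs for the logging ``extra`` API.
            return {"executor": self.executor_label,
                    "block_id": self.block_id}

    block = BlockRecord("htex_local", "0")
    logger.info("Scaling out", extra=block.log_extra() | {"some": "more"})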
anonymous/temporary identified python objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Python objects don't have a global-over-time ID. id() exists, but it is reused over time, so it is awkward to use across a whole series of logs. So some objects should get a UUID "just" for observability - UUIDs were invented for this.

Likewise, for example, gc endpoints don't have a DFK ID, and endpoint id/executor/block 0 isn't a global-over-time ID: there's a new block 0 at every restart? Or is there a unique UEP ID each time that is enough? I don't think so, because I see overlapping block-0 entries.

repr-of-ID-object might not be the correct format for logging: I want stuff that is nice strings for values, but repr (although it is a string) is designed more to look like a Python code fragment than like the core value of an object. Maybe ``str`` is better, and maybe some other way of representing the ID is better? The point is to have values that work well in aggregate, database-style analysis, not values that are easy on the human eye.

Contributed: Modifying academy to generate wide events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

summarise the PRs I merged already

cross-ref the event graph in the analysis section as something enabled by this

Translating non-wide-event sources
----------------------------------

Part of this modularity work is that some modules produce event-like information that looks superficially very different, but that can be understood through the lens of structured event records.

Using Parsl monitoring events as wide logs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

two approaches:

monitoring.json: abandons the SQL database component of conventional Parsl monitoring and instead writes each monitoring message out to a json file, giving an event stream.
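The resulting file is easy to consume with nothing more than the standard library - a minimal sketch, with the path under the run directory assumed:

.. code-block:: python3

    import json

    # One JSON object per line: parse the stream back into a list of
    # event dicts for ad-hoc analysis.
    with open("runinfo/000/monitoring.json") as f:
        events = [json.loads(line) for line in f]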
replay-monitoring.db: turns a monitoring.db file into events. The status, resource and block tables already look like an event stream. This gives an easy way to take existing runs and turn them into event streams without needing to opt in to any of the other JSON logging, or changing anything at all at runtime: anything new is entirely post-facto, which fits the general concept of doing things post-facto in parsl observability.

The infrastructure for this already exists, which means that the query side of this project can be used without modification of the execution-side Parsl environment.

see earlier use case on priority visualization

Using Work Queue ``transaction_log`` as a wide log source
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. index:: Work Queue

This is a core part of seeing beyond the pure-Parsl code. It's well structured but not JSON. Translation into JSON is mostly syntactic, and can be done line-by-line, aka streaming - see the sketch at the end of this section.

TODO: example log line

Note that work queue task IDs are not Parsl task IDs: data from the monitoring database cannot be correlated with data from the work queue transaction log! (without further help from the parsl JSON log files...)
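As a sketch of the line-by-line, streaming shape of such a translator - the real ``transaction_log`` column layout is not reproduced here, so the field names below are placeholders:

.. code-block:: python3

    import json
    import sys

    # Read a transaction_log-style file on stdin and emit one JSON
    # object per line on stdout.
    for line in sys.stdin:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment/header lines
        fields = line.split()
        print(json.dumps({"wq_time": fields[0],
                          "wq_event": fields[1],
                          "wq_args": fields[2:]}))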
Adventure: adding observability to a prototype: idris2interchange
-----------------------------------------------------------------

.. index::
   idris2
   High Throughput Executor; interchange
   Adventure; Adding observability to the idris2 interchange
   Programming languages; idris2

Another example: swap out the interchange implementation for a different one with a different internal model: a schema of events for task progress through the original interchange doesn't necessarily work for some other implementation.

idris2interchange - I want to debug stuff, not be told by the observability system "HAHA we don't support your prototyping". In some sense that's exactly the time I *need* the observability system to be helping me, not later on when it all works.

The idris2interchange project is not aimed at producing production code. *ever*. In that sense it is very similar to some student projects that interact with parsl.

mini-journal: what did I have to do to support idris2 logging?

* make log records JSON format instead of textual - the prior format was timestamp / string. There's a json library, but to start with these records are so simple I'll template them in.

* there was also already a simple log-of-value mechanism in there, which readily translates to logging a template, a full message, and the value as separate fields.

Now there are json records going to the console. I don't trust the string escaping, but I'll deal with that ad-hoc. But also: this needs to go to a file, and if I want it to interact with other log files, I need some common keys. htex_task_id is the obvious one there for task correlation. Manager ID is another.

To go to a file: lazy redirect of stdout to idris2interchange.log. This could be done more seriously, to avoid random prints going to the file, but this is a prototype so I don't care.

.. index:: tool; jq
Various iterations of }(hjh&hh'Nh)Nubj )}(h`jq`h]hjq}(hj4h&hh'Nh)Nubah}(h]h]h]h]h!]uh%j hjubh% vs formatting fixes to work towards }(hjh&hh'Nh)Nubj )}(h`jq`h]hjq}(hjFh&hh'Nh)Nubah}(h]h]h]h]h!]uh%j hjubh believing this is valid.}(hjh&hh'Nh)Nubeh}(h]jah]h]h]h!]uh%hh'h(h)Mhj]h&hh}h}jjsubh )}(h@code: parse error: Invalid numeric literal at line 1, column 11h]h@code: parse error: Invalid numeric literal at line 1, column 11}hj`sbah}(h]h]h]h]h!]h#h$uh%h hj]h&hh'h(h)Mubh)}(hX:That log escaping, which i implemented pretty quickly, seems to make logging extremely slow - especially outputting the pickle stack which is actually quite a big representation when it has a manager registration with all my installed python packages in there. but hey thats what log levels/log optionality is for.h]hX:That log escaping, which i implemented pretty quickly, seems to make logging extremely slow - especially outputting the pickle stack which is actually quite a big representation when it has a manager registration with all my installed python packages in there. but hey thats what log levels/log optionality is for.}(hjnh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hh'h(h)Mhj]h&hubh)}(hXILet's do some scripting to figure out which of these lines is so expensive - based on line length. one line is 49kb long! (its repeating the full pickled task state rather than a task id!). and similar with manager IDs. but this is probably the sort of changes I'll be needing to make to tie stuff in with other log files anyway.h]hXMLet’s do some scripting to figure out which of these lines is so expensive - based on line length. one line is 49kb long! (its repeating the full pickled task state rather than a task id!). and similar with manager IDs. but this is probably the sort of changes I’ll be needing to make to tie stuff in with other log files anyway.}(hj|h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hh'h(h)Mhj]h&hubh )}(hcode: with open("pytest-parsl/parsltest-current/runinfo/000/htex_local/idris2interchange.jsonlog", "r") as f: ls = f.readlines() ls.sort(key=lambda l: len(l)) print(ls[-1]) print(f"with size {len(ls[-1])} chars")h]hcode: with open("pytest-parsl/parsltest-current/runinfo/000/htex_local/idris2interchange.jsonlog", "r") as f: ls = f.readlines() ls.sort(key=lambda l: len(l)) print(ls[-1]) print(f"with size {len(ls[-1])} chars")}hjsbah}(h]h]h]h]h!]h#h$uh%h hj]h&hh'h(h)Mubh)}(hThis log volume has been a problem for me elsewhere, even without structured logging, filling up eg my root filesystem with docker stdout logs.h]hThis log volume has been a problem for me elsewhere, even without structured logging, filling up eg my root filesystem with docker stdout logs.}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hh'h(h)Mhj]h&hubh)}(h Now back to ``jq`` validation...h](h Now back to }(hjh&hh'Nh)Nubjn)}(h``jq``h]hjq}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%jmhjubh validation…}(hjh&hh'Nh)Nubeh}(h]h]h]h]h!]uh%hh'h(h)Mhj]h&hubh)}(hif i get that done... look for every logv call and report each one and how many times it logged a value. this is in the direction of logging metrics, without actually being that.h]hif i get that done… look for every logv call and report each one and how many times it logged a value. 
If I get that done... look for every logv call and report each one and how many times it logged a value. This is in the direction of logging metrics, without actually being that.

A pytest run now gives 92000 idris2interchange log lines.

And now jq accepts it all.

So let's see if parsl.observability.load_jsons can load it. It can, without further change.

Logs that have a v:

.. code-block:: python3

    import parsl.observability.getlogs as gl
    logs = gl.load_jsons("pytest-parsl/parsltest-current/runinfo/000/htex_local/idris2interchange.jsonlog")
    vals = [l for l in logs if 'v' in l]

.. code-block:: python3

    >>> vkeys = {l['msg'] for l in vals}
    >>> len(vkeys)
    52
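Counting how many times each of those 52 message templates logged a value is then a one-liner - a sketch, continuing from the ``vals`` list above:

.. code-block:: python3

    from collections import Counter

    # How many times did each logv call site log a value?
    per_template = Counter(l['msg'] for l in vals)
    print(per_template.most_common(3))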
(}(hjh&hh'Nh)Nubjn)}(h```widen_implication```h]h`widen_implication`}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%jmhjubh\ or some functional-dependency related name?). in this case, for the interchange log file, }(hjh&hh'Nh)Nubjn)}(h```submit_pass_id```h]h`submit_pass_id`}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%jmhjubh => }(hjh&hh'Nh)Nubjn)}(h```htex_task_id```h]h`htex_task_id`}(hjh&hh'Nh)Nubah}(h]h]h]h]h!]uh%jmhjubh#, or if doing so at a higher level }(hjh&hh'Nh)Nubjn)}(h1```(dfk,executor,submit_pass_id)=>htex_task_id```h]h-`(dfk,executor,submit_pass_id)=>htex_task_id`}(hj h&hh'Nh)Nubah}(h]h]h]h]h!]uh%jmhjubeh}(h]h]h]h]h!]uh%hh'h(h)Mmhj]h&hubh)}(hTODO: show task1 output before join. then implement join and show task1 output with the rest of the decode span in there - the deserialisation of the task and execution of the matchmaker is shown now.h]hTODO: show task1 output before join. then implement join and show task1 output with the rest of the decode span in there - the deserialisation of the task and execution of the matchmaker is shown now.}(hj4h&hh'Nh)Nubah}(h]h]h]h]h!]uh%hh'h(h)Mqhj]h&hubh)}(h2TODO: add in result handling span in the same way.h]h2TODO: add in result handling span in the same way.}(hjBh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hh'h(h)Mshj]h&hubh)}(hXLwidening submit_pass_id using key implication widening after loading/processing all the logs normally, which is what I'd expect if adding in ad-hoc hack stuff outside of the core parsl log loaders... has revealed some fixpoint related stuff: widening to htex_task_id which is the actual known ID isn't sufficient because the widening of htex_task_id to parsl_task_id already happened. I can widen to parsl_task_id OK because that implication has happened on the two log lines that already have an htex task ID. Is that ok in general? do I need fixpoints in general? something to keep an eye on. I think: as long as there is one record to convey the join as having happened, then a subsequent join can flesh that out. but if the join involves facts that aren't represented incrementally like that, then no. probably I can contrive some examples.h]hXRwidening submit_pass_id using key implication widening after loading/processing all the logs normally, which is what I’d expect if adding in ad-hoc hack stuff outside of the core parsl log loaders… has revealed some fixpoint related stuff: widening to htex_task_id which is the actual known ID isn’t sufficient because the widening of htex_task_id to parsl_task_id already happened. I can widen to parsl_task_id OK because that implication has happened on the two log lines that already have an htex task ID. Is that ok in general? do I need fixpoints in general? something to keep an eye on. I think: as long as there is one record to convey the join as having happened, then a subsequent join can flesh that out. but if the join involves facts that aren’t represented incrementally like that, then no. probably I can contrive some examples.}(hjPh&hh'Nh)Nubah}(h]h]h]h]h!]uh%hh'h(h)Muhj]h&hubeh}(h]?adventure-adding-observability-to-a-prototype-idris2interchangeah]h]Aadventure: adding observability to a prototype: idris2interchangeah]h!]uh%h*hj h&hh'h(h)Mubh+)}(hhh](h0)}(h4Performance measurement of patch stack on 2025-10-27h]h4Performance measurement of patch stack on 2025-10-27}(hjih&hh'Nh)Nubah}(h]h]h]h]h!]uh%h/hjfh&hh'h(h)MyubjL)}(h_pip install -e . && parsl-perf --config parsl/tests/configs/htex_local.py --iterate=1,1,1,10000h]h_pip install -e . 
Performance measurement of patch stack on 2025-10-27
-----------------------------------------------------

.. code-block::

    pip install -e . && parsl-perf --config parsl/tests/configs/htex_local.py --iterate=1,1,1,10000

Running parsl-perf with constant block sizes (to avoid queue-length speed changes):

- master branch (165fdc5bf663ab7fd0d3ea7c2d8d177b02d731c5): 1139 tps

- more-task-tied-logs: 1024 tps

- json-wide-log-records: 537 tps

  - but without initializing the JSONHandler: 1122 tps

- end of branch with all changes up to now: 385 tps

Idea: Parsl resource monitoring on a host-wide basis
-----------------------------------------------------

.. index:: idea; host-wide monitoring

Ignore Parsl Monitoring per-task resource monitoring and do something else that generates similar observability records. There was always some disappointment with getting WQ resource monitoring into the Parsl monitoring database: what exists there that could be imported?
Likewise, host-wide stuff doesn't fit well into the current Parsl Monitoring model, but might fit better into an observability model.

Idea: worker node dmesg
-----------------------

.. index::
   idea; kernel events
   OOM Killer
   dmesg

Especially for catching the OOM Killer and other interesting kernel events that affect processes without giving user-expected stack traces.

Is dmesg available to users on Aurora worker nodes?

``dmesg`` already outputs JSON, if run with the right parameter.
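For example, something like this - assuming a reasonably recent util-linux ``dmesg``, where ``--json`` wraps the records in a top-level ``dmesg`` array with a ``msg`` field per record:

.. code-block::

    dmesg --json | jq '.dmesg[] | select(.msg | test("[Oo]ut of memory"))'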
That should be an hour to prototype alongside workers. Cross-reference host-wide process monitoring as thematically related.

Idea: automatic instrumentation
-------------------------------

.. index::
   OpenTelemetry
   idea; automatic instrumentation

Projects like OpenTelemetry offer automatic instrumentation. That would be interesting to experiment with here.

.. _moving:

Moving wide records around
==========================

Events are generated in various places (for example, in application code running on HPC cluster worker nodes) and usually the user wants them somewhere else - for some kind of analysis, in the broad sense (:ref:`analysing`).

Concrete mechanisms for moving events around to a place of analysis should not be baked into the architecture. Realtime options, file-based options, ... - that is an area for experimentation, and this work should facilitate that rather than being prescriptive.

This chapter talks about different mechanisms that are on offer, and about how configuration might work. It tries to stay away from implementing much new mechanism, and instead tries to focus on integrating what already exists.

Comparison to Parsl logging
---------------------------

(also followed by Academy at time of writing)

The expectation is that you send your debugging expert a tarball of logs for them to pore over - this is extremely asynchronous, but a very effective way of moving those records around. It has a relatively low effect on performance behaviour: get the logs onto some filesystem while the performance-critical bit is running, and move them from there later on.

- poring over these logs "later" - there's no need for those logs to accumulate in real time in one place for post-facto analysis. And in practice, when doing log analysis rather than monitoring analysis, "send me a tarball of your runinfo" is a standard technique.

Async movement is much easier than synchronous/realtime movement.

Comparison to Parsl Monitoring
------------------------------
The transmission model is real-time. Even with recent radio plugins, the assumption is still that messages will arrive soon after being sent.

The almost-real-time data transmission model is especially awkward when combined with SQL: distributed system events will arrive at different times, or in the original UDP model perhaps not at all, and the "first" message that creates a task (for the purposes of the database) might arrive after some secondary data that requires that primary key to exist. Yes, it's nice for the SQL database to follow foreign key rules, especially when looking at the data "afterwards", but that's not realistic for distributed, unreliable events.

- UDP: sends UDP packets at the submit side. UDP is unreliable, so events *do* get lost. I think the assumption at implementation time was that UDP packet loss is just some thing your professor tells you about, but clearly doesn't happen.

- filesystem: needs a shared filesystem. One file = one monitoring event. If your filesystem is slow, which it often is, this is slow too.

- HTEX: the most efficient, but only works with the High Throughput Executor. Uses the existing HTEX result channel to send back monitoring events.

- ZMQ: this is over TCP. Like the UDP radio, it needs to be able to connect to the submit side. Probably better than UDP, although there's a TCP and ZMQ session initialization needed at the start of every task, because this radio does not persist connections across tasks.
  Unlike UDP, on the submit side this is yet another per-worker file descriptor in use, I think, which is a serious scalability limitation.

- Python multiprocessing: for sending monitoring events within the same cluster of Python `multiprocessing` processes: roughly the set of processes that were forked locally by the submit script using `multiprocessing`, so in htex: not the workers and not the interchange.

None of these are suitable for cloud-style environments where there is neither a shared filesystem nor a clean IP network. So I also prototyped an Academy radio for use with GC+Parsl - although I would rather use something like Octopus or Chronolog.

Python Configurability
----------------------

.. index::
   Configurability
   Python; Configurability

A soft start in Parsl is to let people opt in to observability-style logs - with most of the performance hit coming from turning on json output, I think, it doesn't matter too much performance-wise to add in the extra stuff on log calls.

The current parsl stuff is not set up for arbitrary log configuration outside of the submit-side process: for example, the worker helpers don't do any log config at all and rely on their enclosing per-executor environments to do it, which I think some do not.

htex interchange and worker logs have a hardcoded log config with a single debug boolean.
I'd like to do something a bit more flexible than adding more parameters, reflecting that in the future people might want to configure their handlers differently rather than using the JSONHandler.

.. index:: Chronolog

e.g. chronolog. pytest metrics observation in another section.

See the Parsl monitoring radios configuration model; start prototyping that. Note that it doesn't magically make arbitrary components that aren't compliant+Python redirectable. But that's fine in the modular approach.

See the existing initialize_logging, which allows arbitrary user configurability at the submit-process side, by getting parsl completely out of the way and allowing the user to run whatever code they want.

Adventure: Wide records stored as JSON in files
-----------------------------------------------

This prototype stores Parsl logs that have been sent into the Python ``logging`` system as JSON objects, one per line.

This is one of the initial use cases for the above configurability.

This was implemented as a straightforward Python logging `Handler`, similar to the existing log handlers, the difference being how the output line is formatted.
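A minimal sketch of the shape of such a handler - this is illustrative, not the prototype's actual JSONHandler:

.. code-block:: python3

    import json
    import logging

    class JSONLinesHandler(logging.FileHandler):
        def format(self, record: logging.LogRecord) -> str:
            # Standard fields, plus whatever arrived via the ``extra``
            # API: extra kwargs become attributes on the LogRecord, so
            # anything not present on a default record is extra context.
            standard = set(logging.makeLogRecord({}).__dict__)
            out = {"created": record.created,
                   "name": record.name,
                   "levelname": record.levelname,
                   "msg": record.getMessage()}
            out.update({k: v for k, v in record.__dict__.items()
                        if k not in standard})
            return json.dumps(out, default=str)

    logging.getLogger().addHandler(JSONLinesHandler("parsl.jsonlog"))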
The files are then moveable using traditional means: for example, the classic "send me a tarball of your run directory".

Moving in realtime
------------------

What does realtime mean in this case? It is mostly a case of what the ultimate consumers need, rather than any strong technical definition at this stage.

Inside the Python parts of Parsl, this data *is* available in realtime at the point of logging, as it goes to whatever LogHandler is running in each python process. That isn't true in general on the "event model" side of things, though.

Parsl already moves some event stuff around the network in realtime: that is the purpose of the monitoring radio system.

The following two sections, octopus and chronolog, will talk about doing that.

My initial log-related work was post-facto: copy log files around. But there are plenty of mechanisms that should be able to deliver and analyse live, e.g. built around Diaspora Octopus.

Adventure: Diaspora Octopus
---------------------------

.. index::
   Diaspora; Octopus
   Octopus
   Kafka
   Globus Hosted Services; Diaspora

This is an obvious follow-on to file-based JSON logs: the developers still kinda exist, and are friendly.

.. index::
   people; Ryan
   people; Haochen

with Ryan and Haochen

This turned into a monster debugging and restructuring session around Octopus reliability.

Ryan has a specific use case he's trying to implement, that I am helping him with:

    i mostly want to know when my agents perform their loop so i can hackily use this as a heartbeat to determine if my agents alive, and, when the agent decides to call the llm, i want to know the outcome of that call

    -- Ryan, on Slack
Idea: Chronolog
---------------

.. index::
   Chronolog
   people; Nishchay
   idea; Chronolog

Nishchay did some stuff here. I don't know what the state of it is.

https://grc.iit.edu/research/projects/chronolog/

.. _pytest-observes-logs:

Adventure: pytest observing interchange variables
-------------------------------------------------

.. index::
   pytest
   Python; pytest

The pytest htex task priority test wants to wait for the interchange to have all the submitted tasks - which happens asynchronously to submit calls returning. It does that by logfile parsing. How does that fit into this observability story? There's a metric in my prototype for this value (which I used in one of the other use cases here).

Can do this by re-parsing the interchange log value. Could also (with suitable configuration) attach a "pytest can see only metrics" log writer that runs over a unix socket? In some sense, that is injecting the relevant observability path into the interchange code as a configured log handler. That gives some motivation for the configurability section.

Also: attaching a JSON log file to the interchange, and having a tail reader of that; that also needs special configuration of the interchange, I think.
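A sketch of what that tail-reader side might look like in a test - the polling helper here is hypothetical, assuming the interchange writes one JSON object per line:

.. code-block:: python3

    import json
    import time

    def wait_for_metric(path, key, target, timeout=60):
        # Poll a JSON-lines log until some metric field reaches the
        # value the test is waiting for, instead of parsing free text.
        deadline = time.time() + timeout
        while time.time() < deadline:
            with open(path) as f:
                for line in f:
                    event = json.loads(line)
                    if event.get(key) == target:
                        return event
            time.sleep(0.5)
        raise TimeoutError(f"{key} never reached {target}")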
Adventure: Academy agents can report their own relevant logs via action
------------------------------------------------------------------------

A prototype I made for Logan, and also showed to Ryan.

This ties in with Ryan's Diaspora use case for examining what individual agents are up to.
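A minimal sketch of the shape of that prototype (not its actual code), using the Academy agent style that appears later in this report; the buffering handler and the ``report_logs`` action name are made up here:

.. code-block:: python

   import logging

   import academy.agent as aa

   # Buffer this process's log records in memory as they are emitted.
   _records: list[str] = []

   class _BufferHandler(logging.Handler):
       def emit(self, record: logging.LogRecord) -> None:
           _records.append(f"{record.created} {record.name} {record.getMessage()}")

   logging.getLogger().addHandler(_BufferHandler())

   class SelfReportingAgent(aa.Agent):
       """Serves the log records captured in its own process via an action."""

       @aa.action
       async def report_logs(self) -> list[str]:
           # A real version might filter to records relevant to this agent,
           # or clear the buffer after reporting.
           return list(_records)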
.. _analysing:

Analysing wide records
======================

At small enough scale - which is actually quite a large number of tasks, given the volumes of data involved - parsing logs into a Python session and using standard Python tools like list comprehensions is a legitimate way to analyse things, rather than treating this approach as something awkwardly shameful that will be replaced by The Real Thing later. This is especially appealing to Parsl users, who tend to be Python data science literate anyway.

That isn't the only approach, and this is also modular: you get the records and can analyse them using whatever tools you personally find appropriate.

Adventure: All events for a task, in two aspects/presentations
--------------------------------------------------------------

.. index::
   aspect
   presentation

TODO: move this from elsewhere, tidy up/modularise the code so it is presentable as a good first example. Show what it looks like: with monitoring.db import only; with the full observability prototype.

emphasise: the first one is available now without needing to modify Parsl core.

emphasise: the same analytics code gives different results without needing much modification, given different *aspects* / *presentations* of the same run - see :ref:`partialdata`

emphasise: integration: there are resource records for tasks via monitoring, and htex internals via JSON logs: neither is a superset of the other.

Task events from the monitoring.db presentation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Import the monitoring.db created by unmodified master Parsl, and see how many event records there are:

.. code-block:: python3

   >>> from parsl.observability.import_monitoring_db import import_db
   >>> l=import_db("runinfo/monitoring.db")
   >>> len(l)
   14596

Here's an example event record:

.. code-block:: python3

   >>> l[0]
   {'parsl_dfk': 'a08cc383-927a-4ce8-926b-f31e52e6edc2', 'parsl_task_id': 0, 'parsl_try_id': 0, 'parsl_task_status': 'pending', 'created': 1764589563.205463}

Now identify all the tasks, keyed hierarchically by DFK ID and task number:

.. code-block:: python3

   >>> tasks = { (event['parsl_dfk'], event['parsl_task_id']) for event in l if 'parsl_dfk' in event and 'parsl_task_id' in event }
   >>> len(tasks)
   2432

and pick one randomly:

.. code-block:: python3

   >>> import random
   >>> random.choice(list(tasks))
   ('a08cc383-927a-4ce8-926b-f31e52e6edc2', 72)
That's task 72 of run ``a08cc383-927a-4ce8-926b-f31e52e6edc2``.

Now pick out all the records that are labelled as part of that task, and print them nicely:

.. code-block:: python3

   >>> events = [event for event in l if event.get('parsl_dfk', None) == 'a08cc383-927a-4ce8-926b-f31e52e6edc2' and event.get('parsl_task_id', None) == 72]
   >>> for event in sorted(events, key=lambda event: float(event['created'])): print(event)
   ...
   {'parsl_dfk': 'a08cc383-927a-4ce8-926b-f31e52e6edc2', 'parsl_task_id': 72, 'parsl_try_id': 0, 'parsl_task_status': 'pending', 'created': 1764589576.775231}
   {'parsl_dfk': 'a08cc383-927a-4ce8-926b-f31e52e6edc2', 'parsl_task_id': 72, 'parsl_try_id': 0, 'parsl_task_status': 'launched', 'created': 1764589577.435498}
   {'parsl_dfk': 'a08cc383-927a-4ce8-926b-f31e52e6edc2', 'parsl_task_id': 72, 'parsl_try_id': 0, 'parsl_task_status': 'running', 'created': 1764589577.472009}
   {'parsl_try_id': 0, 'parsl_task_id': 72, 'parsl_dfk': 'a08cc383-927a-4ce8-926b-f31e52e6edc2', 'created': 1764589577.514019, 'resource_monitoring_interval': 1.0, 'psutil_process_pid': 10680, 'psutil_process_memory_percent': 1.2360561455881223, 'psutil_process_children_count': 1.0, 'psutil_process_time_user': 1.26, 'psutil_process_time_system': 0.21000000000000002, 'psutil_process_memory_virtual': 416145408.0, 'psutil_process_memory_resident': 203653120.0, 'psutil_process_disk_read': 32585809.0, 'psutil_process_disk_write': 20895.0, 'psutil_process_status': 'sleeping', 'psutil_cpu_num': '3', 'psutil_process_num_ctx_switches_voluntary': 59.0, 'psutil_process_num_ctx_switches_involuntary': 396.0}
   {'parsl_try_id': 0, 'parsl_task_id': 72, 'parsl_dfk': 'a08cc383-927a-4ce8-926b-f31e52e6edc2', 'created': 1764589577.566039, 'resource_monitoring_interval': 1.0, 'psutil_process_pid': 10680, 'psutil_process_memory_percent': 1.2366030730861701, 'psutil_process_children_count': 1.0, 'psutil_process_time_user': 1.28, 'psutil_process_time_system': 0.22, 'psutil_process_memory_virtual': 416145408.0, 'psutil_process_memory_resident': 203661312.0, 'psutil_process_disk_read': 32995622.0, 'psutil_process_disk_write': 26441.0, 'psutil_process_status': 'sleeping', 'psutil_cpu_num': '3', 'psutil_process_num_ctx_switches_voluntary': 69.0, 'psutil_process_num_ctx_switches_involuntary': 454.0}
   {'parsl_try_id': 0, 'parsl_task_id': 72, 'parsl_dfk': 'a08cc383-927a-4ce8-926b-f31e52e6edc2', 'created': 1764589577.600372, 'resource_monitoring_interval': 1.0, 'psutil_process_pid': 10680, 'psutil_process_memory_percent': 1.2366030730861701, 'psutil_process_children_count': 1.0, 'psutil_process_time_user': 1.3, 'psutil_process_time_system': 0.23, 'psutil_process_memory_virtual': 416145408.0, 'psutil_process_memory_resident': 203444224.0, 'psutil_process_disk_read': 33404930.0, 'psutil_process_disk_write': 33922.0, 'psutil_process_status': 'sleeping', 'psutil_cpu_num': '3', 'psutil_process_num_ctx_switches_voluntary': 74.0, 'psutil_process_num_ctx_switches_involuntary': 465.0}
   {'parsl_dfk': 'a08cc383-927a-4ce8-926b-f31e52e6edc2', 'parsl_task_id': 72, 'parsl_try_id': 0, 'parsl_task_status': 'running_ended', 'created': 1764589577.646802}
   {'parsl_dfk': 'a08cc383-927a-4ce8-926b-f31e52e6edc2', 'parsl_task_id': 72, 'parsl_try_id': 0, 'parsl_task_status': 'exec_done', 'created': 1764589577.671692}
So what comes out is some records which reflect the change in the task status as seen by the monitoring system (which includes the ``running`` and ``running_ended`` statuses that aren't known to the DFK) and, while the task is running, some resource monitoring records.

This is basically a reformatting of records you could get by running SQL queries against ``monitoring.db``, which is unsurprising: the only data source used was that database file.

Task events from a JSON logging presentation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I'm going to look at that same task (task 72 of run ``a08cc383-927a-4ce8-926b-f31e52e6edc2``) from the perspective of JSON logs now -- Parsl modified to output richer log files in JSON format with more machine readable metadata. TODO: reference the relevant generating events section.

All I am going to change is the importer command. I'm going to use the same selector and printing code shown above. So what we should get is a different *presentation* or *aspect* of the same task.
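For concreteness, the swap looks something like this; the JSON-log importer name and path here are illustrative, not necessarily what the prototype calls them. The point is that only the import line of the session changes:

.. code-block:: python3

   >>> # hypothetical JSON-log importer name, standing in for import_db above;
   >>> # the selector and printing code from the previous section is unchanged
   >>> from parsl.observability.import_json_logs import import_logs
   >>> l = import_logs("runinfo/000/")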
Adventure: The minimal change necessary to get htex task logs into the above task trace
----------------------------------------------------------------------------------------

This is probably one log line - to perform the join between IDs - plus JSON logs.

TODO: present the changes I've done, but minimised to only this one change.

Adventure: blog: Visualization for task prioritisation
-------------------------------------------------------

(two graphs that are already in parsl-visualize but probably-buggy - see #4021)

This uses the replay-monitoring.db approach with no runtime changes, because the work I did there was in parsl master, but I want to do custom visualizations.

[TODO: link to blog post]

prioritisation part 2: by task type
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Work towards a second blog post here, now most of the mechanics are worked out.

Step 2 of that: this was a second requirement on prioritisation from DESC.

Use an A->B1/B2->C three step diamond-dag, because it's a bit less trivial.

Visualization of task types, for Jim's follow-on question: how can we adapt step 1 to colour by app name? This is not well presented in parsl-visualize, because that focuses on state transitions rather than on app identity as the primary colour-key.

Visualisation also coloured by task-chain/task-cluster, to show a cluster based visualization.

Priority modes: natural (submit-to-htex order, "as unblocked" order), random (priority=random.random()), chain priority by chain depth, chain priority by cluster. The last two should be "the same" in plot 4, I hope. It is unclear what random mode will do, if anything - I guess get more later-unlocked tasks randomly in there? Random is always interesting to me as pushing things away from degenerate cases - in this case, "Cs run last".
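As a sketch of what "chain priority by chain depth" means over the diamond DAG, in plain Python (the dependency table and priority convention here are mine, for illustration):

.. code-block:: python

   from functools import cache

   # Diamond DAG: A -> B1, A -> B2, B1 -> C, B2 -> C
   deps = {"A": [], "B1": ["A"], "B2": ["A"], "C": ["B1", "B2"]}

   @cache
   def depth(task: str) -> int:
       """Chain depth: 1 + the longest chain among the task's dependencies."""
       return 1 + max((depth(d) for d in deps[task]), default=0)

   # Deeper tasks get higher priority, so chains are driven through to their
   # C tasks rather than everything fanning out breadth-first.
   priorities = {task: depth(task) for task in deps}
   assert priorities == {"A": 1, "B1": 2, "B2": 2, "C": 3}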
plot 1: task run/running_ended individual tasks, coloured by parsl app name

plot 2: tasks of each of two kinds, coloured by parsl app name

plot 3: tasks running by type, with no priority and with two different priority schemes.

plot 4: visualisation of end-result completion - i.e. how many C tasks have completed over time, ignoring everything else about the inside - without prioritisation and with my two prioritisation schemes.

Plot 4 should be the top level plot set, because it is an example "goal" of the prioritisation, I think (might be because you want results sooner, might be because C completing means you can delete a load of intermediate temporary data sooner).

From an observability perspective: the task chain identity is not known to Parsl. This is additional metadata that, in observability concepts, is added on by a "higher level system" and joined on at analysis time. The application knows about it, and the querier knows about it; none of the intermediate execution or observability infrastructure knows about it.

1. the status table rerun gives runtimes for plotting based on Parsl level dfk/task/try, but doesn't give any metadata about those, such as app name. In SQL this is added on as a JOIN, and so it is here too (see the sketch after this list) - rerun the tasks table as a sequence of log records. Note that these records don't have a notion of "created", because they are records but aren't from a point in time: they are an already aggregated set of information. Don't let that scare you - observability records don't have to look like the output of a printf!
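A sketch of that analysis-time join, using the record shapes from the monitoring.db session earlier; the ``parsl_app_name`` field and the aggregated metadata record are illustrative:

.. code-block:: python

   # Status events (points in time) and per-task metadata records (not points
   # in time) from the same import; shapes follow the earlier session.
   status_events = [
       {'parsl_dfk': 'dfk-1', 'parsl_task_id': 72, 'parsl_task_status': 'running', 'created': 1764589577.47},
   ]
   task_metadata = [
       {'parsl_dfk': 'dfk-1', 'parsl_task_id': 72, 'parsl_app_name': 'random_uuid'},
   ]

   # Index the metadata by its identity key, then widen each status event,
   # exactly where SQL would use a JOIN.
   by_task = {(m['parsl_dfk'], m['parsl_task_id']): m for m in task_metadata}
   widened = [
       {**event, **by_task.get((event['parsl_dfk'], event['parsl_task_id']), {})}
       for event in status_events
   ]
   assert widened[0]['parsl_app_name'] == 'random_uuid'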
Task flow logs through the whole system
---------------------------------------

Here's a use case that is hard with what exists in master-branch Parsl right now.

I want to know, for a particular arbitrary task, the timings of the task as it is submitted by the user workflow, flows through the DFK, into the htex interchange and worker pool, executes on an htex worker, and flows back to the user, with the timing of each step.

What exists in master Parsl right now is some information in monitoring, and some information in log files. The monitoring information is focused on the high level task model, not on what is happening inside Parsl to run that high level model. Logs as they exist now are extremely ad-hoc, spread around in at least 4 different places, and poorly integrated: for example, log messages sometimes do not contain context about which task they refer to, do not represent that context uniformly (e.g. in a greppable way), and are ambiguous about context (e.g. some places refer to task 1 meaning the DFK-level task 1, and some places refer to task 1 meaning the HTEX-level task 1, which could be something completely different).

As a contrast, an example output of this prototype (as of 2025-10-26) is:
.. code-block:: none

   === About task 358 ===
   2025-10-26 10:29:46.467298 MainThread@117098 Task 358: will be sent to executor htex_Local (parsl.log)
   2025-10-26 10:29:46.467412 MainThread@117098 Task 358: Adding output dependencies (parsl.log)
   2025-10-26 10:29:46.467484 MainThread@117098 Task 358: Added output dependencies (parsl.log)
   2025-10-26 10:29:46.467550 MainThread@117098 Task 358: Gathering dependencies: start (parsl.log)
   2025-10-26 10:29:46.467620 MainThread@117098 Task 358: Gathering dependencies: end (parsl.log)
   2025-10-26 10:29:46.467685 MainThread@117098 Task 358: submitted for App random_uuid, not waiting on any dependency (parsl.log)
   2025-10-26 10:29:46.467752 MainThread@117098 Task 358: has AppFuture: (parsl.log)
   2025-10-26 10:29:46.467818 MainThread@117098 Task 358: initializing state to pending (parsl.log)
   2025-10-26 10:29:46.469992 Task-Launch_0@117098 Task 358: changing state from pending to launched (parsl.log)
   2025-10-26 10:29:46.470113 Task-Launch_0@117098 Task 358: try 0 launched on executor htex_Local with executor id 340 (parsl.log)
   2025-10-26 10:29:46.470240 Task-Launch_0@117098 Task 358: Standard out will not be redirected. (parsl.log)
   2025-10-26 10:29:46.470310 Task-Launch_0@117098 Task 358: Standard error will not be redirected. (parsl.log)
   2025-10-26 10:29:46.470336 MainThread@117129 HTEX task 340: putting onto pending_task_queue (interchange log)
   2025-10-26 10:29:46.470404 MainThread@117129 HTEX task 340: fetched task (interchange log)
   2025-10-26 10:29:46.470815 Interchange-Communicator@117144 Putting HTEX task 340 into scheduler (Pool manager log)
   2025-10-26 10:29:46.471166 MainThread@117162 HTEX task 340: received executor task (Pool worker log)
   2025-10-26 10:29:46.492449 MainThread@117162 HTEX task 340: Completed task (Pool worker log)
   2025-10-26 10:29:46.492742 MainThread@117162 HTEX task 340: All processing finished for task (Pool worker log)
   2025-10-26 10:29:46.493508 MainThread@117129 HTEX task 340: Manager b'4f65802901c6': Removing task from manager (interchange log)
   2025-10-26 10:29:46.493948 HTEX-Result-Queue-Thread@117098 Task 358: changing state from launched to exec_done (parsl.log)
   2025-10-26 10:29:46.494729 HTEX-Result-Queue-Thread@117098 Task 358: Standard out will not be redirected. (parsl.log)
   2025-10-26 10:29:46.494905 HTEX-Result-Queue-Thread@117098 Task 358: Standard error will not be redirected. (parsl.log)

This integrates four log files and two task identifier systems into a single sequence of events.

Algebra of rearranging and querying wide events
-----------------------------------------------

These are some of the standard patterns I've found useful enough, and straightforward enough, to turn into library functions in the `parsl.observability` module.

Look at relational algebra for phrasing and concepts.

functorial
~~~~~~~~~~

These operations are *functorial* in the sense that they operate on individual event records without regard for the context those records are in.

Widening:

widen-by-constant: if we import a new log file but we know its broader context some other way, perhaps because it came from a known directory inside a parsl rundir (eg work queue's )

relabelling - to make names align from multiple sources, or to add distinction from multiple sources
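A sketch of those two functorial operations over the dict-shaped records used throughout this report; the function names are mine, not necessarily the prototype's:

.. code-block:: python

   def widen_by_constant(events, extra):
       """Attach the same known context (e.g. which rundir a file came from)
       to every record from one source."""
       return [{**event, **extra} for event in events]

   def relabel(events, mapping):
       """Rename keys, e.g. to make two sources' names align, or to keep
       two sources' similarly-named fields distinct."""
       return [{mapping.get(k, k): v for k, v in event.items()} for event in events]

   wq_events = [{"task": 5, "state": "done"}]
   widened = widen_by_constant(wq_events, {"parsl_dfk": "a08cc383-..."})
   aligned = relabel(widened, {"task": "wq_task_id"})
   assert aligned == [{"wq_task_id": 5, "state": "done", "parsl_dfk": "a08cc383-..."}]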
non-functorial
~~~~~~~~~~~~~~

These operations are *not functorial*. For example, widening-by-implication copies information between event records that are somehow grouped into a collection together, so cannot be implemented on a record-by-record basis.

post-facto relationship establishment using grouping and key implication

used where you might use a ``JOIN`` in SQL.

.. index::
   relational algebra

notion of identity and key-sequences: eg. parsl_dfk/parsl_task_id is a globally unique identifier for a parsl task across time and space, and so is parsl_dfk/executor_label/block_number or parsl_dfk/executor_label/manager_id/worker_number -- although manager ID is also (in short form) globally unique. This is distinct from the hierarchical relations between entities - although hierarchical identity keys will often line up with execution hierarchy.

Peter Buneman's XML keys work did nested sequences of keys for identifying XML fragments, c. year 2000.

Joins can send info back in time: if we have a span but don't know which parsl task it belongs to at the start, only later, we can use joins to bring that information from the future.

keys imply key operator
^^^^^^^^^^^^^^^^^^^^^^^

[l_keys] implies [r_key] over [collection]:

* if any log selected by l_keys contains an r_key, that r_key is unique (auto-check that) and should be attached to every log record selected by l_keys - see the sketch after this list. Use case: widening the task reception span in idris2interchange to be labelled with htex_task_id.

* this is functional dependency: https://en.wikipedia.org/wiki/Functional_dependency
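A sketch of that operator, with the uniqueness auto-check; ``implies`` is my naming for the pattern as it might appear in the library, assuming plain-dict records:

.. code-block:: python

   def implies(events, l_keys, r_key):
       """Functional-dependency widening: within each group of records sharing
       the same l_keys values, the r_key value must be unique; attach it to
       every record of the group."""
       found = {}
       for event in events:
           if all(k in event for k in l_keys) and r_key in event:
               group = tuple(event[k] for k in l_keys)
               if found.setdefault(group, event[r_key]) != event[r_key]:
                   raise ValueError(f"{l_keys} does not determine {r_key} for {group}")
       widened = []
       for event in events:
           if all(k in event for k in l_keys):
               group = tuple(event[k] for k in l_keys)
               if group in found:
                   event = {**event, r_key: found[group]}
           widened.append(event)
       return widened

   # e.g. a span that only learns its htex task id in a later record:
   span = [
       {"span_id": "s1", "event": "start"},
       {"span_id": "s1", "event": "got_task", "htex_task_id": 340},
   ]
   assert all(e["htex_task_id"] == 340 for e in implies(span, ["span_id"], "htex_task_id"))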
Fixpoint notions might need to be incorporated into the query model in Python code (so that a fixpoint can be converged to across non-local widening queries - see the idris2interchange usecase notes).

There are lots of different identifier spaces, loosely structured and not necessarily hierarchical: for example, an htex task is not necessarily "inside" a Parsl task, as htex can be used outside of a Parsl DFK (which is where the notion of a Parsl task lives). An htex task often runs in a unix process, but that process also runs many htex tasks, and an htex task also has extent outside of that worker process: there's no containment relationship either way.

Adventure: Browser UI
---------------------

.. index::
   web browser

What might a browser UI look like for this?

Compare parsl-visualize. Compare scrolling through logs, but with some more interactivity (eg. click / choose "show me logs from same dfk/task_id").

But the parsl-visualize UI is so limited, it only has a handful of graphs to recreate. And some of them do not make sense to me, so I would not recreate them.
I am not super excited about building UIs, but it would probably be interesting to build something simple that can do a few queries and graphs, to demonstrate log analysis in clickable form.

And then I could put in the analyses I have made (other graphs/reports) too, and also have it work with academy logs right away, and be ready to pull in other JSON log files as a more advanced implementation motivating JSON/wide events.

Use Python and matplotlib, no web-specific stuff, to promote people who have done local scripting putting new plots into the UI, and to promote using the graph code from the visualiser in their own local scripting.
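A sketch of the kind of plot code that should move freely between a local script and such a UI, working over the event-record lists used earlier (field names follow the monitoring.db session above):

.. code-block:: python

   import matplotlib.pyplot as plt

   def plot_status_counts(events, status):
       """Cumulative count of tasks reaching a given status over time."""
       times = sorted(
           event["created"]
           for event in events
           if event.get("parsl_task_status") == status
       )
       plt.step(times, range(1, len(times) + 1), where="post", label=status)

   # e.g. against the monitoring.db import from earlier:
   #   l = import_db("runinfo/monitoring.db")
   #   plot_status_counts(l, "exec_done")
   #   plt.xlabel("time (unix seconds)"); plt.ylabel("tasks"); plt.legend(); plt.show()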
Make it able to address a whole collection of monitoring.db runs at once - not only one chosen workflow.

Use a parsl-aware list of quasi-hierarchical key names to drive a narrow-down UI:

eg: pick dfk. Then pick: parsl task, then parsl try; or executor, then executor task; or executor, then block, then task.

htex instance/block/manager/worker/htex_task

htex instance/block/manager/worker is an execution location - how an htex instance is identified is different between real Parsl and GC: in real parsl, it's a dfk id/executor label.

Are all graphs relevant for all key selections? Or should eg. a block duration/count graph only appear in certain situations? eg. if we've focused on one task-try, does that mean... no block status info and so no block graph? Graphs could be enabled by: "if you see records like this, this graph is relevant". That would allow eg. enabling htex or WQ specific plots if we see (with more merged info) some htex or WQ specific data. If we only see academy or GC logs, the UI should only report about them.

Recreate the block vs task count graph from Matthew's paper.

Aim for the first iteration to work against the current monitoring.db format, so it can be tried out in a separate install against production runs, distinct from all other observability work. Extensibility there right from the start, to allow that to extend for importing new data and plugging in plots and reports about new data.

Two obvious non-monitoring.db extensions: what's happening with managers in blocks; what's happening with work queue. These are both executor specific, and don't fit the monitoring.db schema so well - so they are clear demos of what could be done better.

Idea: Streaming-fold web UI
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. index::
   idea; streaming-fold web UI

What operators can be built with a streaming-fold, to give live updates as logs come in (eg. tail from a filesystem in the simplest case)?

Joins are the hard bit there, I think - but a fundep operator is at least constrained in its behaviour: cache keyed-but-unjoined blocks, and if we see a key record, emit the whole block and forget it.
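A sketch of that streaming behaviour, reusing the functional-dependency idea from the algebra section, again with dict records and made-up names:

.. code-block:: python

   def streaming_implies(records, l_keys, r_key):
       """Generator form of the fundep widening operator: buffer records whose
       r_key is not yet known; when a record supplies it, flush the buffered
       block widened and forget it; pass everything else straight through."""
       pending = {}   # group key -> list of buffered records
       known = {}     # group key -> r_key value, once seen
       for record in records:
           group = tuple(record.get(k) for k in l_keys)
           if r_key in record:
               known.setdefault(group, record[r_key])
               # flush anything buffered for this group, widened with r_key
               for buffered in pending.pop(group, []):
                   yield {**buffered, r_key: known[group]}
               yield record
           elif group in known:
               yield {**record, r_key: known[group]}
           else:
               pending.setdefault(group, []).append(record)

   stream = [
       {"span_id": "s1", "event": "start"},
       {"span_id": "s1", "event": "got_task", "htex_task_id": 340},
       {"span_id": "s1", "event": "end"},
   ]
   assert [r.get("htex_task_id") for r in streaming_implies(stream, ["span_id"], "htex_task_id")] == [340, 340, 340]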
Spend 1 day prototyping this.

Getting started
---------------

.. code-block:: python

   >>> import globus_compute_sdk as s
   >>> e=s.Executor(endpoint_id='5c7202b5-d022-4534-a399-21d4356129be')
   >>> # authorization happens here
   >>> assert e.submit(abs, -7).result() == 7

In this environment, the only task reference is in ``~/.globus_compute/uep.5c7202b5-d022-4534-a399-21d4356129be.bac77271-9a60-cad7-c4e9-0eb689ddf4d1/GlobusComputeEngine-HighThroughputExecutor/block-0/beb8bc8bb003/worker_0.log``, using htex task numbers. This is not correlated with anything visible on the submit side -- which is a UUID globus compute task ID:

.. code-block:: python

   >>> f=e.submit(abs, -8)
   >>> f.task_id
   '8b3da38f-f889-4ae1-8d31-68ff2f830876'

Suggestion: work out how to correlate GC task IDs with htex task IDs. Note that htex task IDs are not unique within the endpoint directory, because multiple HTEXs (over time) log into the same directory.
Block IDs are also not unique, because of the same log directory conflation.

Parsl work with extra debug info is likely to give more task information here, but all still correlated by htex task ID, which as mentioned above is not even unique within an endpoint.

Launching an academy agent
--------------------------

There are no academy agents defined by default in an Academy install. If I manually define one in my submit container (in a disposable place, not shared), what happens when I try to launch it?

This works to a remote endpoint, from one cloudlump container to another, which surprises me a bit because the MyAgent code must be being conveyed by some bowels of GC serialization...

.. code-block:: python

   import asyncio

   import globus_compute_sdk as gc

   import academy.agent as aa
   import academy.manager as am
   import academy.exchange as ae

   import concurrent.futures as cf

   print("importing myagent main")


   class MyAgent(aa.Agent):
       @aa.action
       async def seven(self):
           print(f"something for stdout from MyAgent {self!r}")
           import os
           return (7, os.getpid(), os.uname())


   if __name__ == "__main__":
       async def main():
           # async with await am.Manager.from_exchange_factory(factory=ae.HttpExchangeFactory(auth_method='globus', url="https://exchange.academy-agents.org"), executors=cf.ProcessPoolExecutor()) as m:
           async with await am.Manager.from_exchange_factory(factory=ae.HttpExchangeFactory(auth_method='globus', url="https://exchange.academy-agents.org"), executors=gc.Executor(endpoint_id='5c7202b5-d022-4534-a399-21d4356129be')) as m:
               print(f"with manager {m}")
               h = await m.launch(MyAgent())
               print(f"launched agent with handle {h}")
               s = await h.seven()
               print(f"agent seven result is {s}")
               assert s[0] == 7

       asyncio.run(main())
But if I put the agent in its own agentcode.py module, then I get a remote deserialization error:

.. code-block:: none

   ...
   File "/venv/lib/python3.13/site-packages/dill/_dill.py", line 452, in load
     obj = StockUnpickler.load(self)
   File "/venv/lib/python3.13/site-packages/dill/_dill.py", line 442, in find_class
     return StockUnpickler.find_class(self, module, name)
            ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
   ModuleNotFoundError: No module named 'agentcode'

which is consistent: in the first example, dill is used to convey the definitions; in the second case, pickle thinks it can do an ``import`` and so never gets to the point of dill conveying the definitions.
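That split can be seen directly with dill, without Globus Compute in the way - a minimal sketch, assuming dill's default by-value handling of ``__main__`` definitions:

.. code-block:: python

   import dill

   # Run as a script: MyAgent's module is __main__, which dill serializes
   # by value, so the receiving side needs no matching module on disk.
   class MyAgent:
       pass

   payload = dill.dumps(MyAgent)
   assert dill.loads(payload) is not None

   # By contrast (assumption about dill's default behaviour): a class imported
   # from agentcode.py is pickled as a module/name reference; loads() then
   # effectively does "import agentcode" on the receiving side, giving
   # ModuleNotFoundError if that file isn't installed there.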
Looking at GC-endpoint-side academy logs
----------------------------------------

For example, I want to see the agent action invocations. There are two paths here I might expect:

- some kind of endpoint/worker level, as the academy docs suggest, running as a worker initialization in the process worker pool. GC workers (and indeed, HTEX workers) don't have a configuration interface that supports that well, although in pure Parsl this observability project is working towards that -- see the configurability section nearer the start. I might expect that as part of general observability of *the whole system*, rather than hoping that the *other components* are themselves separately debuggable.

- agent-level log routing: start something at agent start, shut it down at agent end. There are two existing approaches:

  - I've prototyped making agents able to capture "their own" logs and report them via an agent action. I prototyped this with Logan, and mentioned it elsewhere in this report.

  - Alok added an initialize logging feature to manager launching of agents to insert a log file capture. There is no facility there for conveying the log file anywhere else.

These different approaches are not contradictory: the Python logging mechanism can cope with multiple log handlers.
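A minimal sketch of that composition, using only the stdlib ``logging`` module (the logger name and handler choices are illustrative, not Academy's actual configuration):

.. code-block:: python

    import logging

    logger = logging.getLogger("academy.agent.demo")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)

    # e.g. a manager-installed log file capture...
    logger.addHandler(logging.FileHandler("agent.log"))

    # ...alongside an agent-side handler that could feed log reporting
    logger.addHandler(logging.StreamHandler())

    # one record, delivered to both handlers
    logger.info("this record reaches both handlers")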
A goal: I want to look at agent activity - logs or visualization? - of stuff on the submit side and the remote side.

For example: I want to run my fibonacci agent test and see the agent logging its internal state as it changes, as well as seeing the client reporting what it sees.

I should be able to use multiple approaches to demonstrate how these events can be retrieved and give the same (or similar) output on the analysis side, and the differences in characteristics of the approaches would be interesting to comment on - with the logging and analysis code the same for all event-movement approaches.

Academy manager has an option to initialize remote logging at the start of agent execution -- that's one of two remote logging hooks that exist already (the other is on the user side, in agent startup). So let's see how they compare. The main questions: how much ahead-of-startup logging do I get from one but not the other? How much do I want my own free-coding configurability? e.g. for formatting or octopus-style?

e.g. ``manager.launch(init_logging)`` wants to only run once, even though it has logfiles and log levels specified per-agent. That's a process vs "entity" inconsistency to think about. It also doesn't have an extralog specifier. Rather than wiring in yet another one, look at my more general log configuration.

This doesn't have a configurable format, and it captures bits of logs from other stuff (such as the parsl worker_log entries), which is desirable sometimes, but perhaps more on an endpoint-wide basis.

It looks like it also doesn't *deinitialize* logging - because it's a hack scoped around the process, rather than a principled log bracketing.

Now look at what happens if I make the agent initialize its own logging - especially addressing the rough edges around deinitialization and parameterisation.
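A hedged sketch of what that could look like - the hook names and the per-agent log file are assumptions for illustration, not Academy's actual API:

.. code-block:: python

    import logging

    import academy.agent as aa


    class SelfLoggingAgent(aa.Agent):
        async def agent_on_startup(self) -> None:
            # assumed startup hook: install a parameterisable per-agent handler
            self._log_handler = logging.FileHandler("agent.log")  # hypothetical path
            self._log_handler.setFormatter(
                logging.Formatter("%(created)f %(name)s %(levelname)s %(message)s"))
            logging.getLogger().addHandler(self._log_handler)

        async def agent_on_shutdown(self) -> None:
            # the deinitialization step that process-scoped log init lacks:
            # a principled bracket around the agent's lifetime
            logging.getLogger().removeHandler(self._log_handler)
            self._log_handler.close()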
The rest
========

Debugging monitoring performance as part of developing this prototype
----------------------------------------------------------------------

findcommon tool - finds the common task sequence for templated logs and outputs their sequence, like this:

First run parsl-perf like this:

.. code-block:: none

    parsl-perf --config parsl/tests/configs/htex_local.py
    [...]
    ==== Iteration 3 ====
    Will run 58179 tasks to target 120 seconds runtime
    Submitting tasks / invoking apps
    All 58179 tasks submitted ... waiting for completion
    Submission took 103.880 seconds = 560.059 tasks/second
    Runtime: actual 137.225s vs target 120s
    Tasks per second: 423.967

    Tests complete - leaving DFK block

which executes a total of around 60000 tasks.

First, note that this prototype benchmarks significantly slower on my laptop than the contemporaneous master branch.

That's perhaps unsurprising: this benchmark is incredibly log sensitive, as my previous posts have noted (TODO: link to blog post and to R-performance report) - around 900 tasks per second on a 120 second benchmark - and this prototype adds a lot of log output. Part of the path to productionisation would be understanding and constraining this.
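To give a feel for why the benchmark is so log sensitive (my own illustrative measurement, not part of the prototype): each task in the trace below generates roughly 20 log records, so per-record handler overhead multiplies quickly at hundreds of tasks per second.

.. code-block:: python

    import logging
    import time

    logging.basicConfig(filename="/tmp/overhead.log", level=logging.DEBUG)
    log = logging.getLogger("overhead")

    n = 100_000
    t0 = time.perf_counter()
    for i in range(n):
        log.debug("Task %s: synthetic event", i)
    elapsed = time.perf_counter() - t0

    # at ~20 records per task, achievable tasks/second is bounded by
    # roughly 1/20th of this figure
    print(f"{n / elapsed:.0f} records/second")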
From that output above, it is clear that the submission loop is taking a long time: 100 seconds, with about 35 seconds of execution happening afterwards. The Parsl core should be able to process task submissions much faster than 560 tasks per second. So what's taking up time there?

Run findcommon (a could-be-modular-but-isn't helper from this observability prototype) on the result:

.. code-block:: none

    0.0: Task %s: will be sent to executor htex_local
    0.00023320618468031343: Task %s: Adding output dependencies
    0.0004515730863634116: Task %s: Added output dependencies
    0.000672943356177761: Task %s: Gathering dependencies: start
    0.0008952160973877195: Task %s: Gathering dependencies: end
    0.0011054732824941516: Task %s: submitted for App app, not waiting on any dependency
    0.001316777690507145: Task %s: has AppFuture: %s
    0.0015680651123983979: Task %s: initializing state to pending
    23.684763520758917: HTEX task %s: putting onto pending_task_queue
    23.68483662049256: HTEX task %s: fetched task
    23.684863335335613: Task %s: changing state from pending to launched
    23.6850573607536: Task %s: try %s launched on executor %s with executor id %s
    23.685248910492184: Task %s: Standard out will not be redirected.
    23.685424046734745: Task %s: Standard error will not be redirected.
    23.686276226995773: Putting HTEX task %s into scheduler
    23.686777094898495: HTEX task %s: received executor task
    23.687025900194147: HTEX task %s: Completed task
    23.687268549254735: HTEX task %s: All processing finished for task
    23.687837933843614: HTEX task %s: Manager %r: Removing task from manager
    23.688483699079185: Task %s: changing state from launched to exec_done

In this stylised synthetic task trace, a task takes an average of 23 seconds to go from the first event (choosing executor) to the final mark as done. That's fairly consistent with the parsl-perf output - I would expect the average here to be around half the time from parsl-perf's submission time to completion time (30 seconds).

What's useful with findcommon's output is that it shows the insides of Parsl's workings in more depth: 20 states instead of parsl-perf's start, submitted, end. And the potential exists to calculate other statistics on these events.
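A hedged sketch of the kind of computation findcommon performs (my reconstruction, not the tool's actual code): given wide-event records carrying a message template, a task ID and a timestamp, average each template's offset from its task's first event, then print the templates in offset order.

.. code-block:: python

    from collections import defaultdict


    def common_sequence(records):
        """records: dicts with 'created' (float seconds), 'template', 'task_id'."""
        first_seen = {}              # task_id -> timestamp of that task's first event
        offsets = defaultdict(list)  # template -> offsets from task start
        for r in sorted(records, key=lambda r: r["created"]):
            t0 = first_seen.setdefault(r["task_id"], r["created"])
            offsets[r["template"]].append(r["created"] - t0)
        mean = lambda xs: sum(xs) / len(xs)
        for template, offs in sorted(offsets.items(), key=lambda kv: mean(kv[1])):
            print(f"{mean(offs)}: {template}")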
So in this average case, there's something slow happening between setting the task to pending, and then the task "simultaneously" being marked as launched on the submit side and the interchange receiving it and placing it in the pending task queue.

That's a bit surprising - tasks are meant to accumulate in the interchange, not before the interchange.

So let's perform some deeper investigations -- observability is for Serious Investigators, and so it is fine to be hacking on the Parsl source code to understand this more. (By hacking, I mean making temporary changes for the investigation that will likely be thrown away rather than integrated into master.)

Let's flesh out the whole submission process with some more log lines. On the DFK side, that's pretty straightforward: the observability prototype has a per-task logger which, if you have the task record, will attach log messages to the task.

For example, here are the changes to add a log around the first call to launch_if_ready, which is probably the call that is launching the task.

.. code-block:: none

    +        task_logger.debug("TMP: dependencies added, calling launch_if_ready")
             self.launch_if_ready(task_record)
    +        task_logger.debug("TMP: launch_if_ready returned")

My suspicion is that this is around the htex submission queues, with a secondary submission around the launch executor, so to start with I'm going to add more logging around that.

Then rerun parsl-perf and findcommon, without modifying either, and it turns out to be that secondary submission, the launch executor:

.. code-block:: none

    0.0020453477688227: Task %s: TMP: submitted into launch pool executor
    0.002256870306434224: Task %s: TMP: launch_if_ready returned
    14.073021359217009: Task %s: TMP: before submitter lock
    [...]
    14.078550367412324: Task %s: changing state from launched to exec_done

Don't worry too much about the final time (14s) changing from 23s in the earlier run -- that's a characteristic of parsl-perf batch sizes that I'm working on in another branch.

If that's the case, I'd expect the thread pool executor, previously much faster than htex, to show similar characteristics:
Surprisingly, although the throughput is not much higher... the trace looks very different timewise. The bulk of the time still happens at the same place, but there isn't so much waiting there - less than a second on average. That's possibly because the executor can get through tasks much faster, so the queue doesn't build up so much?

.. code-block:: none

    ==== Iteration 2 ====
    Will run 68976 tasks to target 120 seconds runtime
    Submitting tasks / invoking apps
    All 68976 tasks submitted ... waiting for completion
    Submission took 117.915 seconds = 584.965 tasks/second
    Runtime: actual 118.417s vs target 120s
    Tasks per second: 582.485

.. code-block:: none

    0.0: Task %s: will be sent to executor threads
    0.00014157412110423425: Task %s: Adding output dependencies
    0.0002898652725047201: Task %s: Added output dependencies
    0.000425118042214259: Task %s: Gathering dependencies: start
    0.0005696294991521399: Task %s: Gathering dependencies: end
    0.0006999648174108608: Task %s: submitted for App app, not waiting on any dependency
    0.0008433702196425292: Task %s: has AppFuture: %s
    0.0010710284919573986: Task %s: initializing state to pending
    0.0011652027385929428: Task %s: TMP: dependencies added, calling launch_if_ready
    0.0012973675719411494: Task %s: submitting into launch pool executor
    0.0014397921284467212: Task %s: submitted into launch pool executor
    0.0015767665501452072: Task %s: TMP: launch_if_ready returned
    0.3143575128217656: Task %s: before submitter lock
    0.31448896150771743: Task %s: after submitter lock, before executor.submit
    0.3146383380777917: Task %s: after before executor.submit
    0.3147926810507091: Task %s: changing state from pending to launched
    0.3149239369413048: Task %s: try 0 launched on executor threads
    0.31504996538376506: Task %s: Standard out will not be redirected.
    0.31504996538376506: Task %s: Standard out will not be redirected.
    0.3151759985402679: Task %s: Standard error will not be redirected.
    0.3151759985402679: Task %s: Standard error will not be redirected.
    0.315319734920821: Task %s: changing state from launched to exec_done

So maybe I can do some graphing of events to give more insight than these averages are showing. A favourite of mine from previous monitoring work is how many tasks are in each state at each moment in time. I'll have to implement that for this observability prototype, because it's not done already, but once it's done it should be reusable, and it should share most infrastructure with ``findcommon``. Especially relevant is discovering where bottlenecks are: it looks like this is a Parsl-affecting performance regression that might be keeping workers idle. For example, we could ask: does the interchange have "enough" tasks at all times to keep dispatching? With 8 cores on my laptop, I'd like it to have at least 8 tasks or so inside htex at any one time, but this looks like it might not be true. Hopefully graphing will reveal more. It's also important to note that this findcommon output shows latency, not throughput -- though high latency at particular points is an indication of throughput problems.
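A hedged sketch of that tasks-per-state plot over the same assumed wide-event records (a reconstruction, not existing prototype code): replay the "changing state from X to Y" events and keep a running count of tasks in each state.

.. code-block:: python

    from collections import Counter


    def state_occupancy(events):
        """events: dicts with 'created', 'old_state', 'new_state'."""
        counts = Counter()
        series = []                         # (timestamp, {state: task count})
        for e in sorted(events, key=lambda e: e["created"]):
            if e["old_state"] is not None:  # a task's first event has no predecessor state
                counts[e["old_state"]] -= 1
            counts[e["new_state"]] += 1
            series.append((e["created"], dict(counts)))
        return series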
Or, I can look at how many tasks are in the interchange over time: there either is, or straightforwardly can be, a log line for that. That will fit a different model to the above log lines, which are per-task. Instead it's a metric on the state of one thing only: the interchange, of which there is only one, at least for the purposes of this investigation.

Add a new log line like this into the interchange at a suitable point (after task queueing, for example):

.. code-block:: none

    +        ql = len(self.pending_task_queue)
    +        logger.info(f"TMP: there are {ql} tasks in the pending task queue",
    +                    extra={"metric": "pending_task_queue_length", "queued_tasks": ql})

Now we can either look through the logs by hand to see the value manually, or extract it programmatically and plot it with matplotlib, in an ad-hoc script:

.. code-block:: python

    import matplotlib.pyplot as plt

    from parsl.observability.getlogs import getlogs

    logs = getlogs()

    # looking for these logs:
    #   "metric": "pending_task_queue_length", "queued_tasks": ql
    metrics = [(float(l['created']), int(l['queued_tasks']))
               for l in logs
               if 'metric' in l and l['metric'] == "pending_task_queue_length"]

    plt.scatter(x=[m[0] for m in metrics], y=[m[1] for m in metrics])
    plt.show()

and indeed that shows that the interchange queue length almost never goes above length 1, and never above length 10.

That's enough for now, but it's a usecase that shows partial understanding of throughput: we can see from this observability data that the conceptual 50000-task queue, which begins in parsl-perf as a ``for``-loop, doesn't progress fast enough into the interchange's internal queue, and so performance effort should probably be focused on understanding and improving the code path around launch and getting into the interchange queue.
With an almost empty interchange queue, anything happening on the worker side is probably not too relevant, at least for that parsl-perf use case.

This "understand the queue lengths (or implicit queue lengths) towards execution" investigation style has been useful in understanding Parsl performance limitations in the past.

See also
--------

NetLogger - https://dst.lbl.gov/publications/NetLogger.tech-report.pdf

my dnpc work, an earlier iteration of this: more focused on human log parsing, and so very fragile in the face of improving log messages, with not enough context in the human component.

syslog, systemd logging, the Linux kernel ring buffer/dmesg

Buneman XML keys (mentioned above, c. 2000)

Microsoft Power BI: a simple example of how we get this data into something actually novel for academia. Dashboard friendly.
Where's the bottleneck - visualization
--------------------------------------

Based on template analysis - but this could be based on anything that can be grouped and identified.

Review of changes made so far to Parsl and Academy
--------------------------------------------------

This should be part of understanding what sort of code changes I am proposing.

Applying this approach for academy
----------------------------------

As an extreme "data might not be there" case -- perhaps Parsl isn't there at all. What do this code and these techniques look like applied to a similar but very different codebase, Academy, which doesn't have any distributed monitoring at all at the moment? There are ~100 log lines in the academy codebase right now.
How much can this be converted in a few hours, and then analysed in similar ways?

The point here is both considering this as a real logging direction for academy, and as a proof of generality beyond Parsl.

Thoughts:

Academy logging so far has focused on looking pretty on the console: e.g. ANSI colour - that's at the opposite end of the spectrum from what this observability project is trying to log.

Rule of thumb for the initial conversion: whatever is substituted into the human message should also be added as an extras field.
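A minimal sketch of that rule of thumb applied to a hypothetical Academy-style log line (the logger name and field names are illustrative):

.. code-block:: python

    import logging

    logger = logging.getLogger("academy.demo")
    agent_id, action = "agent-123", "seven"  # hypothetical values

    # before: the values exist only inside the human-readable string
    logger.info(f"agent {agent_id} invoked action {action}")

    # after: the same message, with the substituted values also attached as
    # machine-readable extras that analysis tooling can filter on directly
    logger.info(
        f"agent {agent_id} invoked action {action}",
        extra={"agent_id": agent_id, "action": action},
    )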
Acknowledgements
================

chronolog: nishchay, inna

desc: esp david adams, tom glanzman, jim chiang

uiuc: ved

gc: kevin

academy: alok, greg, logan

diaspora: Ryan, Haochen

parsl: matthew