Swift and e-Science

Swift is a system for the rapid and reliable specification, execution, and management of large-scale science and engineering workflows. It supports applications that execute many tasks coupled by disk-resident datasets - as is common, for example, when analyzing large quantities of data or performing parameter studies or ensemble simulations.

For example:

Cancer research: looking for previously unknown protein changes by comparing mass spectrum data with data known about proteome.

A monte-carlo simulation of protein folding, 10 proteins, 1000 simulations for each configuration, inside simulated annealing algorithm with 2x5=10 different parameter values. Each simulation component takes ~ 5 CPU-minutes, so about ~ 1 CPU-year for a whole run; producing 10...100Gb of data.

Target programmers

Scientific programmers use some science-domain specific language to write the "science" bit of their application (eg R for statistics, Root for particle physics).

They aren't "high performance" or "distributed system" programmers.

Want to help them use "big" systems to run their application - eg machines with 10^5 CPU cores.

Traditional MPI (Message Passing Interface) is hard to think about.

Swift tries to provide an easier model that still allows many applications to be expressed, and performed with reasonable efficiency.

SwiftScript is the language for programming in that model.

A brief history of SwiftScript

~2003: VDL - the Virtual Data Language, part of the Virtual Data System
express directed acyclic graphs of unix processes
processes take input and produce output through files
'virtual data' - when needed, materialise data either by copying from elsewhere or by deriving it from other data that is available
Lots of thinking about "graph transformations"

~2006: VDL2 (which became SwiftScript)
key feature: iterating over collections of files in the language
... but it also accidentally became (I think) Turing-complete

~2010: still going - language tweaks, scaling improvements

Post-presentation notes (click to expand)
- Actually development of the VDS project forked around 2006 - one direction was Swift, and the other (Pegasus) continued in the graph-based direction
- Doaitse Swierstra sent this link, saying "It seems as if it was written especially for you": Construct Your Own Favorite Programming Language

The language

This isn't a language tutorial.

I won't explain all the "obvious" looking language constructs like 1+1. Most of that is like "any other language" (though please ask questions).

Two main areas of interest are:

Files and unix programs as in-language constructs
Massive implicit parallelism

An example: sorting and counting lines in a text file

$ cat example.txt 
cat
cat
dog
cat
fish
fish
cat
cat

$ cat example.txt | sort | uniq -c
   5 cat
   1 dog
   2 fish

In real life, sort and uniq would be science application code taking many seconds or minutes to run (~5 mins in protein example), data files might be megabytes each, and we'd be running many more than 2 executables.

`sort | uniq -c` in SwiftScript

type file;

app (file o) sortfile(file i) {
  sort stdin=@i stdout=@o;
}

app (file o) count(file i) {
  uniq "-c" stdin=@i stdout=@o;
}

file input <"example.txt">; // must exist
file intermediate;
file output <"output.txt">;

intermediate = sortfile(input);
output = count(intermediate);

Running this example

$ swift example.swift

Swift svn swift-r3011 (swift modified locally) cog-r2410

RunID: 20100214-1510-ve05qoe4
Progress:

Final status:  Finished successfully:2

$ cat output.txt 
   5 cat
   1 dog
   2 fish

Mappers and file types

file output <"output.txt">;

Declares output to be a variable whose value is stored in the file system rather than in-core.

<"output.txt"> means that the value is stored in a file output.txt (this can be a URL)

This is a simple example with a literal single filename. More complex syntax allows mapping arrays of files (shown later), with more dynamic behaviour (eg generating filename patterns at runtime)

We can omit the <...> mapping expression in which case Swift will make up a filename - useful for intermediate files.

Post-presentation notes (click to expand)
- Mappers can actually be much more complicated than this (a later slide shows how they can be used for arrays, mapping multiple files), and one line of thinking suggests that they should somehow be normal functions rather than something special...

`app` procedures

app (file o) count(file i) {
  uniq "-c" stdin=@i stdout=@o;
}

This is how the real work gets done - by getting some other science-domain specific program to do it.

app procedures execute unix processes, but not like system() or runProcess

The environment in which an app procedure runs is constrainted:
Application will start in "some directory, somewhere". There, it will find its input files, and there it should leave its output files.

Applications need to be referentially transparent (but SwiftScript doesn't clearly define what equivalence is)

Executing an app {} procedure

(it doesn't always work exactly like this, but this is good enough)

Pick an execution site
Transfer input files there (if they are not already cached there)
Put the job in an execution queue at the execution site
Wait for execution to finish
Transfer output files back
Check everything worked ok

`app` abstraction facilitates (1/2):
flexibility in execution site

There are many different execution resources in the world: clusters on your campus, supercomputers, your own laptop.

It is useful to be able to choose and switch between sites, and choose between different mechanisms for accessing a site, because:

your usual site is broken today
someone is developing a better mechanism (higher performance) for submitting to your usual site (ongoing r&d there)
you want to use the combined power of several sites at once (research question: if many sites available, which is best to use?)

Post-presentation notes (click to expand)
- In practice, though, there are deployment issues when getting a new application running on a new site when there is insufficient infrastructure there - for example, if you want to run a script that uses certain libraries or programs that are standard in your field of science, but not standard in general (for example, if you have written code in R - its unlikely to find R installed on a random site). There are various approaches to dealing with that, that do not overlap much/at all with the SwiftScript side of things.
  So what happens is that people tend to gravitate to getting things running on a small number (1 or 2) of sites that are fairly reliable (in terms of not being broken, and in terms of providing sufficient compute capacity).

`app` abstraction facilitates (2/2):
reliability mechanisms

Failure happens a lot in our target environments (integer percentages in some environments) so reliability is not "a nice feature to have" - it is essential.

Retries: if an application execution fails, we try it 2 more times

Restarts: if retries fail, then the whole script fails (eventually). Maybe want to restart manually where we left off. Assume that app blocks are expensive and everything else is cheap, so start the script from beginning again, skipped apps that we've already run (using a log file)

Replication: deals with a softer class of failure. Sometimes an app goes into a queue and sits "forever" (really forever, or perhaps much longer than most other apps). We can launch a new attempt to run the app, without killing the original. When one starts, we kill the other(s)

Post-presentation notes (click to expand)
- This way of implementing restarts looks something like persistent memoisation (as suggested by Doaitse Swierstra), and begins to overlap with the description of later provenance work: if the input parameters truly are what describe everything relevant about the output, then perhaps the same information used for restarts could be used to optimise away parts of other Swift runs. But its hard to describe the input (and the function itself) in some way that is meaningful outside of a single source file and fixed input data set.
- The assumption that "app blocks are expensive and everything else is cheap" is not such a good assumption as SwiftScript programs become larger. In the past it was a problem with large machine generated SwiftScripts (where the compiler took minutes to run). Its also a problem where mappers are expensive (mail thread), and in general will be a problem wherever real computation is done inside Swift.

Invoking

So back to our example application... we've defined sortfile and count as unix programs, we've mapped input and output, and leave intermediate for Swift to map.

So now we can actually run something, using function-like syntax...

intermediate = sortfile(input);
output = count(intermediate);

... which hides all of the remote execution and reliability behaviour.

`sort | uniq -c` again...

type file;

app (file o) sortfile(file i) {
  sort stdin=@i stdout=@o;
}

app (file o) count(file i) {
  uniq "-c" stdin=@i stdout=@o;
}

file input <"example.txt">; // must exist
file intermediate;
file output <"output.txt">;

intermediate = sortfile(input);
output = count(intermediate);

The other key area of interest is...

Massive implicit parallelism

We can declare a mapped array of files: (eg mydata.*.img)

file inp[] <simple_mapper;
                     prefix="mydata.",
                     suffix=".img">;

and iterate over it:

foreach s,i in inp {
  out[i] = f(s); // same as out[i] = f(inp[i]);
}

All iterations can happen in parallel (subject to runtime limits, but could be many thousands of CPUs)

In real use, f might be an app procedure taking 30s, with 10^5 loop iterations.

Execution order is data dependency order

Everything can be executed in parallel, except where there are dataflow dependencies.

Dataflow dependencies are expressed by single assignment variables:

 int a;
 int b;
 int c;
 a = f(c);
 b = g(6);
 c = h(7);

Execution of f will be after h. Execution of g will be unordered wrt f and h.

Extends into (non-app) procedures.
Only concurrency control in SwiftScript - no locks, etc.
Assignment can be "in memory" or giving a file its content.

Post-presentation notes (click to expand)
- Although it is intended that the SwiftScript programs will commonly use many (thousands of) cores, a SwiftScript program can always be executed in a single threaded mode where only a single process executes at once. This will not result in deadlocks (at least, no more than already exist in the language...).

Arrays are not single-assignment

int a[];
int b;
a[0] = 128;
a[1] = 129;
b = sum(a); // pass in the whole array

a is not single assignment. But the elements of a are.

Static analysis of code to see which statements might write to a. When all potential writers are finished, then the array is "closed" for writing. Cannot modify an element once it has been assigned.

Arrays are "monotonic" - we know more over time, and once we know something, it is true forever. A weaker form of single assignment.

Post-presentation notes (click to expand)
- The term monotonic comes from looking at partial order programming, for example http://www.cs.ucla.edu/~stott/pop/. Although partially known values are deliberately not exposed to the programmer, I've found it useful to think about whats going on whilst a program is running; a single assignment variable has all of its valid values as maximal elements and as unique minimal element, bottom, the unknown value. An array has a richer structure: for example [1,?,100] <= [1,10,100], and the maximal elements are closed arrays.
- Are arrays 0-indexed or 1-indexed? Neither. SwiftScript arrays can have elements with any positive integer. There is no requirement that the integers form a continuous range - for example, you can have an array with two elements, a[1] and a[76]. This makes it confusing to ask "what is the laste element in an array?" because that question has multiple meanings: "if I want to sequentially refer to each element in the array, I should run an integer sequence over what range of values?" (in which case the answer is, "use foreach"); "I want to access the last element in the array" (in which case there is no clear answer).

Array deadlock (and other problems)

But there are deadlocks (in practice, and maybe in theory?):

int a[];
foreach i in [1:10] {
 if (i < 9) {
   a[i] = 5;
 } else { // i==10
   int b;
   b = f(a);
 }
}

a will be closed when the whole foreach is finished... but the foreach will never finish because f(a) is waiting for a to close.

Leads to programmer confusion when overly conservative

More static+runtime analysis? Better structures/iterators? (map-like?)

NEWSFLASH: this morning, this particular deadlock was fixed (but others remain)

Post-presentation notes (click to expand)
- The library function @filename can give the filename associated with a mapped array whether its an input or an output. There is another library function @filenames which is not so straightforward. If you ask for @filenames(a) for some mappers, the number of files is not always known, so you can't genate the filenames list ahead of time. (for example, a mapper that maps array elements to file0000.txt, file0001.txt, ...
  This is part of a broader weakness that app procedures cannot return a variable number of files.
  The mapper interface contains something of a hack, defining a mapper with as static or dynamic. When a mapper is static, its possible to use @filenames to get the filenames of a mapped output array.

Performance

Metrics are related to scientific computing focus.

Mostly, what was done with app procedures by an application:

How many CPU-hours in total? (eg 208763 CPU-hours)

How many CPUs in use simultaneously? (eg 2000 CPUs)

In terms of language execution, interested in raw SwiftScript speed where it impedes the above: can we sustain 100 app block launches per second?

How short can you make your SwiftScript program? (so interesting to see how *few* lines of code are written in SwiftScript...)

Lots of stats in IEEE Computer article.

Less concrete idea (1): Provenance

provenance = record of the history of an artifact to help convince you that it is genuine/valuable

In Swift: record what output files were generated - which input files, which programs, where programs were run

Which datafiles used this site? (because we must discard any results from it)

Regenerate interesting results because we've damaged our copy

Functional/dataflow style helps there but many other issues.

Prototype implementation

Post-presentation notes (click to expand)
- The theoretical model for this is closely related to the open provenance model.

Less concrete idea (2): Streaming Datasets

Processing datasets that grow over time - eg a database of fMRI images that is added to as new patients are seen.

Represent the database as a mapped array that is never closed.
We can iterate with foreach over that array, and leave the SwiftScript program running "forever"

Maybe no need to change the language definition much / at all

No implementation, only some mailinglist chatter

As an embedded language or library vs own language

Q: why did you implement this as a new language rather than embedding?

A: An accident of history - we started off making a glorified DAG description language, not a "real programming langauge"

But we can still wonder whether it would be a better or worse idea...

How would we implement:

Out-of-core data and applications
Massive (10^5) multithreading and everything-is-a-future style

for example: in Haskell or Java

Avoiding boilerplate for futures

Want to program using expression syntax closely connected to the high level application; avoid boilerplate mechanics.

SwiftScript:

int p = 5 + f();
int q = 100 + p;

Java?

Future rhs = f(); // assume returns fast
Future lhs = constant(5); // lift pure value
Future p = add(lhs,rhs); // add is future-aware +

Haskell?

p <- f `addM` (return 5)
let q = 100 + p  -- different from line above, but why?

Types

Prototype of type inference (part of a very successful Google Summer of Code project to strengthen Swift type checking)

'Street programmers' are not used to debugging type inference errors - especially the non-locality of error messages. This is a usability problem. So maybe type inference is a bad thing here?

But removing type annotations is 'scripty' - a good thing.

Do we have to have static (possibly inferred) types? No - its a dogma from early on in the development of SwiftScript that may or may not still be relevant.

Post-presentation notes (click to expand)
- Functions/procedures aren't strongly typed - for example, @strcat will take any number of parameters, and each can be of any type (that can be reasonably represented as a string - for example, a numeric type). This is useful from a scripty perspective, but awkwardly fits the model of having strong typing.

Embedding would give more libraries

In the beginning we were expecting people to make things that looked almost like DAGs of programs but "a bit more interesting."

Now people want to do sin and cos and in-memory matrix multiplication

Its a hassle to add wire in existing libraries in other langauges for every new feature.

The implementation is not really suited to in-core data processing - even a single integer has a very large footprint (because originally our 'values' were mostly many-megabyte files, where overhead mattered less)

Areas that I've seen: parsing/printing data files; matrix operations; sin/cos

Would be great to easily import some other languages library collection

Post-presentation notes (click to expand)
- More recently, @java imports have been implemented which allow arbitrary java code to be called

Summary of language

Abstracting files and application execution works very well for hiding away the dirty guts of what is going on.

Ugly in many places, but definitely good enough for what we originally intended.

Surprising (in a good way) that people want to do richer things, but both the language and the implementation have difficulty.

Credits and more info

SWFT group at University of Chicago Computation Institute

Swift website: http://www.ci.uchicago.edu/swift/

IEEE Computer article: Parallel scripting for applications at the petascale and beyond - November 2009; most numbers in this presentation are from there

More TODOs: syntax is fairly random (eg ambiguities and redundancies, although much tidier than before) - that comes from "prototype" syntax that we never tidied up. (similar reason to types being a bit of a mess)

Emphasis and define 'loosely coupled' (as being file-coupled, with large execution times on the order of minutes of components, and with communication only happening at those points)

In comparison with embedded DSL: would be nice to have bigger library, as people shift more of their application from the science-language to SwiftScript

SwiftScript: a language for loosely coupled large scale scientific workflows

Outline

About me

Swift and e-Science

Target programmers

A brief history of SwiftScript

The language

An example: sorting and counting lines in a text file

`sort | uniq -c` in SwiftScript

Running this example

Mappers and file types

`app` procedures

Executing an app {} procedure

`app` abstraction facilitates (1/2):
flexibility in execution site

`app` abstraction facilitates (2/2):
reliability mechanisms

Invoking

`sort | uniq -c` again...

Massive implicit parallelism

Execution order is data dependency order

Arrays are not single-assignment

Array deadlock (and other problems)

Performance

Less concrete idea (1): Provenance

Less concrete idea (2): Streaming Datasets

As an embedded language or library vs own language

Avoiding boilerplate for futures

Types

Embedding would give more libraries

Summary of language

Credits and more info

SwiftScript: a language for loosely coupled large scale scientific workflows

Outline

About me

Swift and e-Science

Target programmers

A brief history of SwiftScript

The language

An example: sorting and counting lines in a text file

sort | uniq -c in SwiftScript

Running this example

Mappers and file types

app procedures

Executing an app {} procedure

app abstraction facilitates (1/2):flexibility in execution site

app abstraction facilitates (2/2):reliability mechanisms

Invoking

sort | uniq -c again...

Massive implicit parallelism

Execution order is data dependency order

Arrays are not single-assignment

Array deadlock (and other problems)

Performance

Less concrete idea (1): Provenance

Less concrete idea (2): Streaming Datasets

As an embedded language or library vs own language

Avoiding boilerplate for futures

Types

Embedding would give more libraries

Summary of language

Credits and more info

`sort | uniq -c` in SwiftScript

`app` procedures

`app` abstraction facilitates (1/2):
flexibility in execution site

`app` abstraction facilitates (2/2):
reliability mechanisms

`sort | uniq -c` again...