SwiftScript: a language for loosely coupled large scale scientific workflows

Ben Clifford
benc@hawaga.org.uk

Presented at
Universiteit Utrecht Software Technology Colloquium
Feb 18th 2010

Subsequently amended

Press space or PgDown or right-arrow for the next slide

Outline

About me

Ben Clifford

benc@hawaga.org.uk

Formerly a staff programmer working on Swift, at the University of Chicago Computation Institute (2006-2009)

Swift and e-Science

Swift is a system for the rapid and reliable specification, execution, and management of large-scale science and engineering workflows. It supports applications that execute many tasks coupled by disk-resident datasets - as is common, for example, when analyzing large quantities of data or performing parameter studies or ensemble simulations.

For example:

Cancer research: looking for previously unknown protein changes by comparing mass spectrum data with data known about proteome.

A monte-carlo simulation of protein folding, 10 proteins, 1000 simulations for each configuration, inside simulated annealing algorithm with 2x5=10 different parameter values. Each simulation component takes ~ 5 CPU-minutes, so about ~ 1 CPU-year for a whole run; producing 10...100Gb of data.

Target programmers

Scientific programmers use some science-domain specific language to write the "science" bit of their application (eg R for statistics, Root for particle physics).

They aren't "high performance" or "distributed system" programmers.

Want to help them use "big" systems to run their application - eg machines with 10^5 CPU cores.

Traditional MPI (Message Passing Interface) is hard to think about.

Swift tries to provide an easier model that still allows many applications to be expressed, and performed with reasonable efficiency.

SwiftScript is the language for programming in that model.

A brief history of SwiftScript

~2003: VDL - the Virtual Data Language, part of the Virtual Data System
express directed acyclic graphs of unix processes
processes take input and produce output through files
'virtual data' - when needed, materialise data either by copying from elsewhere or by deriving it from other data that is available
Lots of thinking about "graph transformations"

~2006: VDL2 (which became SwiftScript)
key feature: iterating over collections of files in the language
... but it also accidentally became (I think) Turing-complete

~2010: still going - language tweaks, scaling improvements

The language

This isn't a language tutorial.

I won't explain all the "obvious" looking language constructs like 1+1. Most of that is like "any other language" (though please ask questions).

Two main areas of interest are:

An example: sorting and counting lines in a text file

$ cat example.txt 
cat
cat
dog
cat
fish
fish
cat
cat

$ cat example.txt | sort | uniq -c
   5 cat
   1 dog
   2 fish

In real life, sort and uniq would be science application code taking many seconds or minutes to run (~5 mins in protein example), data files might be megabytes each, and we'd be running many more than 2 executables.

sort | uniq -c in SwiftScript

type file;

app (file o) sortfile(file i) {
  sort stdin=@i stdout=@o;
}

app (file o) count(file i) {
  uniq "-c" stdin=@i stdout=@o;
}

file input <"example.txt">; // must exist
file intermediate;
file output <"output.txt">;

intermediate = sortfile(input);
output = count(intermediate);

Running this example

$ swift example.swift
Swift svn swift-r3011 (swift modified locally) cog-r2410

RunID: 20100214-1510-ve05qoe4
Progress: 
Final status:  Finished successfully:2
$ cat output.txt 
   5 cat
   1 dog
   2 fish

Mappers and file types

file output <"output.txt">;

Declares output to be a variable whose value is stored in the file system rather than in-core.

<"output.txt"> means that the value is stored in a file output.txt (this can be a URL)

This is a simple example with a literal single filename. More complex syntax allows mapping arrays of files (shown later), with more dynamic behaviour (eg generating filename patterns at runtime)

We can omit the <...> mapping expression in which case Swift will make up a filename - useful for intermediate files.

app procedures

app (file o) count(file i) {
  uniq "-c" stdin=@i stdout=@o;
}

This is how the real work gets done - by getting some other science-domain specific program to do it.

app procedures execute unix processes, but not like system() or runProcess

The environment in which an app procedure runs is constrainted:
Application will start in "some directory, somewhere". There, it will find its input files, and there it should leave its output files.

Applications need to be referentially transparent (but SwiftScript doesn't clearly define what equivalence is)

Executing an app {} procedure

(it doesn't always work exactly like this, but this is good enough)

  1. Pick an execution site
  2. Transfer input files there (if they are not already cached there)
  3. Put the job in an execution queue at the execution site
  4. Wait for execution to finish
  5. Transfer output files back
  6. Check everything worked ok

app abstraction facilitates (1/2):
flexibility in execution site

There are many different execution resources in the world: clusters on your campus, supercomputers, your own laptop.

It is useful to be able to choose and switch between sites, and choose between different mechanisms for accessing a site, because:

app abstraction facilitates (2/2):
reliability mechanisms

Failure happens a lot in our target environments (integer percentages in some environments) so reliability is not "a nice feature to have" - it is essential.

Retries: if an application execution fails, we try it 2 more times

Restarts: if retries fail, then the whole script fails (eventually). Maybe want to restart manually where we left off. Assume that app blocks are expensive and everything else is cheap, so start the script from beginning again, skipped apps that we've already run (using a log file)

Replication: deals with a softer class of failure. Sometimes an app goes into a queue and sits "forever" (really forever, or perhaps much longer than most other apps). We can launch a new attempt to run the app, without killing the original. When one starts, we kill the other(s)

Invoking

So back to our example application... we've defined sortfile and count as unix programs, we've mapped input and output, and leave intermediate for Swift to map.

So now we can actually run something, using function-like syntax...

intermediate = sortfile(input);
output = count(intermediate);

... which hides all of the remote execution and reliability behaviour.

sort | uniq -c again...

type file;

app (file o) sortfile(file i) {
  sort stdin=@i stdout=@o;
}

app (file o) count(file i) {
  uniq "-c" stdin=@i stdout=@o;
}

file input <"example.txt">; // must exist
file intermediate;
file output <"output.txt">;

intermediate = sortfile(input);
output = count(intermediate);

The other key area of interest is...

Massive implicit parallelism

We can declare a mapped array of files: (eg mydata.*.img)

file inp[] <simple_mapper;
                     prefix="mydata.",
                     suffix=".img">;

and iterate over it:

foreach s,i in inp {
  out[i] = f(s); // same as out[i] = f(inp[i]);
}

All iterations can happen in parallel (subject to runtime limits, but could be many thousands of CPUs)

In real use, f might be an app procedure taking 30s, with 10^5 loop iterations.

Execution order is data dependency order

Everything can be executed in parallel, except where there are dataflow dependencies.

Dataflow dependencies are expressed by single assignment variables:

 int a;
 int b;
 int c;
 a = f(c);
 b = g(6);
 c = h(7);

Execution of f will be after h. Execution of g will be unordered wrt f and h.

Extends into (non-app) procedures.
Only concurrency control in SwiftScript - no locks, etc.
Assignment can be "in memory" or giving a file its content.

Arrays are not single-assignment

int a[];
int b;
a[0] = 128;
a[1] = 129;
b = sum(a); // pass in the whole array

a is not single assignment. But the elements of a are.

Static analysis of code to see which statements might write to a. When all potential writers are finished, then the array is "closed" for writing. Cannot modify an element once it has been assigned.

Arrays are "monotonic" - we know more over time, and once we know something, it is true forever. A weaker form of single assignment.

Array deadlock (and other problems)

But there are deadlocks (in practice, and maybe in theory?):

int a[];
foreach i in [1:10] {
 if (i < 9) {
   a[i] = 5;
 } else { // i==10
   int b;
   b = f(a);
 }
}

a will be closed when the whole foreach is finished... but the foreach will never finish because f(a) is waiting for a to close.

Leads to programmer confusion when overly conservative

More static+runtime analysis? Better structures/iterators? (map-like?)

NEWSFLASH: this morning, this particular deadlock was fixed (but others remain)

Performance

Metrics are related to scientific computing focus.

Mostly, what was done with app procedures by an application:

How many CPU-hours in total? (eg 208763 CPU-hours)

How many CPUs in use simultaneously? (eg 2000 CPUs)

In terms of language execution, interested in raw SwiftScript speed where it impedes the above: can we sustain 100 app block launches per second?

How short can you make your SwiftScript program? (so interesting to see how *few* lines of code are written in SwiftScript...)

Lots of stats in IEEE Computer article.

Less concrete idea (1): Provenance

provenance = record of the history of an artifact to help convince you that it is genuine/valuable

In Swift: record what output files were generated - which input files, which programs, where programs were run

Which datafiles used this site? (because we must discard any results from it)

Regenerate interesting results because we've damaged our copy

Functional/dataflow style helps there but many other issues.

Prototype implementation

Less concrete idea (2): Streaming Datasets

Processing datasets that grow over time - eg a database of fMRI images that is added to as new patients are seen.

Represent the database as a mapped array that is never closed.
We can iterate with foreach over that array, and leave the SwiftScript program running "forever"

Maybe no need to change the language definition much / at all

No implementation, only some mailinglist chatter

As an embedded language or library vs own language

Q: why did you implement this as a new language rather than embedding?

A: An accident of history - we started off making a glorified DAG description language, not a "real programming langauge"

But we can still wonder whether it would be a better or worse idea...

How would we implement:

for example: in Haskell or Java

Avoiding boilerplate for futures

Want to program using expression syntax closely connected to the high level application; avoid boilerplate mechanics.

SwiftScript:

int p = 5 + f();
int q = 100 + p;

Java?

Future rhs = f(); // assume returns fast
Future lhs = constant(5); // lift pure value
Future p = add(lhs,rhs); // add is future-aware +

Haskell?

p <- f `addM` (return 5)
let q = 100 + p  -- different from line above, but why?

Types

Prototype of type inference (part of a very successful Google Summer of Code project to strengthen Swift type checking)

'Street programmers' are not used to debugging type inference errors - especially the non-locality of error messages. This is a usability problem. So maybe type inference is a bad thing here?

But removing type annotations is 'scripty' - a good thing.

Do we have to have static (possibly inferred) types? No - its a dogma from early on in the development of SwiftScript that may or may not still be relevant.

Embedding would give more libraries

In the beginning we were expecting people to make things that looked almost like DAGs of programs but "a bit more interesting."

Now people want to do sin and cos and in-memory matrix multiplication

Its a hassle to add wire in existing libraries in other langauges for every new feature.

The implementation is not really suited to in-core data processing - even a single integer has a very large footprint (because originally our 'values' were mostly many-megabyte files, where overhead mattered less)

Areas that I've seen: parsing/printing data files; matrix operations; sin/cos

Would be great to easily import some other languages library collection

Summary of language

Abstracting files and application execution works very well for hiding away the dirty guts of what is going on.

Ugly in many places, but definitely good enough for what we originally intended.

Surprising (in a good way) that people want to do richer things, but both the language and the implementation have difficulty.

Credits and more info

SWFT group at University of Chicago Computation Institute

Swift website: http://www.ci.uchicago.edu/swift/

IEEE Computer article: Parallel scripting for applications at the petascale and beyond - November 2009; most numbers in this presentation are from there

More TODOs: syntax is fairly random (eg ambiguities and redundancies, although much tidier than before) - that comes from "prototype" syntax that we never tidied up. (similar reason to types being a bit of a mess)

Emphasis and define 'loosely coupled' (as being file-coupled, with large execution times on the order of minutes of components, and with communication only happening at those points)

In comparison with embedded DSL: would be nice to have bigger library, as people shift more of their application from the science-language to SwiftScript