Ben Clifford
benc@hawaga.org.uk
Presented at
Universiteit Utrecht Software Technology Colloquium
Feb 18th 2010
Subsequently amended
Press space or PgDown or right-arrow for the next slide
Ben Clifford
Formerly a staff programmer working on Swift, at the University of Chicago Computation Institute (2006-2009)
Swift is a system for the rapid and reliable specification, execution, and management of large-scale science and engineering workflows. It supports applications that execute many tasks coupled by disk-resident datasets - as is common, for example, when analyzing large quantities of data or performing parameter studies or ensemble simulations.
For example:
Cancer research: looking for previously unknown protein changes by comparing mass spectrum data with data known about proteome.
A monte-carlo simulation of protein folding, 10 proteins, 1000 simulations for each configuration, inside simulated annealing algorithm with 2x5=10 different parameter values. Each simulation component takes ~ 5 CPU-minutes, so about ~ 1 CPU-year for a whole run; producing 10...100Gb of data.
Scientific programmers use some science-domain specific language to write the "science" bit of their application (eg R for statistics, Root for particle physics).
They aren't "high performance" or "distributed system" programmers.
Want to help them use "big" systems to run their application - eg machines with 10^5 CPU cores.
Traditional MPI (Message Passing Interface) is hard to think about.
Swift tries to provide an easier model that still allows many applications to be expressed, and performed with reasonable efficiency.
SwiftScript is the language for programming in that model.
~2003: VDL - the Virtual Data Language, part of the Virtual Data System
express directed acyclic graphs of unix processes
processes take input and produce output through files
'virtual data' - when needed, materialise data either by
copying from elsewhere or by deriving it from other data that is
available
Lots of thinking about "graph transformations"
~2006: VDL2 (which became SwiftScript)
key feature: iterating over collections of files in the language
... but it also accidentally became (I think) Turing-complete
~2010: still going - language tweaks, scaling improvements
This isn't a language tutorial.
I won't explain all the "obvious" looking language constructs like
1+1
. Most of that is like "any other language"
(though please ask questions).
Two main areas of interest are:
$ cat example.txt cat cat dog cat fish fish cat cat $ cat example.txt | sort | uniq -c 5 cat 1 dog 2 fish
In real life, sort
and uniq
would be
science application code taking many seconds or minutes to run
(~5 mins in protein example), data files might be megabytes each, and
we'd be running many more than 2 executables.
sort | uniq -c
in SwiftScripttype file; app (file o) sortfile(file i) { sort stdin=@i stdout=@o; } app (file o) count(file i) { uniq "-c" stdin=@i stdout=@o; } file input <"example.txt">; // must exist file intermediate; file output <"output.txt">; intermediate = sortfile(input); output = count(intermediate);
$ swift example.swift
Swift svn swift-r3011 (swift modified locally) cog-r2410 RunID: 20100214-1510-ve05qoe4 Progress:
Final status: Finished successfully:2
$ cat output.txt 5 cat 1 dog 2 fish
file output <"output.txt">;
Declares output
to be a variable whose value is stored in
the file system rather than in-core.
<"output.txt">
means that the value is stored in
a file output.txt
(this can be a URL)
This is a simple example with a literal single filename. More complex syntax allows mapping arrays of files (shown later), with more dynamic behaviour (eg generating filename patterns at runtime)
We can omit the <...>
mapping expression in which case
Swift will make up a filename - useful for intermediate files.
app
proceduresapp (file o) count(file i) { uniq "-c" stdin=@i stdout=@o; }
This is how the real work gets done - by getting some other science-domain specific program to do it.
app
procedures execute unix processes, but not like
system()
or runProcess
The environment in which an app
procedure runs is
constrainted:
Application will start in "some directory, somewhere". There, it will find its input files, and there it should leave its output files.
Applications need to be referentially transparent (but SwiftScript doesn't clearly define what equivalence is)
(it doesn't always work exactly like this, but this is good enough)
app
abstraction facilitates (1/2):There are many different execution resources in the world: clusters on your campus, supercomputers, your own laptop.
It is useful to be able to choose and switch between sites, and choose between different mechanisms for accessing a site, because:
app
abstraction facilitates (2/2):Failure happens a lot in our target environments (integer percentages in some environments) so reliability is not "a nice feature to have" - it is essential.
Retries: if an application execution fails, we try it 2 more times
Restarts: if retries fail, then the whole script fails (eventually).
Maybe want to restart manually where we left off. Assume that app
blocks are expensive and everything else is cheap, so start the script from beginning again, skipped apps that we've already run (using a log file)
Replication: deals with a softer class of failure. Sometimes an app goes into a queue and sits "forever" (really forever, or perhaps much longer than most other apps). We can launch a new attempt to run the app, without killing the original. When one starts, we kill the other(s)
app
blocks are expensive and
everything else is cheap" is not such a good assumption as SwiftScript
programs become larger. In the past it was a problem with large machine
generated SwiftScripts (where the compiler took minutes to run). Its also
a problem where mappers are expensive (mail thread), and in general
will be a problem wherever real computation is done inside Swift.
So back to our example application... we've defined sortfile
and count
as unix programs, we've mapped input
and output
, and leave intermediate
for Swift
to map.
So now we can actually run something, using function-like syntax...
intermediate = sortfile(input); output = count(intermediate);
... which hides all of the remote execution and reliability behaviour.
sort | uniq -c
again...type file; app (file o) sortfile(file i) { sort stdin=@i stdout=@o; } app (file o) count(file i) { uniq "-c" stdin=@i stdout=@o; } file input <"example.txt">; // must exist file intermediate; file output <"output.txt">; intermediate = sortfile(input); output = count(intermediate);
The other key area of interest is...
We can declare a mapped array of files: (eg mydata.*.img
)
file inp[] <simple_mapper; prefix="mydata.", suffix=".img">;
and iterate over it:
foreach s,i in inp { out[i] = f(s); // same as out[i] = f(inp[i]); }
All iterations can happen in parallel (subject to runtime limits, but could be many thousands of CPUs)
In real use, f
might be an app
procedure taking
30s, with 10^5 loop iterations.
Everything can be executed in parallel, except where there are dataflow dependencies.
Dataflow dependencies are expressed by single assignment variables:
int a; int b; int c; a = f(c); b = g(6); c = h(7);
Execution of f
will be after h
. Execution of
g
will be unordered wrt f
and h
.
Extends into (non-app
) procedures.
Only concurrency control in SwiftScript - no locks, etc.
Assignment can be "in memory" or giving a file its content.
int a[]; int b; a[0] = 128; a[1] = 129; b = sum(a); // pass in the whole array
a
is not single assignment. But the elements of a
are.
Static analysis of code to see which statements might write to a
. When all potential writers are finished, then the array is "closed" for writing. Cannot modify an element once it has been assigned.
Arrays are "monotonic" - we know more over time, and once we know something, it is true forever. A weaker form of single assignment.
a[1]
and a[76]
. This makes it confusing to ask "what is
the laste element in an array?" because that question has multiple meanings:
"if I want to sequentially refer to each element in the array, I should run
an integer sequence over what range of values?" (in which case the answer is,
"use foreach"); "I want to access the last element in the array" (in which
case there is no clear answer).But there are deadlocks (in practice, and maybe in theory?):
int a[]; foreach i in [1:10] { if (i < 9) { a[i] = 5; } else { // i==10 int b; b = f(a); } }
a
will be closed when the whole foreach
is
finished... but the foreach
will never finish because
f(a)
is waiting for a
to close.
Leads to programmer confusion when overly conservative
More static+runtime analysis? Better structures/iterators? (map-like?)
NEWSFLASH: this morning, this particular deadlock was fixed (but others remain)
@filename
can give the
filename associated with a mapped array whether its an input or an output.
There is another library function @filenames
which is not so
straightforward. If you ask for @filenames(a)
for some mappers,
the number of files is not always known, so you can't genate the filenames
list ahead of time. (for example, a mapper that maps array elements to
file0000.txt
, file0001.txt
, ...app
procedures cannot
return a variable number of files.@filenames
to get the filenames of a mapped output array.
Metrics are related to scientific computing focus.
Mostly, what was done with app
procedures by an application:
How many CPU-hours in total? (eg 208763 CPU-hours)
How many CPUs in use simultaneously? (eg 2000 CPUs)
In terms of language execution, interested in raw SwiftScript speed where
it impedes the above: can we sustain 100 app
block launches per
second?
How short can you make your SwiftScript program? (so interesting to see how *few* lines of code are written in SwiftScript...)
Lots of stats in IEEE Computer article.
provenance = record of the history of an artifact to help convince you that it is genuine/valuable
In Swift: record what output files were generated - which input files, which programs, where programs were run
Which datafiles used this site? (because we must discard any results from it)
Regenerate interesting results because we've damaged our copy
Functional/dataflow style helps there but many other issues.
Prototype implementation
Processing datasets that grow over time - eg a database of fMRI images that is added to as new patients are seen.
Represent the database as a mapped array that is never closed.
We can iterate with foreach
over that array, and leave the SwiftScript program running "forever"
Maybe no need to change the language definition much / at all
No implementation, only some mailinglist chatter
Q: why did you implement this as a new language rather than embedding?
A: An accident of history - we started off making a glorified DAG description language, not a "real programming langauge"
But we can still wonder whether it would be a better or worse idea...
How would we implement:
for example: in Haskell or Java
Want to program using expression syntax closely connected to the high level application; avoid boilerplate mechanics.
SwiftScript:
int p = 5 + f(); int q = 100 + p;
Java?
Futurerhs = f(); // assume returns fast Future lhs = constant(5); // lift pure value Future p = add(lhs,rhs); // add is future-aware +
Haskell?
p <- f `addM` (return 5) let q = 100 + p -- different from line above, but why?
Prototype of type inference (part of a very successful Google Summer of Code project to strengthen Swift type checking)
'Street programmers' are not used to debugging type inference errors - especially the non-locality of error messages. This is a usability problem. So maybe type inference is a bad thing here?
But removing type annotations is 'scripty' - a good thing.
Do we have to have static (possibly inferred) types? No - its a dogma from early on in the development of SwiftScript that may or may not still be relevant.
@strcat
will take any number of parameters, and each can be
of any type (that can be reasonably represented as a string - for example,
a numeric type). This is useful from a scripty perspective, but awkwardly
fits the model of having strong typing.In the beginning we were expecting people to make things that looked almost like DAGs of programs but "a bit more interesting."
Now people want to do sin
and cos
and
in-memory matrix multiplication
Its a hassle to add wire in existing libraries in other langauges for every new feature.
The implementation is not really suited to in-core data processing - even a single integer has a very large footprint (because originally our 'values' were mostly many-megabyte files, where overhead mattered less)
Areas that I've seen: parsing/printing data files; matrix operations; sin/cos
Would be great to easily import some other languages library collection
Abstracting files and application execution works very well for hiding away the dirty guts of what is going on.
Ugly in many places, but definitely good enough for what we originally intended.
Surprising (in a good way) that people want to do richer things, but both the language and the implementation have difficulty.
SWFT group at University of Chicago Computation Institute
Swift website: http://www.ci.uchicago.edu/swift/
IEEE Computer article: Parallel scripting for applications at the petascale and beyond - November 2009; most numbers in this presentation are from there
More TODOs: syntax is fairly random (eg ambiguities and redundancies, although much tidier than before) - that comes from "prototype" syntax that we never tidied up. (similar reason to types being a bit of a mess)
Emphasis and define 'loosely coupled' (as being file-coupled, with large execution times on the order of minutes of components, and with communication only happening at those points)
In comparison with embedded DSL: would be nice to have bigger library, as people shift more of their application from the science-language to SwiftScript