Copy-On-Write 101 – Part 3: The Sum of Its Parts

by Ville Laurikari on Tuesday, July 21, 2009

Read the intro, part 1, and part 2 first.

This is a description of a fictitious programming language. It is purposefully vague here and there. Nevertheless, I don’t think I’m exactly reaching for the moon here. Or perhaps I am, you tell me. Point is, this is just fantastic, wholesome, and thoroughly enjoyable academic masturbation. Shall we begin?

We need a name for the language, so let’s call it Shark.

“As far non-existent programming languages go, Shark is the dog’s bollocks.”

Shark uses a purely copy-on-write data model. By this I mean that, as far as the programmer can tell, values are never modified. Once you create an integer, a string, a list, or a forest of suffix trees containing all the words in the Klingon language, you cannot modify it.

Shark is, of course, garbage collected. Shark is also strongly typed, supports both static and dynamic typing, and has type inference. You have your lambdas, partial evaluation, algebraic data types, pattern matching, a fantastic module system, and a clean and concise but not too unfamiliar syntax. As far non-existent programming languages go, Shark is the dog’s bollocks.

Shark has lightweight threads. Spawning new threads is almost free (only a handful of instructions on the CPU). Threads are automatically distributed to your CPUs and cores. Just to be absolutely crystal clear, Shark threads are not the same as operating system threads such as POSIX threads. Shark threads communicate exclusively through message passing. Each thread has a message queue from which it can read incoming messages when it wants to.

Shark eats LINQ to SQL for breakfast.

There are no locks, condition variables, and no modifiable shared resources. When threads share data they do it explicitly, in a controlled fashion, for selected data only. Shark comes with built-in support for software transactional memory. You can easily build databases from whatever data structures are the most natural for your application. You can kiss all that RDBMS crap goodbye. No more shoehorning your data into an awkward tables-and-columns format, no more marshalling and demarshalling your data to whatever data types the database happens to support, and no more SQL.

The natural model to maintain state in Shark programs is to have a single thread govern the evolution of that state. The thread can then pass around copies of the state, maybe receive operations from other threads to update the state, and so on. Typically the state takes the form of an STM database which can be updated transactionally from multiple threads concurrently.

Just one more thing. Shark STM databases can optionally be stored on disk. It’s so efficient that you can run tens of thousands of transactions per second and have them checkpointed to disk a dozen times per second. This gets you the Durability in ACID, the one piece that regular STM normally doesn’t provide. Even though the entire database must reside in main memory, you’ll need to have a pretty big database for that to become a problem. RAM is insanely cheap nowadays.

So, that’s my hallucinogen-induced (not really) vision of the perfect solution for concurrent programming and database management. Yeah, I know creating your own programming language is basically never the right thing to do. In my defense, at least I haven’t implement Shark. Not yet, at least…

Related posts:

  1. I Have Seen the Future, and It is Copy-On-Write
  2. Copy-On-Write 101 – Part 2: Putting It to Use
  3. Copy-On-Write 101 – Part 1: What Is It?
  4. The Three Doors of Type Systems
  5. Programming Problems in Disguise

If you liked this, click here to receive new posts in a reader.
You should also follow me on Twitter here.

Comments on this entry are closed.

{ 10 comments }

Vagn Johansen July 22, 2009 at 18:56

I don’t understand the paragraph with “The natural model..”. How does this happen “concurrently”? Via messages to the owner thread?

Ville Laurikari July 22, 2009 at 22:05

Vagn, that paragraph could have been more clear, for sure. I’ll try
to clarify how STM works with copy-on-write data structures.

For an STM database there is one “owner” thread which keeps the
latest snapshot of the database. Other “worker” threads can request
the latest snapshot to work on. They do this by sending a message to
the owner which then replies back with the snapshot.

The worker thread then goes off an does whatever it wants to with the
database, possibly updating it. When it has finished with the
transaction, the worker sends its modified database back to the owner,
along with a log of all reads and writes it performed.

The owner thread will then replay the log with the actual latest
database snapshot. If there are no conflicts, the owner replaces
the latest snapshot with the one resulting from replaying the log. If
there are conflicts, the owner keeps the old snapshot and thus
discards all updates from the worker. The worker thread can then
retry the transaction again with the new latest snapshot.

The owner thread can take some shortcuts. If there were no other
concurrent updates, it can directly accept the new database snapshot
received from the worker. If the worker did not do any writes there is no need for conflict resolution.

Does this answer your question?

Vagn Johansen July 23, 2009 at 11:04

Yes, almost. The explanation of the replay log explains why you call
it an STM database. It is like a small program that can be
replayed/re-run when a conlict is detected.

The way you describe it, the owner thread can update the database and
at the same time a worker thread can traverse the database (via a
different snapshot). Being a copy-on-write data structure the owner
thread can delete a node that the worker thread is trying to traverse
due to the node sharing (assuming something like the tree in part1).

I assume this means that the data structure for the STM database has
some form of thread-safety built-in? And regarding the conflict detection
does it happens both during replays on the owner thread and during
traversals in the worker thread?

Ville Laurikari July 23, 2009 at 14:31

All data structures in Shark are copy-on-write. Because existing data structures are never modified in-place, everything is automatically thread safe. So yes, the data structure for the STM database is thread safe, like all other data structures in Shark.

Conflict detection only happens when replaying the log. The worker threads are oblivious to the fact that other worker threads may be concurrently using the same database. They’ll find out later if there was a conflict.

alxtoth August 10, 2009 at 23:07

Hi,

Apparently you like LISP-like languages . Check out Clojure (http://www.clojure.org), it has “transactional memory”

-Alex

Ville Laurikari August 11, 2009 at 17:12

@axltoth, thanks. You’re not the only one to mention Clojure (see here). Clojure doesn’t have automatically disk-backed data structures, Shark does. On the other hand, Shark doesn’t exist… so I guess Clojure wins, eh?

Alain January 11, 2011 at 08:55

I share your vision: the future of multi-processing is copy-on-write. Transactional memory is a good start, but there’s more to it than that.

Todd hubers September 5, 2012 at 04:26

Love it. I have found myself in the last month looking deeper into “Snapshot isolation” and “Multiversion concurrency control”. Your descriptions of COW appear to go further and more beneficial.

I have also been considering writing a .NET DB in memory, or direct to disk. I am also a big fan of the Singularity research by Microsoft, whereby messages are passed and shared memory can be referenced (relinquished upon messaging, and enforced by install-time and even runtime contract inspection).

Your vision of “Shark” sounds great. However there would still need to be an SQL interface for adhoc reporting. But that need not be a hurdle, merely just another interface to your application (just like WebServices etc), with a somewhat common SQL engine library to import.

I think we are on the cusp of major changes in the way software works, but it requires the big players to become involved and all the right pieces coming into place. We can’t alienate all of the old-school programmers with their legacy software babies.

Bryan Ewbank December 12, 2013 at 19:50

Thanks for the COW overview. Are you still blogging somewhere else? Your insights and style are valuable for this old hacker.

Ville Laurikari December 13, 2013 at 20:44

Thanks, Bryan! I’m not blogging anywhere else at the moment. I’ve been bootstrapping my own business(es) and might take up blogging again some time in the future, to write about what I’ve learned along the way.

Previous post:

Next post: