Disks were a performance bottleneck.
It was a bit of a surprise to me, but having now monitored a few sessions with a decent number of pilots (~15), I've found that disks are the performance bottleneck rather than CPU or memory. In the sim we experienced some pauses and desyncing, and the monitoring showed a corresponding sudden increase in disk activity together with a dip in network activity. My interpretation is that the sim decided it needed some data from disk and wasn't prepared to continue until it had retrieved it.
This fits with the reality of cloud computing, in that disks are typically made available as logical network drives as opposed to directly attached physical disks. Whilst throughput isn't an issue, network disks can't compete with directly attached disks for latency, simply because the signal to a directly attached disk has less distance to travel. Also, the way AWS Elastic Block Store (EBS) works when it's (lazily) provisioned from a snapshot means that its latency will be higher for the first read from each section of the disk. It's this disk I/O latency that I think has been the cause of the pauses and stutters to date. The good news is that directly attached storage is available, though that comes with its own challenges.
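One rough way to look for this first-read effect is to time each chunk as a file is read from start to finish. The sketch below is only an illustration under assumptions: `time_chunk_reads` is a hypothetical helper and the 1 MiB chunk size is an arbitrary choice. On a volume freshly restored from a snapshot, the first pass would be expected to show slower chunks than a second pass, although the OS page cache also speeds up repeat reads, so the numbers are only a rough indication.

```python
import time


def time_chunk_reads(path, chunk_size=1024 * 1024):
    """Read `path` sequentially and return the latency (seconds) of each chunk read.

    Hypothetical helper for illustration; chunk_size is an arbitrary default.
    """
    latencies = []
    # buffering=0 avoids Python-level buffering, so each read() goes to the OS.
    with open(path, "rb", buffering=0) as f:
        while True:
            start = time.perf_counter()
            data = f.read(chunk_size)
            elapsed = time.perf_counter() - start
            if not data:
                break  # end of file
            latencies.append(elapsed)
    return latencies
```

Comparing `max(latencies)` between a first and second pass over the same large file gives a feel for how much the slowest first read costs.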
The main challenge with attached disks is that they need to be fully recreated each time a new instance is started. Since a standalone installation of DCS is ~50 GB, this costs time, so the tradeoff becomes a longer startup for a smoother running mission. With some work, though, I think I can keep the startup time manageable. This also means that I'll stop offering the smaller instance types that don't come with attached disks.
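To put a rough number on that startup cost, here is a back-of-the-envelope sketch. The throughput figures are assumptions picked purely for illustration, not measurements, and `copy_time_seconds` is a hypothetical helper. It treats the ~50 GB as decimal gigabytes and ignores any decompression or per-file overhead.

```python
def copy_time_seconds(size_gb, throughput_mb_per_s):
    """Estimate seconds to copy size_gb (decimal GB) at a sustained MB/s rate."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_gb * 1000 / throughput_mb_per_s


# Assumed sustained throughputs in MB/s, for illustration only.
for mb_per_s in (125, 250, 500):
    seconds = copy_time_seconds(50, mb_per_s)
    print(f"{mb_per_s} MB/s: about {seconds / 60:.1f} minutes")
```

Under these assumptions the restore adds minutes rather than seconds to each startup, which is why the copy step is the thing to optimise.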