We've recently released a few important features that make a big practical difference to the reliability and scalability of Idealstack clusters.
A lot of people trial Idealstack on t2.micro instances because they're available for free on the AWS free tier. These instances are tiny, though: a single throttled CPU and 1GB of RAM with no swap.
We find that Amazon's 'ECS-optimised AMI' has a lot of bad failure modes when running ECS on these instances. The instances simply 'lock up' when memory becomes constrained. We've tried a lot of things to resolve these lockups, but the normal Linux mechanisms, like the OOM killer or rebooting on OOM, don't seem to save them. When an instance locks up you can't SSH to it and it doesn't respond, yet it still appears healthy by other measures: it still has free RAM, and some of the services running on it still work. Even rebooting doesn't help, as the instance is too sick to shut down. We have never seen this behaviour on larger EC2 instances. It seems to be specific to the t2.micro instance class, although that might just be because the problem is rarer on larger instances.
In the end, what we've done is create a 'health check' container that listens on a certain port and is added to a 'fake' target group in AWS, which the autoscaling group then uses as its health check. This seems to be working very well in our own ECS deployments. Having the health check run a PHP page seems sufficient to detect these lockups. Note that simply serving a static file was not sufficient: we think the health check needs to start a new process each time it runs, as that's what fails when these machines lock up. If you are running ECS in production and seeing similar problems, send us an email and we'll happily share more technical details on this solution.
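To illustrate the idea of a health check that starts a new process on every request, here is a minimal sketch in Python. It is a stand-in for illustration only, not the PHP page our health check container actually runs. The port, the timeout, and the names `can_spawn_process`, `HealthHandler` and `serve` are all assumptions made up for this example.

```python
import subprocess
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical values: the real port and timeout are not stated in this post.
PORT = 8080
SPAWN_TIMEOUT_SECONDS = 5


def can_spawn_process(timeout=SPAWN_TIMEOUT_SECONDS):
    """Return True if a brand-new child process starts and exits cleanly in time."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", "print('ok')"],
            capture_output=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Could not fork/exec, or the child did not finish in time.
        return False
    return result.returncode == 0 and result.stdout.strip() == b"ok"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Spawn a fresh process per request instead of serving a static reply.
        healthy = can_spawn_process()
        body = b"ok\n" if healthy else b"unhealthy\n"
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the container's output quiet


def serve(port=PORT):
    """Listen on all interfaces so the load balancer can reach the check."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()

# Call serve() from the container's entrypoint to start listening.
```

If the instance can no longer start processes, the request either returns a 503 or never completes, and in both cases the target group's health check would fail, which is the signal the autoscaling group needs to replace the instance.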
Long story short: if a server instance goes haywire, it is now automatically marked as dead, a new instance is booted, its sites are migrated to the new instance, and the old one is destroyed.
Idealstack runs each site in an isolated Docker container, which has a lot of benefits in terms of security, the ability to run different PHP versions and platforms on one machine, and so on. The only downside is that you have to replicate a lot of services in each container, which increases memory usage.
Idealstack pushes your website's access and error logs to AWS CloudWatch, which is a very powerful feature because the logs can then be searched, have metrics created from them, and so forth. The downside is that the AWS Logs agent needed to do this is a fairly memory-hungry Python program. We've moved it into a single container per instance, rather than running a copy of it per site. When you are packing a large number of sites onto each instance, or using small t2.micro instances, this markedly reduces memory usage and leaves more memory available to service web requests.