About five months ago we decided that AWS Elastic Beanstalk would not be enough for our future needs as we scale Lifemote Networks, so we started moving our services to AWS ECS. Last week, as we moved our third service to ECS, we ran into a problem. We assign a separate AWS account and VPC to each customer, and in some of those accounts the newly deployed third service could not reach the previously deployed services. As a result, it was dropping requests.
We started by looking for anything odd in the infrastructure. Using CloudFormation drift detection, we confirmed that the stacks were identical in an affected account and in a working one. Then we realized that the ECS cluster in the problematic account had two EC2 instances spread across two availability zones, while the cluster without the problem had only one instance. This made us suspect communication between availability zones, but that turned out to be a red herring: we could not find any problems with the route tables or the NAT instances.
Finally, we concluded that it had to be something about security groups and that we should learn more about them. I must note here that our ECS cluster has only one security group, so all of our ECS EC2 instances share that same group. As expected, our problem was a misinterpretation of the “security group” concept.
You see, security groups are not real groups. Instances that have the same security group do not sit behind one shared rule set; each of them has its own identical copy of that rule set. In other words, two instances are not sitting together behind a single perimeter. Rather, each of them sits behind its own identical version of the same perimeter. As a result, if you want two instances that have the same security group to talk to each other, you need to make sure that the security group accepts connections from itself. It needs to be self-referencing because it is applied separately to each of those two instances. This problem had been hidden from us because our first two services were smaller and sat on the same machine. The calls between them never left the EC2 instance they were on and therefore never passed through a security group.
Luckily, the fix was an easy one, with a small caveat. It is very simple to add a self-referencing ingress rule to a security group, either with the API or through the web console. However, if you try to add it in a CloudFormation template as an inline rule on the security group itself, you'll most likely fail due to a circular dependency, because the group would have to reference itself while it is still being created. You should use a separate AWS::EC2::SecurityGroupIngress resource rather than an inline security group ingress rule.
SecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupName: security-group
    # GroupDescription is a required property; the text here is a placeholder
    GroupDescription: ECS cluster security group
    VpcId: !Ref VPC

# adding this as a separate resource to avoid a circular dependency
SelfReferenceRule:
  Type: AWS::EC2::SecurityGroupIngress
  Properties:
    IpProtocol: "-1" # -1 means all traffic
    SourceSecurityGroupId: !Ref SecurityGroup
    GroupId: !Ref SecurityGroup
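For contrast, here is a minimal sketch of the inline form that runs into the circular dependency. It is an illustration rather than our actual template: the description text is a placeholder, and `VPC` is assumed to be defined elsewhere in the stack, as in the template above.

```yaml
# Sketch only: this inline variant is expected to be rejected by CloudFormation.
SecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupDescription: Example group (placeholder text)
    VpcId: !Ref VPC
    SecurityGroupIngress:
      - IpProtocol: "-1" # all traffic
        # The group refers to itself inside its own definition,
        # so CloudFormation cannot resolve the reference.
        SourceSecurityGroupId: !Ref SecurityGroup
```

Moving the rule into its own AWS::EC2::SecurityGroupIngress resource breaks the cycle: the group is created first, and the rule that points back at it is attached afterwards.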