A deployment of our API at 8:15 PDT triggered downscaling of the service’s available pods during a period of high traffic. Once the new pods were ready, our autoscaling took approximately 5 minutes to scale the service back up to the number of pods needed to handle all the traffic, and during that window we saw a high rate of 502 errors.
In response to this issue, we have improved our autoscaling logic to spin up 10 new instances before switching traffic to the new deployment version.
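As a rough sketch of what that pre-scaling step can look like, assuming the service runs behind a Kubernetes HorizontalPodAutoscaler (the deployment name, namespace, and headroom value below are hypothetical, not our production configuration), the deploy pipeline raises the autoscaler's replica floor before the traffic switch:

```python
# Sketch: pin the HPA minimum above current load before a rollout,
# so the new version never starts below the capacity serving live traffic.
# Names and values are illustrative placeholders.
from kubernetes import client, config


def prescale_before_rollout(namespace: str, hpa_name: str, headroom: int = 10) -> None:
    """Raise the HPA floor so extra pods are ready before traffic shifts."""
    config.load_kube_config()  # use load_incluster_config() when run in-cluster
    autoscaling = client.AutoscalingV1Api()

    hpa = autoscaling.read_namespaced_horizontal_pod_autoscaler(hpa_name, namespace)
    current = hpa.status.current_replicas or hpa.spec.min_replicas

    # Floor = current replicas + headroom, so the rollout cannot
    # downscale below what live traffic needs plus a safety margin.
    hpa.spec.min_replicas = current + headroom
    autoscaling.patch_namespaced_horizontal_pod_autoscaler(hpa_name, namespace, hpa)


# e.g. prescale_before_rollout("production", "api-hpa", headroom=10)
```

Once the new version reports healthy and traffic has fully switched over, the floor is restored to its normal value so the autoscaler can scale down as usual.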