I think a robust team/company is one that questions the status quo.
You mention how they lowered the Ruby HTTP client's default connection timeout. Even if the default had been a value they could live with, it's important to notice these things instead of just assuming "it will work".
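For anyone who hasn't done this before, here is a minimal sketch of what setting explicit timeouts can look like. I'm assuming the client is Net::HTTP from the standard library; the host and the timeout values are made up for illustration and are not the ones from the post.

```ruby
require "net/http"

# Hypothetical host. Net::HTTP.new only builds the object;
# no connection is opened until a request is actually made.
http = Net::HTTP.new("example.com", 80)

# Illustrative values only; pick yours from measured latencies.
http.open_timeout = 2 # seconds to wait while opening the connection
http.read_timeout = 5 # seconds to wait for each read of the response
```

The point is less the specific numbers and more that someone chose them on purpose.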
A lot of systems have gone down in retry storms because nobody questioned the retry strategies on systems with a very deep chain of calls.
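To put rough numbers on that, here is a small sketch of the worst case. It assumes every layer in the chain independently makes up to the same number of attempts against the layer below it; the helper name and the numbers are hypothetical.

```ruby
# Worst-case number of requests that reach the deepest service when
# every layer in the call chain makes up to `attempts_per_layer`
# attempts (first try plus retries) against the layer below it.
def worst_case_calls(attempts_per_layer, depth)
  attempts_per_layer**depth
end

# With 3 attempts per layer, the load on the bottom service grows
# multiplicatively with the depth of the chain.
[1, 2, 3, 4].each do |depth|
  puts "depth #{depth}: up to #{worst_case_calls(3, depth)} calls at the bottom"
end
```

So a single user request through four layers can turn into as many as 81 calls on a service that is already struggling, which is why a retry policy that looks harmless in one service can be dangerous across the whole chain.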
It all starts with critical thinking and not making assumptions.
One bit that is not mentioned: they achieved all of that with good DevOps practices (which is mentioned) AND a Ruby on Rails backend.