Discover more from Tech World With Milan Newsletter
This week’s issue brings you the following:
How Shopify builds resilient payment systems
Top 10 NASA rules for better coding
So, let’s dive in.
How Shopify Builds Resilient Payment Systems
On the last Black Friday, Shopify posted impressive results. Some details:
145 billion requests (~60 million per minute)
99.999+% uptime
5 TB/min of data served from across the infrastructure
MySQL 5.7 and 8 fleets handled over 19 million queries per second (QPS)
22 GB/sec of logs and 51.4 GB/sec of metrics data
Ingested 9 million spans a second of tracing data
Their Apache Kafka served 29 million messages per second at peak
Everything is run on Google Cloud
But how did they manage to do it? A recent article by Shopify Engineering explains their 10 most useful tips and tricks for building resilient payment systems:
Lower your timeouts. They suggest investigating and setting low timeouts everywhere possible. For instance, Ruby's built-in Net::HTTP client has a default timeout of 60 seconds each to open a connection, write data, and read a response. This is too long for online applications where a user is waiting.
Install Circuit Breakers. Circuit breakers, like Shopify's Semian, protect services by raising an exception once a service is detected as being down. This saves resources by not waiting for another timeout.
Understand capacity. The author discusses Little's Law, which states that the average number of customers in a system equals their average arrival rate multiplied by the average time each customer spends in the system. Understanding this relationship between queue size, throughput, and latency helps you design systems that handle load efficiently.
Add monitoring and alerting. Google's Site Reliability Engineering (SRE) book lists four golden signals that a user-facing system should be monitored for: latency, traffic, errors, and saturation. Monitoring these metrics can help identify when a system is at risk of going down due to overload.
Implement structured logging. They recommend logging in a machine-readable format, like key=value pairs or JSON, so that log aggregation systems can parse and index the data. Correlation IDs passed along with the API calls make it possible to find all related logs for a payment attempt.
Use Idempotency Keys. To ensure a payment or refund happens exactly once, they recommend using idempotency keys, which track attempts and ensure that only a single request is sent to financial partners.
Be consistent with reconciliation. Reconciliation ensures that records are consistent with those of financial partners. Any discrepancies are recorded and automatically remediated where possible.
Incorporate load testing. Regular load testing helps test system limits and protection mechanisms by simulating large-volume flash sales. Shopify uses scriptable load balancers to throttle the number of checkouts happening at any time.
Get on top of incident management. Shopify uses a Slack bot to manage incidents, with roles for coordinating the incident, public communication, and restoring stability. This process starts when the on-call service owners get paged, either by an automatic alert based on monitoring or by hand, if someone notices a problem.
Organize incident retros. Retrospective meetings are held within a week after an incident to understand what happened, correct incorrect assumptions, and prevent the same thing from happening again.
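The first two tips (low timeouts and circuit breakers) can be illustrated with a minimal sketch. This is not Semian's API and not Shopify's code: the class name `CircuitBreaker`, its parameters `error_threshold` and `reset_after`, and the injectable `clock` are all hypothetical names chosen for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call a service it believes is down."""


class CircuitBreaker:
    """Hypothetical minimal circuit breaker (illustrative only, not Semian)."""

    def __init__(self, error_threshold=3, reset_after=10.0, clock=time.monotonic):
        self.error_threshold = error_threshold  # consecutive failures before opening
        self.reset_after = reset_after          # seconds to wait before a trial call
        self.clock = clock                      # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                   # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of waiting for yet another timeout.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens
            # the circuit and restarts the cool-down period.
            if self.opened_at is not None or self.failures >= self.error_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


if __name__ == "__main__":
    def payment_provider_down():
        # Stand-in for a remote call that hit its (low) timeout.
        raise TimeoutError("provider did not answer in time")

    breaker = CircuitBreaker(error_threshold=2, reset_after=5.0)
    for attempt in range(3):
        try:
            breaker.call(payment_provider_down)
        except (TimeoutError, CircuitOpenError) as error:
            print(attempt, type(error).__name__)
```

The wrapped call is expected to carry its own short timeout; the breaker only decides whether it is worth making the call at all.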
To learn more about Shopify architecture, check here.
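As a rough illustration of the idempotency-key tip above, here is a small in-memory sketch. The class `IdempotentPayments`, the `charge_partner` callable, and the key format are assumptions made for this example; a real system would keep the key-to-result mapping in durable storage rather than a dictionary.

```python
import threading


class IdempotentPayments:
    """Hypothetical wrapper that sends each idempotency key to the partner once."""

    def __init__(self, charge_partner):
        self._charge_partner = charge_partner  # callable doing the real charge
        self._results = {}                     # idempotency key -> stored result
        self._lock = threading.Lock()

    def charge(self, idempotency_key, amount_cents):
        # Holding the lock during the partner call keeps the sketch simple:
        # two concurrent retries with the same key cannot both reach the partner.
        # It also serializes all charges, which a real system would avoid.
        with self._lock:
            if idempotency_key in self._results:
                # A retry of an attempt we already processed: replay the result.
                return self._results[idempotency_key]
            result = self._charge_partner(amount_cents)
            self._results[idempotency_key] = result
            return result


if __name__ == "__main__":
    def fake_partner(amount_cents):
        # Stand-in for the financial partner's API.
        print("charging", amount_cents)
        return {"status": "charged", "amount_cents": amount_cents}

    payments = IdempotentPayments(fake_partner)
    payments.charge("checkout-42", 2500)
    payments.charge("checkout-42", 2500)  # client retry: partner is not called again
```

The client generates the key once per payment attempt and reuses it on every retry, so a timeout followed by a retry cannot charge the customer twice.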
Top 10 NASA Rules for Better Coding
The Power of 10 rules were formulated in 2006 by Gerard J. Holzmann of the NASA/JPL Laboratory for Reliable Software, aiming to eliminate certain C coding practices that make code hard to review or statically analyze. These rules are part of the larger set of JPL coding standards.
The rules are:
Avoid Complex Flow. Steer clear of tricky control structures such as goto and recursion; stick to simple loops and conditionals.
Bound Loops. Give every loop a fixed upper bound to prevent endless looping.
Avoid Heap Allocation. Favor stack or static memory allocation to dodge memory leaks.
Use Short Functions. Keep functions concise, handling a single task. This goes well with Clean Code practices (the Single Responsibility Principle).
Runtime Assertions. Utilize assertions to catch unexpected conditions.
Limited Data Scope. Keep the scope narrow to maintain clarity. Use the smallest scope for your variables (e.g., private or protected in C#).
Check Return Values. Always check the return values of functions, handling any errors.
Sparse Preprocessor Use. Minimize preprocessor directives for readability.
Limit Pointer Use. Simplify pointer use and avoid function pointers for clearer code.
Compile With All Warnings Enabled. Address all compiler warnings to catch potential issues early. This is often neglected in many projects!
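The rules target C, but several of them carry over to other languages. The sketch below adapts three of them (bounded loops, runtime assertions, and checked return values) to Python. It is an illustration written for this newsletter, not code from Holzmann's paper; `bounded_sqrt` and `MAX_STEPS` are hypothetical names.

```python
MAX_STEPS = 50  # fixed upper bound on loop iterations (the "Bound Loops" rule)


def bounded_sqrt(x, tolerance=1e-9):
    """Approximate sqrt(x) with Newton's method.

    Returns (ok, value). ok is False when the loop bound was reached
    before two successive guesses came within `tolerance` of each other.
    """
    # Runtime assertions document and check the preconditions.
    # Note: Python skips assert statements when run with the -O flag.
    assert x >= 0, "precondition: x must be non-negative"
    assert tolerance > 0, "precondition: tolerance must be positive"

    if x == 0:
        return True, 0.0

    guess = x if x >= 1 else 1.0
    for _ in range(MAX_STEPS):  # cannot loop forever
        next_guess = 0.5 * (guess + x / guess)
        if abs(next_guess - guess) <= tolerance:
            assert next_guess > 0, "postcondition: result must be positive"
            return True, next_guess
        guess = next_guess
    return False, guess


if __name__ == "__main__":
    # The caller checks the returned status instead of assuming success.
    ok, root = bounded_sqrt(2.0)
    if ok:
        print("sqrt(2) is approximately", root)
    else:
        print("did not converge within", MAX_STEPS, "steps")
```

Returning an explicit status forces the caller to decide what happens when the bound is hit, which is the point of the "Check Return Values" rule.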
BONUS: Backend burger
Check out this Full Back-end Roadmap.
One bit that is not mentioned: they achieved all of that with good DevOps practices (that is mentioned) AND a Ruby on Rails backend.
I think a robust team/company is the one that questions the status quo.
You mention how they lowered the default Ruby HTTP client connection timeout. Even if the client had a value they were fine to use, it's important to notice those things instead of just assuming "it will work".
A lot of systems have gone down in retry storms because nobody questioned the retry strategies on systems with a very deep chain of calls.
It all starts with critical thinking and not making assumptions.