Architecture
TODO look at:
- https://www.manning.com/books/the-coder-cafe
- https://www.manning.com/books/grokking-software-architecture
- Software Architecture Fundamentals, Third Edition - https://www.oreilly.com/videos/software-architecture-fundamentals/0642572016094/
https://bytebytego.com - https://highscalability.com
https://github.com/donnemartin/system-design-primer
https://github.com/karanpratapsingh/system-design - https://www.karanpratapsingh.com/courses/system-design - https://leanpub.com/systemdesign
https://github.com/bregman-arie/system-design-notebook
https://github.com/binhnguyennus/awesome-scalability
https://www.developertoarchitect.com - Mark Richards
Software Architecture Monday - https://www.developertoarchitect.com/lessons/ - https://www.youtube.com/@markrichards5014
Cloud design patterns - https://learn.microsoft.com/en-us/azure/architecture/patterns/
Engineering at N26: a Tour of our Tech Stack and Architecture - https://medium.com/insiden26/engineering-at-n26-a-tour-of-our-tech-stack-and-architecture-9e58ce96f889
https://github.com/ByteByteGoHq/system-design-101
Self-contained Systems - https://scs-architecture.org
https://github.com/dapr/dapr - https://dapr.io
Virtual Waiting Room on AWS - To sell concert tickets. Uses an SQS queue - https://aws.amazon.com/solutions/implementations/virtual-waiting-room-on-aws/
Every system has a hot path - https://www.linkedin.com/posts/raul-junco_every-system-has-a-hot-path-and-its-the-activity-7396182757405638656-lpWS/
Change Data Capture (CDC) - Outbox Pattern - https://www.linkedin.com/posts/raul-junco_i-have-seen-this-mistake-in-production-the-activity-7434222637809033216-dUU3/

How to retry payments - https://newsletter.systemdesignclassroom.com/p/every-backend-engineer-needs-to-know
Diagrams
Software Architecture Fundamentals—Diagramming and Documenting Architecture - https://www.oreilly.com/videos/software-architecture-fundamentals-diagramming/0636920342540/
System Design
System Design Staircase - https://www.linkedin.com/posts/raul-junco_system-design-isnt-one-big-concept-it-activity-7386008937277472768-Xfnq/
https://algomaster.io/learn/system-design/course-introduction
https://algomaster.io/learn/system-design-interviews/introduction
Every system eventually fails or scales at its data layer first, not its API or cache. That's why strong database design, indexing, and query optimization are the real foundations of scalability. source
System design checklist - https://www.linkedin.com/posts/raul-junco_i-treat-system-design-like-a-checklist-activity-7415014883634905088-RpTW/
Fundamentals
- What are the read patterns?
- What are the write patterns?
- Who owns the source of truth?
- Is consistency or availability more critical?
- Single writer or multiple writers?
Architecture
- Synchronous or async?
- Do I need a queue, or is a cron job enough?
- Can I separate the compute from storage?
- Stateless or stateful services?
- Contracts versioned?
Reliability
- What happens when this fails?
- Where’s the retry logic, and is it idempotent?
- Are we alerting to symptoms or root causes?
- Timeouts configured?
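A minimal sketch of the retry and idempotency questions above. `PaymentService`, `call_with_retry`, the idempotency key "order-42" and the backoff values are hypothetical names chosen for illustration. The receiver deduplicates by key, so a retry after a lost response does not charge twice.

```python
import random
import time


class PaymentService:
    """Hypothetical receiver that deduplicates by idempotency key."""

    def __init__(self, failures_before_success=0):
        self.processed = {}          # idempotency key -> stored result
        self.charges_applied = 0     # how many times money actually moved
        self._failures_left = failures_before_success

    def charge(self, key, amount):
        if key in self.processed:    # a retry of work that was already done
            return self.processed[key]
        self.charges_applied += 1
        self.processed[key] = f"charged {amount}"
        if self._failures_left > 0:  # simulate a response lost after the write
            self._failures_left -= 1
            raise TimeoutError("response lost")
        return self.processed[key]


def call_with_retry(operation, attempts=3, base_delay=0.01, sleep=time.sleep):
    """Retry on timeout with exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            sleep(random.uniform(0, base_delay * 2 ** attempt))


service = PaymentService(failures_before_success=1)
result = call_with_retry(lambda: service.charge("order-42", 100),
                         sleep=lambda seconds: None)  # skip real sleeping here
```

Without the key check in `charge`, the same retry would apply the charge a second time.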
Scaling
- How do reads scale?
- How do writes scale?
- Will this design hold up at 10x traffic?
- What’s the hot path, and how do we optimize it?
Observability
- Do we log what we need to debug in production?
- Can we trace a request across services?
- What metrics define “healthy”?
- Debuggable without redeploy?
Coupling
https://en.wikipedia.org/wiki/Coupling_(computer_programming)
Loose coupling benefit: reduce interdependencies so a failure in one component does not cascade to other components.
Stamp coupling: modules share a data structure, but each module only uses a part (a subset of fields) of it. You overspecify a contract, which creates coupling.
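A small illustration of stamp coupling, using a hypothetical `Customer` structure and shipping rule (neither comes from the linked article). The first function accepts the whole structure but reads one field; the second asks only for what it needs.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    customer_id: str
    name: str
    email: str
    postal_code: str
    credit_limit: float


def shipping_zone_stamp_coupled(customer: Customer) -> str:
    # Depends on the whole Customer contract but reads a single field.
    return "local" if customer.postal_code.startswith("08") else "remote"


def shipping_zone(postal_code: str) -> str:
    # Depends only on the data it needs, so Customer can change freely.
    return "local" if postal_code.startswith("08") else "remote"


customer = Customer("c-1", "Ada", "ada@example.com", "08001", 500.0)
```

Any change to `Customer` forces a review of the first function, even though it never touches the changed fields.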
Synchronous vs Asynchronous communication
If you want to make a process asynchronous, you must manage the way the process initiator tracks the process status. One way of doing that is to return an ID to the initiator that can be used to look up the process. During the process, this ID is passed from step to step. (AWS in Action p. 401.)
When designing an asynchronous process, it’s important to keep track of the process. You need some kind of identifier for it. (AWS in Action p. 441.)
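A minimal sketch of the tracking idea above, assuming an in-memory store. `ProcessTracker` and the step names are hypothetical. A real system would persist the status somewhere durable.

```python
import uuid


class ProcessTracker:
    """Hypothetical in-memory status store keyed by process ID."""

    def __init__(self):
        self._status = {}

    def start(self):
        process_id = str(uuid.uuid4())
        self._status[process_id] = "accepted"
        return process_id                  # handed back to the initiator right away

    def advance(self, process_id, step):
        self._status[process_id] = step    # every step carries the same ID

    def lookup(self, process_id):
        return self._status.get(process_id, "unknown")


tracker = ProcessTracker()
pid = tracker.start()
status_at_start = tracker.lookup(pid)
for step in ("resized", "uploaded", "done"):
    tracker.advance(pid, step)
```

The initiator only keeps `pid` and polls `lookup` whenever it wants to know how far the process has got.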
Sync (request/response):
- Usually done with REST over HTTP or RPC.
- Easy to communicate the end result (success or error) to the receiver, and therefore to the end user.
- Messages are pushed (by the sender to the receiver).
- The sender knows the receiver.
- The message is handled immediately.
- The receiver needs to respond, but if it doesn't the sender will know.
- The receiver returns a direct response to the sender.
- Direct coupling.
Async (event-driven):
- Fire and forget.
- Messages are stored in a queue or a stream (a middleman).
- Messages are pulled by the receiver.
- The sender and the receiver don't know each other.
- Because the receiver doesn't know the sender, it can't send a direct reply.
- Messages are not handled immediately. The consumer processes messages at its own pace.
- More resilient: if the consumer of the event is down, we can buffer the events in a queue and continue later.
- Complex to build and debug.
- Communication between services is not obvious.
- Error handling is difficult.
- Transactions are difficult.
- Highly decoupled systems.
- Performance hit due to communication, work done in multiple services etc.
- High performance and scalability.
- Respond with HTTP status code 202 Accepted.
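A sketch of the async properties listed above, using the standard library `queue.Queue` as a stand-in for a real broker. `accept_request` and `drain` are hypothetical names. The sender returns 202 immediately, and messages stay buffered while the consumer is down.

```python
import queue

inbox = queue.Queue()   # the middleman between sender and receiver
handled = []


def accept_request(payload):
    """Sender side: enqueue and return immediately (fire and forget)."""
    inbox.put(payload)
    return 202  # Accepted: the work is queued, not finished


def drain(consumer_is_up):
    """Receiver side: pull messages at its own pace, only when it is running."""
    while consumer_is_up and not inbox.empty():
        handled.append(inbox.get())


statuses = [accept_request({"order": n}) for n in range(3)]
drain(consumer_is_up=False)        # consumer down: messages stay buffered
buffered_while_down = inbox.qsize()
drain(consumer_is_up=True)         # consumer back: it catches up
```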
Atomic vs Eventual consistency
Atomicity: we do it at the application layer, not the database.
Compensating updates can fail.
https://aphyr.com/posts/313-strong-consistency-models
Remember, without transactions, we can only build “eventually consistent” systems.
ACID vs BASE transactions
ACID:
- Atomicity, consistency, isolation and durability
- All updates or nothing
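A small demonstration of "all updates or nothing" with the standard library `sqlite3` module. The `accounts` table and the `transfer` function are made up for the example. The second update violates a CHECK constraint, so the first one is rolled back too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(amount):
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'bob'",
                (amount,),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'alice'",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(150)   # the debit fails the CHECK, so the credit is undone as well
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```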
BASE:
- Basic availability, soft state and eventual consistency
- For distributed systems
Even if services are sharing the same database, we don't support ACID transactions
BASE Transactions and Eventual Consistency - https://www.youtube.com/watch?v=I47J2I-SVi0
Prefer ACID over BASE - https://microservices.io/articles/dark-energy-dark-matter/dark-matter/prefer-acid-over-base.html
From https://chrisrichardson.net/virtual-bootcamp-distributed-data-management.html
To ensure a loose coupling, you can only use ACID transactions within a service. Between services, you must implement transactions using the Saga pattern and queries using the API Composition and CQRS patterns. As a result, it’s no longer straightforward to implement transactions and queries that are correct, efficient and resilient.
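A minimal sketch of the Saga idea mentioned above: each step is paired with a compensating action, and on failure the completed steps are undone in reverse order. `run_saga` and the step names are hypothetical. It runs in one process, whereas a real saga spans services and must cope with compensations that fail.

```python
def run_saga(steps):
    """Run (action, compensation) pairs; undo completed steps in reverse on failure."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for undo in reversed(completed):
                undo()   # in a real system this call can fail as well
            return "rolled_back"
    return "completed"


log = []


def fail_shipping():
    raise RuntimeError("no stock")


outcome = run_saga([
    (lambda: log.append("reserve_inventory"), lambda: log.append("release_inventory")),
    (lambda: log.append("charge_payment"), lambda: log.append("refund_payment")),
    (fail_shipping, lambda: log.append("cancel_shipping")),
])
```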
Orchestration vs choreography coordination
https://learn.microsoft.com/en-us/azure/architecture/patterns/choreography
Orchestration:
- The orchestrator owns the state. For example, the state of an order in the orchestrator can be created, payment_applied, order_shipped, order_complete...
- Each major workflow will have its own orchestrator.
- The orchestrator is a single point of failure. If it fails, nothing works. But if the state is persisted, we can continue processing when it's back.
- Each service is independent, and they do not communicate with each other (they only talk to the orchestrator), so they are highly decoupled.
- Error handling is done by the orchestrator. For example, if there are no items to fulfill an order, it's the orchestrator that decides what to do. Services are simpler as a result.
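A sketch of an orchestrator that owns the order state and resumes from persisted state after a crash. `OrderOrchestrator`, the service names and the dict-based `store` are hypothetical. The state names follow the example in the list above.

```python
class OrderOrchestrator:
    """Hypothetical orchestrator: owns order state and calls each service in turn."""

    STEPS = [("payment", "payment_applied"),
             ("shipping", "order_shipped"),
             ("notification", "order_complete")]

    def __init__(self, services, store):
        self.services = services   # name -> callable(order_id)
        self.store = store         # persisted state: order_id -> state

    def run(self, order_id):
        state = self.store.setdefault(order_id, "created")
        reached = [step_state for _, step_state in self.STEPS]
        start = reached.index(state) + 1 if state in reached else 0
        for name, next_state in self.STEPS[start:]:
            self.services[name](order_id)       # services never call each other
            self.store[order_id] = next_state   # record progress after each step
        return self.store[order_id]


calls = []
services = {name: (lambda order_id, n=name: calls.append(n))
            for name in ("payment", "shipping", "notification")}
store = {"order-1": "payment_applied"}   # state left behind by an interrupted run
final_state = OrderOrchestrator(services, store).run("order-1")
```

Because the state was persisted, the resumed run skips the payment step and continues with shipping.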
Choreography:
- Services communicate directly. They need to know each other, and thus are more complex.
- Much more responsive, scalable and fault tolerant.
- Error workflows are very complex. They require additional communication between services. Error handling is an architectural concern.
- We can extract error handling into an orchestrator for errors to simplify services and reduce their communication.
- It's difficult to know the overall state, for example the status of an order. There may be a service that is the state owner. State ownership is an architectural concern.
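A sketch of choreography with a tiny in-process event dispatcher. `subscribe`, `publish`, `dispatch` and the event names are hypothetical. Each service reacts to the previous service's event, and a separate state owner listens to every event to answer status questions.

```python
from collections import defaultdict, deque

subscribers = defaultdict(list)
pending = deque()
trace = []


def subscribe(event_type, handler):
    subscribers[event_type].append(handler)


def publish(event_type, order_id):
    pending.append((event_type, order_id))


def dispatch():
    """Deliver queued events in order until none are left."""
    while pending:
        event_type, order_id = pending.popleft()
        trace.append(event_type)
        for handler in subscribers[event_type]:
            handler(order_id)


# Each service reacts to an event and emits its own; there is no central coordinator.
subscribe("OrderPlaced", lambda oid: publish("PaymentApplied", oid))
subscribe("PaymentApplied", lambda oid: publish("OrderShipped", oid))

# Hypothetical state owner: records the latest event seen for each order.
order_status = {}
for event in ("OrderPlaced", "PaymentApplied", "OrderShipped"):
    subscribe(event, lambda oid, e=event: order_status.__setitem__(oid, e))

publish("OrderPlaced", "order-1")
dispatch()
```

Without the state owner, the only way to learn the order status would be to ask every service.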
Sagas
https://learn.microsoft.com/en-us/azure/architecture/patterns/saga
https://microservices.io/patterns/data/saga.html
https://learn.microsoft.com/en-us/azure/architecture/patterns/compensating-transaction
Avoid atomic + async together, and choreography + sync.
Epic
- All synchronous communication → easy to communicate the end result (success/error) to the user.
- Blocking calls → long time to run.
- If there are 3 calls to be done, and each call takes x time, it will take 3x.
- If there's an error and we have to compensate, it's 5x.
- Easy to understand because it mimics synchronous method calls, but difficult to implement.
- Most coupled solution, thus not scalable.
- Often the first thing you attempt.
- Sometimes the output of a step needs to be the input of another step, so synchronous is required.
Fantasy fiction story
- Much faster communication. No need to wait for a response. We can interact with various services in parallel.
- If there are 3 calls to be done, and each one takes x time, it will take 1x.
- More responsive and better performance.
- There can be concurrency issues.
- An attempt at fixing the epic saga's performance problems.
- Consistency hurts communication. Coordination to achieve atomic workflows.
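The latency figures in the Epic and Fantasy fiction notes above can be checked with a few lines of arithmetic. The function names are hypothetical. The 5x figure is read here as 3 forward calls plus 2 compensating calls for the steps that had already succeeded, which is an assumption about how the notes arrived at it.

```python
def sequential_latency(call_times):
    """Blocking calls run one after another, so their times add up."""
    return sum(call_times)


def sequential_with_compensation(call_times, failed_index):
    """Calls up to and including the failed one, plus one undo per earlier success.

    Assumes each compensating call costs the same as the call it undoes.
    """
    attempted = call_times[: failed_index + 1]
    undone = call_times[:failed_index]
    return sum(attempted) + sum(undone)


def parallel_latency(call_times):
    """Calls issued together finish when the slowest one does."""
    return max(call_times)


x = 1
calls = [x, x, x]
epic_total = sequential_latency(calls)                       # 3x
epic_with_error = sequential_with_compensation(calls, 2)     # 5x
parallel_total = parallel_latency(calls)                     # 1x
```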
Fairy tale
- Eventually consistent → more responsive. We are doing an early return.
- When an error happens, if compensating updates fail, we are in an inconsistent state.
- If there is an error and we have to compensate, with 3 calls that each take x time, the total in this case is 3x.
- Best simplicity and scalability.
Parallel
- Allows for complex workflows.
- Very responsive and scalable.
Phone tag
- Least common.
Horror story
- Hardest one to reason about.
- Anti-pattern.
Time travel
- No concurrency → easy to reason about.
Anthology
- Opposite of epic
- Decoupled, responsive and scalable.
Distributed Systems
Avoiding fallback in distributed systems - https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/ - https://d1.awsstatic.com/builderslibrary/pdfs/avoiding-fallback-in-distributed-systems.pdf
https://www.thoughtworks.com/radar/platforms/restate
We still maintain that it's best to avoid distributed transactions in distributed systems, because of both the additional complexity and the inevitable additional operational overhead involved
Microservices first? https://arnon.me/2025/03/services-vs-monoliths-tradeoffs/
https://www.manning.com/books/think-distributed-systems
The Eight Fallacies of Distributed Computing
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn't change
- There is one administrator
- Transport cost is zero
- The network is homogeneous
https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
https://arnon.me/wp-content/uploads/Files/fallacies.pdf
https://www.researchgate.net/publication/322500050_Fallacies_of_Distributed_Computing_Explained
Additional fallacies from Mark Richards and Neal Ford (see Software Architecture Fundamentals 2nd Edition and https://www.developertoarchitect.com/lessons/lesson18.html):
- Versioning is easy
- Compensating updates always work
- Observability is optional
Event-driven architecture
An event is a fact that something has happened.
Event-driven microservices demo built with Golang. Nomad, Consul Connect, Vault, and Terraform for deployment - https://github.com/thangchung/go-coffeeshop
https://www.developertoarchitect.com/lessons-eda.html
https://en.wikipedia.org/wiki/Event-driven_architecture
Queue
Kafka vs RabbitMQ vs SQS vs Solace - https://www.linkedin.com/posts/rocky-bhatia-a4801010_kafka-vs-rabbitmq-vs-sqs-vs-solace-choosing-share-7455218155771260928-3Inx/
You don’t pick tools based on what’s cool. You pick based on constraints.
Topics vs Queues
Here are the 5 questions that decide the right one:
- One worker or many?
If one consumer should process a message -> Queue.
If many consumers need the same message -> Topic.
Simple rule:
Queue = throughput.
Topic = fan-out.
- Can you lose messages?
If losing a message is unacceptable -> Queue wins.
Topics need more config to get the same safety guarantees.
- Are you scaling workload or audience?
Queues scale workload (parallelism).
Topics scale audience (more listeners).
Most engineers confuse the two.
- What if a consumer dies?
Queues handle tracking for you.
Topics make you handle offsets + state.
This complexity hurts when volume explodes.
- How fast is the system evolving?
New system, changing requirements? -> Topic gives you flexibility.
Stable system, clear workflow? -> Queue gives you simplicity.
My recommendation:
Start with a Queue.
When you actually need fan-out, evolve to a Topic.
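A toy model of the difference described above, with plain lists standing in for consumers. `WorkQueue` and `Topic` are hypothetical classes, and round-robin delivery is a simplification of how real brokers hand out work.

```python
import itertools


class WorkQueue:
    """Each message goes to exactly one worker (competing consumers)."""

    def __init__(self, workers):
        self._next_worker = itertools.cycle(workers)

    def send(self, message):
        next(self._next_worker).append(message)


class Topic:
    """Each message is copied to every subscriber (fan-out)."""

    def __init__(self, subscribers):
        self._subscribers = subscribers

    def send(self, message):
        for inbox in self._subscribers:
            inbox.append(message)


worker_a, worker_b = [], []
work_queue = WorkQueue([worker_a, worker_b])
billing, analytics = [], []
topic = Topic([billing, analytics])
for n in range(4):
    work_queue.send(n)
    topic.send(n)
```

Adding a worker to the queue splits the same work further; adding a subscriber to the topic creates another full copy of the traffic.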
- Streams = work already done
- Queues = work to be done
“Is this something that happened… or something we need to do?”
Streams hold truth. Queues do work.
When you get this wrong, your system bleeds.
I’ve seen $10M mistakes from teams who dump everything into queues: “Just push the order to a queue and process it!”
Most Queues delete messages after work is done. No history. No replay. No audit. Just pain and guesswork.
On the other side, some teams fall in love with Kafka: “We’ll stream EVERYTHING!”
Here’s the rule I wish someone told me early:
If the event changes the business → Stream
If the message is an action to perform → Queue
Streams = OrderPlaced, PaymentAuthorized, InventoryReserved. These are immutable facts. They must be durable, replayable, ordered, and auditable.
Queues = SendEmail, CapturePayment, GenerateInvoice. These tasks exist temporarily. They matter NOW, not 6 months from now.
Event enters the stream → workers derive jobs → queues execute tasks
Ledger first. Assembly line second.
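A sketch of the "event enters the stream, workers derive jobs, queues execute tasks" flow, with a list as the append-only stream and a deque as the task queue. The event-to-job mapping in `JOBS_FOR_EVENT` is invented for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """An immutable fact: something that already happened."""
    name: str
    order_id: str


stream = []            # append-only log of facts; kept, so it can be replayed
task_queue = deque()   # work to be done; items are removed once executed

# Hypothetical mapping from a fact to the jobs it triggers.
JOBS_FOR_EVENT = {
    "OrderPlaced": ["CapturePayment", "SendEmail"],
    "PaymentAuthorized": ["GenerateInvoice"],
}


def record(event):
    stream.append(event)
    for job in JOBS_FOR_EVENT.get(event.name, []):
        task_queue.append((job, event.order_id))


def run_tasks():
    executed = []
    while task_queue:
        executed.append(task_queue.popleft())
    return executed


record(Event("OrderPlaced", "o-1"))
record(Event("PaymentAuthorized", "o-1"))
executed = run_tasks()
```

After the tasks run, the queue is empty but the stream still holds both facts for audit or replay.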
Load Balancer vs API Gateway
A load balancer has one job: distribute traffic. Clients send HTTP(S) requests from web, mobile, or IoT apps, and the load balancer spreads those requests across multiple server instances so no single server takes all the load.
Load balancer:
- Traffic distribution
- Health checks to detect downed servers
- Failover when something breaks
- L4/L7 balancing depending on whether you're routing by IP or by actual HTTP content.
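A minimal sketch of traffic distribution with health checks and failover, assuming a round-robin policy. `RoundRobinBalancer` and the server names are hypothetical, and real balancers offer other policies such as least connections.

```python
class RoundRobinBalancer:
    """Hypothetical balancer: rotates over servers, skipping unhealthy ones."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(servers)
        self._index = 0

    def mark_down(self, server):     # outcome of a failed health check
        self.healthy.discard(server)

    def mark_up(self, server):       # outcome of a passing health check
        self.healthy.add(server)

    def pick(self):
        for _ in range(len(self.servers)):
            server = self.servers[self._index % len(self.servers)]
            self._index += 1
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers")


lb = RoundRobinBalancer(["a", "b", "c"])
first_round = [lb.pick() for _ in range(3)]
lb.mark_down("b")
after_failure = [lb.pick() for _ in range(4)]
```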
API gateway:
- Rate limiting to prevent abuse.
- API aggregation so your client doesn't need to call five different services.
- Observability for logging and monitoring.
- Authentication and authorization before a request even touches your backend.
- Request and response transformation to reshape payloads between client and service formats.
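One common way a gateway enforces the rate limiting mentioned above is a token bucket. This is a sketch under that assumption, not a description of any particular gateway product. `TokenBucket` is a hypothetical class, and the clock is passed in explicitly so the behaviour is easy to follow.

```python
class TokenBucket:
    """Allows `capacity` requests in a burst, refilled at `rate` tokens per second."""

    def __init__(self, capacity, rate, now=0.0):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.updated = now

    def allow(self, now):
        # Refill for the elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # a gateway would typically answer 429 Too Many Requests


bucket = TokenBucket(capacity=2, rate=1.0)
burst = [bucket.allow(now=0.0) for _ in range(3)]
later = bucket.allow(now=1.0)
```

The third request in the burst is rejected, and one second later a single token has been refilled.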
Backends for Frontends
https://learn.microsoft.com/en-us/azure/architecture/patterns/backends-for-frontends
Rate Limiting
https://learn.microsoft.com/en-us/azure/architecture/patterns/rate-limiting-pattern