Architecture

TODO look at:

Engineering at N26: a Tour of our Tech Stack and Architecture - https://medium.com/insiden26/engineering-at-n26-a-tour-of-our-tech-stack-and-architecture-9e58ce96f889

https://github.com/ByteByteGoHq/system-design-101

Self-contained Systems - https://scs-architecture.org

https://github.com/dapr/dapr - https://dapr.io

Virtual Waiting Room on AWS - To sell concert tickets. Uses an SQS queue - https://aws.amazon.com/solutions/implementations/virtual-waiting-room-on-aws/

https://www.linkedin.com/posts/raul-junco_system-design-is-the-art-of-making-scale-activity-7392184373762162688-TQ5E/

Every system has a hot path - https://www.linkedin.com/posts/raul-junco_every-system-has-a-hot-path-and-its-the-activity-7396182757405638656-lpWS/

Change Data Capture (CDC) - Outbox Pattern - https://www.linkedin.com/posts/raul-junco_i-have-seen-this-mistake-in-production-the-activity-7434222637809033216-dUU3/

3 patterns to deal with Eventual Consistency — Source: LinkedIn

How to retry payments - https://newsletter.systemdesignclassroom.com/p/every-backend-engineer-needs-to-know

https://www.linkedin.com/posts/raul-junco_ive-seen-engineers-with-a-decade-of-experience-share-7470821243060031488-OAbD/

❌ "Our data architecture is React + Node.js + MongoDB"

✅ "We use a microservices architecture with event-driven communication, RESTful APIs, and a document-oriented database for scalability and flexibility"

Diagrams

Software Architecture Fundamentals—Diagramming and Documenting Architecture - https://www.oreilly.com/videos/software-architecture-fundamentals-diagramming/0636920342540/

System Design

System Design Staircase - https://www.linkedin.com/posts/raul-junco_system-design-isnt-one-big-concept-it-activity-7386008937277472768-Xfnq/

https://algomaster.io/learn/system-design/course-introduction

https://algomaster.io/learn/system-design-interviews/introduction

Databases - https://www.linkedin.com/posts/raul-junco_in-one-system-design-interview-i-used-kafka-activity-7384919515282817024-wcFP/

Every system eventually fails or scales at its data layer first, not its API or cache. That's why strong database design, indexing, and query optimization are the real foundations of scalability. source

System design checklist - https://www.linkedin.com/posts/raul-junco_i-treat-system-design-like-a-checklist-activity-7415014883634905088-RpTW/

Fundamentals

What are the read patterns?
What are the write patterns?
Who owns the source of truth?
Is consistency or availability more critical?
Single writer or multiple writers?

Architecture

Synchronous or async?
Do I need a queue, or is a cron job enough?
Can I separate the compute from storage?
Stateless or stateful services?
Contracts versioned?

Reliability

What happens when this fails?
Where’s the retry logic, and is it idempotent?
Are we alerting to symptoms or root causes?
Timeouts configured?

Scaling

How do reads scale?
How do writes scale?
Will this design hold up at 10x traffic?
What’s the hot path, and how do we optimize it?

Observability

Do we log what we need to debug in production?
Can we trace a request across services?
What metrics define “healthy”?
Debuggable without redeploy?

Coupling

https://en.wikipedia.org/wiki/Coupling_(computer_programming)

Loose coupling benefit: reduce interdependencies so a failure in one component does not cascade to other components.

Stamp coupling: modules share a data structure, but each module uses only a part of it (a subset of its fields). The contract is overspecified, which creates coupling.
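A minimal sketch of stamp coupling. All names here (`Order`, `shipping_label_stamp`, `shipping_label`) are hypothetical and only illustrate the idea: the first function accepts the whole structure although it reads two fields, while the second narrows the contract to what it actually needs.

```python
from dataclasses import dataclass


@dataclass
class Order:
    # Hypothetical shared structure; the fields are illustrative only.
    order_id: str
    customer_name: str
    address: str
    card_number: str
    total_cents: int


def shipping_label_stamp(order: Order) -> str:
    # Stamp coupling: depends on the whole Order but uses only two fields.
    return f"{order.customer_name}\n{order.address}"


def shipping_label(customer_name: str, address: str) -> str:
    # Narrower contract: changes to unrelated Order fields cannot affect it.
    return f"{customer_name}\n{address}"
```

With the second form, renaming or removing `card_number` or `total_cents` has no effect on the shipping code.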

Synchronous vs Asynchronous communication

tip

If you want to make a process asynchronous, you must manage the way the process initiator tracks the process status. One way of doing that is to return an ID to the initiator that can be used to look up the process. During the process, this ID is passed from step to step. (AWS in Action p. 401.)

When designing an asynchronous process, it’s important to keep track of the process. You need some kind of identifier for it. (AWS in Action p. 441.)
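A minimal sketch of the ID-based tracking described above, assuming an in-memory status store and a single worker. The function names (`start_process`, `run_worker_once`, `get_status`) and the status values are hypothetical; a real system would persist the status and use a real queue.

```python
import uuid

# In-memory stand-ins; a real system would persist both.
_status = {}
_pending = []


def start_process():
    """Accept the request and return (HTTP status, process ID) immediately."""
    process_id = str(uuid.uuid4())
    _status[process_id] = "pending"
    _pending.append(process_id)
    return 202, process_id  # 202 Accepted: the work has not happened yet


def run_worker_once():
    """Process one queued item; the ID travels with the work."""
    if _pending:
        process_id = _pending.pop(0)
        _status[process_id] = "done"


def get_status(process_id):
    """Let the initiator look up the process by its ID."""
    return _status.get(process_id, "unknown")


status_code, process_id = start_process()
print(status_code, get_status(process_id))  # 202 pending
run_worker_once()
print(get_status(process_id))  # done
```

The initiator never waits for the work itself; it only holds the ID and polls (or is notified) later.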

Sync (request/response):

Usually done with REST over HTTP or RPC.
Easy to communicate the end result (success or error) to the receiver, and therefore to the end user.
Messages are pushed (by the sender to the receiver).
The sender knows the receiver.
The message is handled immediately.
The receiver needs to respond, but if it doesn't the sender will know.
The receiver returns a direct response to the sender.
Direct coupling.

Async (event-driven):

Fire and forget.
Messages are stored in a queue or a stream (a middleman).
Messages are pulled by the receiver.
Neither the sender nor the receiver knows the other.
Because the receiver doesn't know the sender, it can't send a direct reply.
Messages are not handled immediately. The consumer processes messages at its own pace.
More resilient: if the consumer of the event is down, we can buffer the events in a queue and continue later.
Complex to build and debug.
Communication between services is not obvious.
Error handling is difficult.
Transactions are difficult.
Highly decoupled systems.
Latency of a single request can suffer from the extra communication and from work being spread across multiple services.
High overall throughput and scalability.
Respond with HTTP status code 202 Accepted.

Atomic vs Eventual consistency

Atomicity: we do it at the application layer, not the database.

Compensating updates can fail.

https://aphyr.com/posts/313-strong-consistency-models

https://jepsen.io/consistency

https://www.linkedin.com/posts/raul-junco_i-have-seen-this-mistake-in-production-the-activity-7434222637809033216-dUU3/

Remember, without transactions, we can only build “eventually consistent” systems.

ACID vs BASE transactions

ACID:

Atomicity, consistency, isolation and durability
All updates or nothing

BASE:

Basic availability, soft state and eventual consistency
For distributed systems

Even if services are sharing the same database, we don't support ACID transactions

BASE Transactions and Eventual Consistency - https://www.youtube.com/watch?v=I47J2I-SVi0

Prefer ACID over BASE - https://microservices.io/articles/dark-energy-dark-matter/dark-matter/prefer-acid-over-base.html

From https://chrisrichardson.net/virtual-bootcamp-distributed-data-management.html

To ensure a loose coupling, you can only use ACID transactions within a service. Between services, you must implement transactions using the Saga pattern and queries using the API Composition and CQRS patterns. As a result, it’s no longer straightforward to implement transactions and queries that are correct, efficient and resilient.

Orchestration vs choreography coordination

https://learn.microsoft.com/en-us/azure/architecture/patterns/choreography

Orchestration:

The orchestrator owns the state. For example, the state of an order in the orchestrator can be created, payment_applied, order_shipped, order_complete...
Each major workflow will have its own orchestrator.
The orchestrator is a single point of failure. If it fails, nothing works. But if the state is persisted, we can continue processing when it's back.
Each service is independent, and services do not communicate with each other (they only talk to the orchestrator), so they are highly decoupled.
Error handling is done by the orchestrator. For example, if there are no items to fulfill an order, it is the orchestrator that decides what to do. Services are simpler as a result.
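A minimal sketch of an orchestrator that owns the state and handles errors. The class name `OrderOrchestrator`, the step tuple layout, and the `cancelled` state are assumptions made for illustration; the other state names come from the example above.

```python
from typing import Callable, List, Tuple

# (state after success, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


class OrderOrchestrator:
    """Owns the order state and decides what to do when a step fails."""

    def __init__(self, steps: List[Step]) -> None:
        self.steps = steps
        self.state = "created"

    def run(self) -> str:
        completed: List[Step] = []
        for step in self.steps:
            state_after, action, _ = step
            try:
                action()
            except Exception:
                # Error handling lives here, not in the services.
                for _, _, compensate in reversed(completed):
                    compensate()  # note: a compensation can itself fail
                self.state = "cancelled"  # hypothetical failure state
                return self.state
            completed.append(step)
            self.state = state_after
        self.state = "order_complete"
        return self.state
```

Each service only exposes an action and a compensation; the order in which they run, and what happens on failure, is decided in one place.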

Choreography:

Services communicate directly. They need to know each other, and thus are more complex.
Much more responsive, scalable and fault tolerant.
Error workflows are very complex. They require additional communication between services. Error handling is an architectural concern.
- We can extract error handling into a dedicated orchestrator for errors to simplify the services and reduce their communication.
It's difficult to know the overall state, for example the status of an order. There may be a service that is the state owner. State ownership is an architectural concern.

Sagas

https://learn.microsoft.com/en-us/azure/architecture/patterns/saga

https://microservices.io/patterns/data/saga.html

https://learn.microsoft.com/en-us/azure/architecture/patterns/compensating-transaction

Avoid atomic + async together, and choreography + sync.

Epic

All synchronous communication → easy to communicate the end result (success/error) to the user.
Blocking calls → long time to run.
- If there are 3 calls to be done, and each call takes x time, it will take 3x.
- If there's an error and we have to compensate, it's 5x.
Easy to understand because it mimics synchronous method calls, but difficult to implement.
Most coupled solution, thus not scalable.
Often the first thing you attempt.
Sometimes the output of a step needs to be the input of another step, so synchronous is required.
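One way to read the 3x and 5x figures above, as a sketch. It assumes that every compensation also takes x and that only the calls that had already succeeded are compensated; the function name `epic_saga_time` is hypothetical.

```python
from typing import Optional


def epic_saga_time(calls: int, x: float, failed_call: Optional[int] = None) -> float:
    """Total time for sequential blocking calls that each take x.

    failed_call is the 1-based index of the call that fails, or None if all succeed.
    """
    if failed_call is None:
        return calls * x
    forward = failed_call * x              # calls made up to and including the failure
    compensations = (failed_call - 1) * x  # undo every call that had succeeded
    return forward + compensations


print(epic_saga_time(3, 1.0))                 # 3.0
print(epic_saga_time(3, 1.0, failed_call=3))  # 5.0
```

Under these assumptions, a failure on the third of three calls costs three forward calls plus two compensations.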

Fantasy fiction story

Much faster communication. No need to wait for a response. We can interact with various services in parallel.
- If there are 3 calls to be done, and each one takes x time, it will take 1x.
More responsive and better performance.
There can be concurrency issues.
Usually an attempt to fix the epic saga's performance problems.
Consistency hurts communication. Coordination to achieve atomic workflows.

Fairy tale

Eventually consistent → more responsive. We are doing an early return.
When an error happens, if compensating updates fail, we are in an inconsistent state.
- If there are 3 calls to be done, each call takes x time, and an error has to be compensated, the total in this case is 3x.
Best simplicity and scalability.

Parallel

Allows for complex workflows.
Very responsive and scalable.

Phone tag

Least common.

Horror story

Hardest one to reason about.
Anti-pattern.

Time travel

No concurrency → easy to reason about.

Anthology

Opposite of epic
Decoupled, responsive and scalable.

Distributed Systems

Avoiding fallback in distributed systems - https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/ - https://d1.awsstatic.com/builderslibrary/pdfs/avoiding-fallback-in-distributed-systems.pdf

https://www.thoughtworks.com/radar/platforms/restate

We still maintain that it's best to avoid distributed transactions in distributed systems, because of both the additional complexity and the inevitable additional operational overhead involved

Microservices first? https://arnon.me/2025/03/services-vs-monoliths-tradeoffs/

https://www.manning.com/books/think-distributed-systems

The Eight Fallacies of Distributed Computing

The network is reliable
Latency is zero
Bandwidth is infinite
The network is secure
Topology doesn't change
There is one administrator
Transport cost is zero
The network is homogeneous

https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing

https://arnon.me/wp-content/uploads/Files/fallacies.pdf

https://www.researchgate.net/publication/322500050_Fallacies_of_Distributed_Computing_Explained

Additional fallacies from Mark Richards and Neal Ford. See Software Architecture Fundamentals 2nd Edition and https://www.developertoarchitect.com/lessons/lesson18.html

Versioning is easy
Compensating updates always work
Observability is optional

Event-driven architecture

An event is a fact that something has happened.

Event-driven microservices demo built with Golang. Nomad, Consul Connect, Vault, and Terraform for deployment - https://github.com/thangchung/go-coffeeshop

https://www.developertoarchitect.com/lessons-eda.html

https://en.wikipedia.org/wiki/Event-driven_architecture

Queue

Kafka vs RabbitMQ vs SQS vs Solace - https://www.linkedin.com/posts/rocky-bhatia-a4801010_kafka-vs-rabbitmq-vs-sqs-vs-solace-choosing-share-7455218155771260928-3Inx/

https://www.linkedin.com/feed/update/urn:li:activity:7467372636730343424

Use idempotent consumers → prevents double processing.
Define retry policies (exponential backoff, max attempts).
Monitor queue length & processing lag as health indicators.
Use dead letter queues for failed messages.
Ensure message ordering only when business-critical (ordering adds cost/complexity).
Keep messages small & self-contained.
Always include correlation IDs for traceability.
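A minimal sketch combining several of the practices above: an idempotent consumer, a retry policy with exponential backoff and a maximum number of attempts, and a dead letter queue. It is not tied to any particular broker. The message shape (`message_id`, `correlation_id`) and the function names are assumptions, and jitter is left out to keep the delays easy to read.

```python
from typing import Callable, Dict, List, Set


def backoff_delays(base_seconds: float, max_attempts: int) -> List[float]:
    """Delay before each retry: base, 2*base, 4*base, ..."""
    return [base_seconds * (2 ** i) for i in range(max_attempts - 1)]


def consume(
    message: Dict[str, str],
    handler: Callable[[Dict[str, str]], None],
    processed_ids: Set[str],
    dead_letters: List[Dict[str, str]],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = lambda seconds: None,  # pass time.sleep in real use
) -> str:
    # Idempotency: a redelivered message is acknowledged without re-running the handler.
    if message["message_id"] in processed_ids:
        return "duplicate"
    delays = backoff_delays(0.1, max_attempts)
    for attempt in range(max_attempts):
        try:
            handler(message)
            processed_ids.add(message["message_id"])
            return "processed"
        except Exception:
            if attempt < max_attempts - 1:
                sleep(delays[attempt])
    # Retries exhausted: park the message instead of blocking the queue.
    dead_letters.append(message)
    return "dead_lettered"
```

The `correlation_id` is carried in the message untouched, so it reaches both the handler and the dead letter queue and can be used to trace the message later.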

https://www.linkedin.com/posts/raul-junco_you-dont-pick-tools-based-on-whats-cool-activity-7393634074793521152-ITB_/

You don’t pick tools based on what’s cool. You pick based on constraints.

Topics vs Queues

Here are the 5 questions that decide the right one:

One worker or many?

If one consumer should process a message -> Queue.
If many consumers need the same message -> Topic.

Simple rule:

Queue = throughput.
Topic = fan-out.

Can you lose messages?

If losing a message is unacceptable -> Queue wins.
Topics need more config to get the same safety guarantees.

Are you scaling workload or audience?

Queues scale workload (parallelism).
Topics scale audience (more listeners).
Most engineers confuse the two.

What if a consumer dies?

Queues handle tracking for you.
Topics make you handle offsets + state.

This complexity hurts when volume explodes.

How fast is the system evolving?

New system, changing requirements? -> Topic gives you flexibility.
Stable system, clear workflow? -> Queue gives you simplicity.

My recommendation:

Start with a Queue.
When you actually need fan-out, evolve to a Topic.
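A minimal in-memory sketch of the delivery difference between the two. The `Queue` and `Topic` classes here are hypothetical stand-ins, not a real broker API, and the round-robin hand-out in `drain` is a simplification: real brokers give each message to whichever worker is free.

```python
from collections import deque
from typing import Deque, Dict, List


class Queue:
    """Each message is delivered to exactly one worker (competing consumers)."""

    def __init__(self) -> None:
        self._messages: Deque[str] = deque()

    def publish(self, message: str) -> None:
        self._messages.append(message)

    def drain(self, workers: List[str]) -> Dict[str, List[str]]:
        received: Dict[str, List[str]] = {w: [] for w in workers}
        i = 0
        while self._messages:
            received[workers[i % len(workers)]].append(self._messages.popleft())
            i += 1
        return received


class Topic:
    """Every subscriber receives its own copy of each message (fan-out)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[str]] = {}

    def subscribe(self, name: str) -> None:
        self._subscribers[name] = []

    def publish(self, message: str) -> None:
        for inbox in self._subscribers.values():
            inbox.append(message)

    def inbox(self, name: str) -> List[str]:
        return list(self._subscribers[name])
```

Adding a worker to the queue splits the same work across more hands; adding a subscriber to the topic adds another full copy of the traffic.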

https://www.linkedin.com/posts/raul-junco_most-engineering-teams-screw-up-messaging-activity-7401244415689871361-tUEn/

Streams = work already done
Queues = work to be done

“Is this something that happened… or something we need to do?”

Streams hold truth. Queues do work.

When you get this wrong, your system bleeds.

I’ve seen $10M mistakes from teams who dump everything into queues: “Just push the order to a queue and process it!”

Most Queues delete messages after work is done. No history. No replay. No audit. Just pain and guesswork.

On the other side, some teams fall in love with Kafka: “We’ll stream EVERYTHING!”

Here’s the rule I wish someone told me early:

If the event changes the business → Stream
If the message is an action to perform → Queue

Streams = OrderPlaced, PaymentAuthorized, InventoryReserved. These are immutable facts. They must be durable, replayable, ordered, and auditable.

Queues = SendEmail, CapturePayment, GenerateInvoice. These tasks exist temporarily. They matter NOW, not 6 months from now.

Event enters the stream → workers derive jobs → queues execute tasks

Ledger first. Assembly line second.
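A minimal sketch of the "ledger first, assembly line second" flow, using plain lists in place of a real stream and a real queue. The event and task names come from the text above; the mapping `TASKS_FOR_EVENT` and the function names are assumptions made for illustration.

```python
from typing import Dict, List

# Append-only log of facts (the stream). Nothing is deleted, so it can be replayed.
event_log: List[Dict[str, str]] = []

# Hypothetical mapping from a fact to the tasks it triggers.
TASKS_FOR_EVENT = {
    "OrderPlaced": ["SendEmail", "GenerateInvoice"],
    "PaymentAuthorized": ["CapturePayment"],
}


def record(event_type: str, order_id: str) -> None:
    """Append a fact to the stream."""
    event_log.append({"type": event_type, "order_id": order_id})


def derive_jobs(events: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Turn facts into work items for a task queue."""
    return [
        {"task": task, "order_id": e["order_id"]}
        for e in events
        for task in TASKS_FOR_EVENT.get(e["type"], [])
    ]


def run_all(task_queue: List[Dict[str, str]]) -> List[str]:
    """Execute tasks; each one is removed from the queue once done."""
    done = []
    while task_queue:
        job = task_queue.pop(0)
        done.append(f'{job["task"]}:{job["order_id"]}')
    return done
```

After `run_all` the task queue is empty, while `event_log` still holds every fact and can be replayed to derive the jobs again.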

Load Balancer vs API Gateway

https://www.linkedin.com/posts/alexxubyte_systemdesign-coding-interviewtips-activity-7441872725410824192-LYfd/

A load balancer has one job: distribute traffic. Clients send HTTP(S) requests from web, mobile, or IoT apps, and the load balancer spreads those requests across multiple server instances so no single server takes all the load.

Load balancer:

Traffic distribution
Health checks to detect downed servers
Failover when something breaks
L4/L7 balancing depending on whether you're routing by IP or by actual HTTP content.
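A minimal sketch of traffic distribution, health checks, and failover, assuming round-robin selection (one of several possible strategies). The class `RoundRobinBalancer` and its methods are hypothetical; the health check result is passed in rather than probed.

```python
from typing import Dict, List


class RoundRobinBalancer:
    def __init__(self, servers: List[str]) -> None:
        self._servers = servers
        self._healthy: Dict[str, bool] = {s: True for s in servers}
        self._next = 0

    def mark(self, server: str, healthy: bool) -> None:
        """Record the result of a health check."""
        self._healthy[server] = healthy

    def pick(self) -> str:
        """Return the next healthy server, skipping any marked as down."""
        for _ in range(len(self._servers)):
            server = self._servers[self._next]
            self._next = (self._next + 1) % len(self._servers)
            if self._healthy[server]:
                return server
        raise RuntimeError("no healthy servers available")
```

Failover here is simply the skip in `pick`: once a server is marked down it stops receiving traffic until a later health check marks it healthy again.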

API gateway:

Rate limiting to prevent abuse.
API aggregation so your client doesn't need to call five different services.
Observability for logging and monitoring.
Authentication and authorization before a request even touches your backend.
Request and response transformation to reshape payloads between client and service formats.
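As an illustration of the rate limiting item above, here is a minimal token bucket sketch. A token bucket is one common algorithm for this and is an assumption here, not necessarily what any particular gateway uses. The class name and parameters are hypothetical, and the clock is injectable so the behaviour can be checked without waiting.

```python
import time
from typing import Callable


class TokenBucket:
    def __init__(
        self,
        capacity: int,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = float(capacity)  # start full
        self._last = clock()

    def allow(self) -> bool:
        """Return True if the request may pass, False if it should be rejected."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

A gateway would typically keep one bucket per client or API key and answer rejected requests with an error such as HTTP 429 Too Many Requests.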

Backends for Frontends

https://learn.microsoft.com/en-us/azure/architecture/patterns/backends-for-frontends

Rate Limiting

https://learn.microsoft.com/en-us/azure/architecture/patterns/rate-limiting-pattern
