Top Distributed Networking Interview Questions: CAP, Consensus, and System Design (2026)
Distributed networking is fundamental to modern systems—from microservices to blockchain to global CDNs. Understanding these concepts is crucial for senior engineering roles. This guide covers the most common interview questions with detailed answers and practical examples.
Fundamentals
1. What is a distributed system?
Answer: A distributed system is a collection of independent computers that appear to users as a single coherent system. Key characteristics:
- Concurrency: Components execute simultaneously
- No global clock: Nodes have independent clocks
- Independent failures: Components fail independently, so the system must tolerate partial failure
- Message passing: Communication via network messages
Examples: Google Search, Netflix streaming, Bitcoin network.
2. Explain the CAP theorem
Answer: The CAP theorem states that a distributed system can only guarantee two of three properties:
- Consistency: Every read sees the most recent write (all nodes appear to hold the same data)
- Availability: Every request receives a non-error response, though not necessarily the latest data
- Partition tolerance: The system keeps operating despite dropped or delayed messages between nodes
In practice, partition tolerance is not optional (networks do fail), so during a partition you choose between CP (consistency) and AP (availability):
CP Systems: MongoDB, HBase, ZooKeeper, etcd
- Sacrifice availability during partitions
- Strong consistency guarantees
AP Systems: Cassandra, CouchDB, DynamoDB (with its default eventually consistent reads)
- Remain available during partitions
- Eventually consistent
3. What is eventual consistency?
Answer: Eventual consistency guarantees that if no new updates are made, all replicas will eventually converge to the same value. It's a weaker guarantee than strong consistency but enables higher availability.
// Example: DNS propagation
// Update takes time to propagate globally
// Different users may see different values temporarily
// Eventually, all DNS servers have the same record
// Conflict resolution strategies:
// 1. Last-write-wins (LWW) - timestamp-based
// 2. Vector clocks - track causality
// 3. CRDTs - mathematically guaranteed convergence
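To make the CRDT point concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs; the class name and API are illustrative rather than taken from a specific library:
// G-Counter: each replica increments only its own slot;
// merge takes the element-wise max, so all replicas converge
// to the same value regardless of message order.
class GCounter {
  constructor(replicaId) {
    this.replicaId = replicaId;
    this.counts = {}; // replicaId -> count
  }
  increment(n = 1) {
    this.counts[this.replicaId] = (this.counts[this.replicaId] || 0) + n;
  }
  value() {
    return Object.values(this.counts).reduce((sum, c) => sum + c, 0);
  }
  merge(other) {
    for (const [id, count] of Object.entries(other.counts)) {
      this.counts[id] = Math.max(this.counts[id] || 0, count);
    }
  }
}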
4. Explain the difference between horizontal and vertical scaling
Answer:
| Aspect | Vertical Scaling | Horizontal Scaling |
|---|---|---|
| Method | Add resources to one machine | Add more machines |
| Cost | Expensive at scale | Commodity hardware |
| Limit | Hardware ceiling | Theoretically unlimited |
| Complexity | Simple | Requires distribution logic |
| Downtime | Usually required | Zero downtime possible |
Consensus and Coordination
5. What is the Raft consensus algorithm?
Answer: Raft is a consensus algorithm for managing a replicated log. It's designed to be understandable (unlike Paxos). Key components:
- Leader election: One node is elected leader, handles all client requests
- Log replication: Leader replicates entries to followers
- Safety: A node can be elected leader only if its log is at least as up-to-date as a majority of the cluster's
Raft states:
1. Follower - Default state, responds to leader
2. Candidate - Requesting votes for leadership
3. Leader - Handles all client requests
Election process:
1. Follower timeout expires
2. Becomes candidate, increments term
3. Requests votes from peers
4. Majority votes = new leader
5. Sends heartbeats to maintain leadership
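A heavily simplified sketch of the follower-to-candidate transition described above; it assumes a requestVote RPC helper on each peer and omits log up-to-date checks, vote persistence, and retries:
// What a follower does when its election timeout fires (simplified)
class RaftNode {
  constructor(id, peers) {
    this.id = id;
    this.peers = peers; // stubs exposing a requestVote RPC (assumption)
    this.term = 0;
    this.state = 'FOLLOWER';
    this.votedFor = null;
  }
  async onElectionTimeout() {
    this.state = 'CANDIDATE';
    this.term += 1;          // step 2: increment term
    this.votedFor = this.id; // vote for self
    let votes = 1;
    // step 3: request votes from all peers in parallel
    const replies = await Promise.all(
      this.peers.map(p => p.requestVote({ term: this.term, candidateId: this.id }))
    );
    for (const reply of replies) {
      if (reply.voteGranted) votes += 1;
    }
    // step 4: a majority of the full cluster (peers + self) makes us leader
    if (votes > (this.peers.length + 1) / 2) {
      this.state = 'LEADER';
      // step 5: start sending periodic AppendEntries heartbeats
    }
  }
}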
6. What is a distributed lock?
Answer: A distributed lock ensures that only one process across multiple nodes can hold a resource at a time. The hard parts are atomic acquisition, automatic expiry (so a crashed holder cannot block others forever), and safe release (only the holder may delete the lock):
// Single-instance Redis lock using SET NX PX
// (the Redlock algorithm extends this by acquiring the lock on a majority of independent Redis nodes)
const crypto = require('crypto');
const Redis = require('ioredis');
async function acquireLock(redis, key, ttl) {
  const token = crypto.randomUUID();
  // NX = set only if the key does not exist, PX = expire after ttl milliseconds
  const result = await redis.set(key, token, 'NX', 'PX', ttl);
  return result === 'OK' ? token : null;
}
async function releaseLock(redis, key, token) {
  // Lua script for atomic check-and-delete
  const script = `
    if redis.call("get", KEYS[1]) == ARGV[1] then
      return redis.call("del", KEYS[1])
    else
      return 0
    end
  `;
  return redis.eval(script, 1, key, token);
}
// Usage
const token = await acquireLock(redis, 'my-resource', 30000);
if (token) {
  try {
    // Critical section
  } finally {
    await releaseLock(redis, 'my-resource', token);
  }
}
7. Explain vector clocks
Answer: Vector clocks track causality between events in distributed systems. Each node maintains a vector of logical timestamps:
// Vector clock example with 3 nodes
// Initial: [0, 0, 0]
// Node A sends message: [1, 0, 0]
// Node B receives and sends: [1, 1, 0]
// Node C receives: [1, 1, 1]
// Comparison rules:
// V1 < V2 if all V1[i] <= V2[i] and at least one V1[i] < V2[i]
// V1 || V2 (concurrent) if neither V1 < V2 nor V2 < V1
class VectorClock {
  constructor(nodeId, numNodes) {
    this.nodeId = nodeId; // index of this node in the clock array
    this.clock = new Array(numNodes).fill(0);
  }
  // Called before sending; returns a copy to attach to the outgoing message
  increment() {
    this.clock[this.nodeId]++;
    return [...this.clock];
  }
  // Called on receive: merge element-wise max, then tick our own entry
  update(received) {
    for (let i = 0; i < this.clock.length; i++) {
      this.clock[i] = Math.max(this.clock[i], received[i]);
    }
    this.clock[this.nodeId]++;
  }
  // `other` is a raw clock array
  compare(other) {
    let less = false, greater = false;
    for (let i = 0; i < this.clock.length; i++) {
      if (this.clock[i] < other[i]) less = true;
      if (this.clock[i] > other[i]) greater = true;
    }
    if (less && !greater) return -1; // this happened before other
    if (greater && !less) return 1;  // this happened after other
    return 0; // concurrent (or identical)
  }
}
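A short usage sketch of the class above, with node IDs as array indices, showing how concurrent updates are detected:
// Two nodes update independently without exchanging messages
const a = new VectorClock(0, 3);
const b = new VectorClock(1, 3);
a.increment(); // a.clock = [1, 0, 0]
b.increment(); // b.clock = [0, 1, 0]
console.log(a.compare(b.clock)); // 0 -> concurrent: neither happened before the other
// a performs another event and sends its clock; b merges it on receive
b.update(a.increment()); // b.clock becomes [2, 2, 0]
console.log(a.compare(b.clock)); // -1 -> a's events happened before b's latest event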
Networking Protocols
8. Compare TCP vs UDP for distributed systems
Answer:
| Feature | TCP | UDP |
|---|---|---|
| Connection | Connection-oriented | Connectionless |
| Reliability | Guaranteed delivery | Best effort |
| Ordering | Ordered | No ordering |
| Speed | Slower (handshake) | Faster |
| Use cases | HTTP, databases | DNS, streaming, gaming |
// When to use each:
// TCP: When you need reliability
// - Database connections
// - File transfers
// - API calls
// UDP: When speed matters more than reliability
// - Real-time gaming
// - Video streaming
// - DNS queries
// - Health checks
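As a small illustration of UDP's fire-and-forget nature, here is a minimal health-check ping using Node's built-in dgram module (the host, port, and payload are placeholders):
const dgram = require('dgram');
// Send a single datagram: no handshake, no delivery guarantee,
// no retransmission - if the packet is lost, nothing tells us.
function sendHealthPing(host, port) {
  const socket = dgram.createSocket('udp4');
  socket.send(Buffer.from('ping'), port, host, (err) => {
    if (err) console.error('send failed:', err);
    socket.close();
  });
}
sendHealthPing('10.0.0.5', 9999);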
9. What is gRPC and when would you use it?
Answer: gRPC is a high-performance RPC framework using Protocol Buffers and HTTP/2:
- Binary protocol: Smaller payloads than JSON
- Streaming: Bidirectional streaming support
- Code generation: Type-safe clients/servers
- HTTP/2: Multiplexing, header compression
// user.proto
syntax = "proto3";
service UserService {
  rpc GetUser(GetUserRequest) returns (User);
  rpc ListUsers(ListUsersRequest) returns (stream User);
  rpc CreateUser(User) returns (User);
}
message User {
  string id = 1;
  string name = 2;
  string email = 3;
}
message GetUserRequest {
  string id = 1;
}
message ListUsersRequest {}
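A sketch of a Node client for the service above using @grpc/grpc-js with @grpc/proto-loader; the file path and address are placeholders, and the dynamic loader exposes methods in camelCase:
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');
// Load the .proto at runtime and build a client constructor
const packageDefinition = protoLoader.loadSync('user.proto');
const proto = grpc.loadPackageDefinition(packageDefinition);
const client = new proto.UserService(
  'localhost:50051',
  grpc.credentials.createInsecure()
);
// Unary call
client.getUser({ id: '123' }, (err, user) => {
  if (err) return console.error(err);
  console.log('got user', user.name);
});
// Server-streaming call: User messages arrive as 'data' events
const stream = client.listUsers({});
stream.on('data', (user) => console.log('user', user.id));
stream.on('end', () => console.log('done'));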
10. Explain service mesh architecture
Answer: A service mesh is an infrastructure layer for service-to-service communication. Components:
- Data plane: Sidecar proxies (Envoy) handle traffic
- Control plane: Manages proxy configuration
# Istio example - traffic splitting
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
  - my-service
  http:
  - route:
    - destination:
        host: my-service
        subset: v1
      weight: 90
    - destination:
        host: my-service
        subset: v2
      weight: 10
Benefits: mTLS, observability, traffic management, retries without code changes.
Fault Tolerance
11. What is the circuit breaker pattern?
Answer: Circuit breaker prevents cascading failures by failing fast when a service is unhealthy:
class CircuitBreaker {
  constructor(options) {
    this.failureThreshold = options.failureThreshold || 5;
    this.resetTimeout = options.resetTimeout || 30000;
    this.state = 'CLOSED';
    this.failures = 0;
    this.lastFailure = null;
  }
  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailure > this.resetTimeout) {
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit breaker is OPEN');
      }
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }
  onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }
  onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    if (this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
    }
  }
}
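A brief usage sketch of the class above wrapping an HTTP call (the URL is a placeholder, and a global fetch is assumed, i.e. Node 18+):
const breaker = new CircuitBreaker({ failureThreshold: 3, resetTimeout: 10000 });
async function getInventory() {
  // After 3 consecutive failures the breaker opens and calls fail fast
  // for 10 seconds; then one trial call is allowed through (HALF_OPEN).
  return breaker.call(async () => {
    const res = await fetch('http://inventory-service/items'); // placeholder URL
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    return res.json();
  });
}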
12. Explain the bulkhead pattern
Answer: The bulkhead pattern isolates components so that a failure or overload in one cannot exhaust the resources needed by another, like the watertight compartments of a ship:
// Thread-pool bulkhead (illustrative pseudocode; ThreadPool is not a Node built-in):
// give each class of work its own fixed-size pool
// const criticalPool = new ThreadPool({ size: 10 });
// const nonCriticalPool = new ThreadPool({ size: 5 });
// If a non-critical service exhausts its pool,
// critical operations still have dedicated resources
// Semaphore bulkhead
class Bulkhead {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.current = 0;
  }
  async execute(fn) {
    // Calls over the limit are rejected (not queued) in this sketch
    if (this.current >= this.maxConcurrent) {
      throw new Error('Bulkhead full');
    }
    this.current++;
    try {
      return await fn();
    } finally {
      this.current--;
    }
  }
}
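Usage sketch: give each downstream dependency its own bulkhead so a slow, non-critical call cannot consume all capacity (the service names and client calls are hypothetical):
const paymentsBulkhead = new Bulkhead(20);       // critical dependency
const recommendationsBulkhead = new Bulkhead(5); // nice-to-have dependency
async function getRecommendations(userId) {
  try {
    // recommendationService.fetch is a hypothetical client call
    return await recommendationsBulkhead.execute(() => recommendationService.fetch(userId));
  } catch (err) {
    // 'Bulkhead full' -> degrade gracefully instead of starving critical work
    return [];
  }
}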
13. What is Byzantine fault tolerance?
Answer: BFT handles nodes that behave arbitrarily (maliciously or due to bugs), the failure model formalized by the Byzantine Generals Problem. Key facts:
- Requires 3f+1 nodes to tolerate f Byzantine faults
- Used in blockchain (PBFT, Tendermint)
- More expensive than crash fault tolerance
PBFT phases:
1. Pre-prepare: Leader proposes value
2. Prepare: Nodes broadcast prepared messages
3. Commit: Nodes commit after 2f+1 prepares
4. Reply: Send result to client
Requires 2f+1 matching messages at the prepare and commit phases. For example, tolerating f = 1 faulty replica takes n = 3f+1 = 4 replicas and quorums of 2f+1 = 3 matching messages.
Load Balancing and Routing
14. Compare load balancing algorithms
Answer:
// Round Robin - Simple rotation
class RoundRobin {
  constructor(servers) {
    this.servers = servers;
    this.current = 0;
  }
  next() {
    const server = this.servers[this.current];
    this.current = (this.current + 1) % this.servers.length;
    return server;
  }
}
// Weighted Round Robin - Based on capacity
class WeightedRoundRobin {
  constructor(servers) {
    // servers: [{ host: 'a', weight: 3 }, { host: 'b', weight: 1 }]
    // Expand each host into `weight` slots, then rotate over the slots
    this.servers = [];
    for (const s of servers) {
      for (let i = 0; i < s.weight; i++) {
        this.servers.push(s.host);
      }
    }
    this.current = 0;
  }
  next() {
    const server = this.servers[this.current];
    this.current = (this.current + 1) % this.servers.length;
    return server;
  }
}
// Least Connections - Route to least busy
class LeastConnections {
  constructor(servers) {
    this.connections = new Map(servers.map(s => [s, 0]));
  }
  next() {
    let min = Infinity, selected;
    for (const [server, count] of this.connections) {
      if (count < min) {
        min = count;
        selected = server;
      }
    }
    this.connections.set(selected, min + 1); // track the new connection
    return selected;
  }
  // Call when a connection closes so the counts stay accurate
  release(server) {
    this.connections.set(server, this.connections.get(server) - 1);
  }
}
// Consistent Hashing - For caches/sharding
// Minimizes redistribution when nodes change
15. Explain consistent hashing
Answer: Consistent hashing distributes data across nodes while minimizing redistribution when nodes are added/removed:
const crypto = require('crypto');
class ConsistentHash {
  constructor(replicas = 100) {
    this.replicas = replicas; // virtual nodes per physical node
    this.ring = new Map();
    this.sortedKeys = [];
  }
  hash(key) {
    // MD5 is fine for placement (not security); fixed-length hex keys
    // mean lexicographic sorting matches numeric order
    return crypto.createHash('md5')
      .update(key)
      .digest('hex')
      .substring(0, 8);
  }
  addNode(node) {
    for (let i = 0; i < this.replicas; i++) {
      const hash = this.hash(`${node}:${i}`);
      this.ring.set(hash, node);
      this.sortedKeys.push(hash);
    }
    this.sortedKeys.sort();
  }
  removeNode(node) {
    for (let i = 0; i < this.replicas; i++) {
      const hash = this.hash(`${node}:${i}`);
      this.ring.delete(hash);
      this.sortedKeys = this.sortedKeys.filter(k => k !== hash);
    }
  }
  getNode(key) {
    if (this.sortedKeys.length === 0) return null;
    const hash = this.hash(key);
    // Linear scan for clarity; production implementations use binary search
    for (const nodeHash of this.sortedKeys) {
      if (hash <= nodeHash) {
        return this.ring.get(nodeHash);
      }
    }
    return this.ring.get(this.sortedKeys[0]); // wrap around the ring
  }
}
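A usage sketch of the class above; the exact key-to-node assignments depend on the hash values, but removing a node only remaps the keys it owned:
const ring = new ConsistentHash(100);
['cache-a', 'cache-b', 'cache-c'].forEach(node => ring.addNode(node));
console.log(ring.getNode('user:42')); // e.g. 'cache-b'
console.log(ring.getNode('user:43')); // e.g. 'cache-a'
// Removing one node only remaps keys that hashed to it;
// keys on the remaining nodes keep their placement.
ring.removeNode('cache-b');
console.log(ring.getNode('user:42')); // now served by one of the remaining nodes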
Messaging and Queues
16. Compare message queue patterns
Answer:
- Point-to-point: One producer, one consumer (task queues)
- Pub/sub: One producer, multiple consumers (event broadcasting)
- Request/reply: Synchronous messaging pattern
Delivery guarantees:
- At most once: May lose messages (fastest)
- At least once: May duplicate (requires idempotent consumers; see the sketch after this list)
- Exactly once: Most complex, often needs transactions
Technologies:
- RabbitMQ: Traditional message broker, AMQP
- Kafka: Distributed log, high throughput
- Redis Streams: Simple, built into Redis
- AWS SQS: Managed, scalable
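Because at-least-once delivery can redeliver a message, consumers are usually made idempotent. A minimal sketch using a Redis key as a dedupe record (the key scheme, TTL, and applyBusinessLogic handler are assumptions):
// Process each message at most once per message id, even if the broker redelivers it
async function handleMessage(redis, message) {
  // SET ... NX returns null if the key already existed
  const firstTime = await redis.set(`processed:${message.id}`, '1', 'EX', 86400, 'NX');
  if (firstTime !== 'OK') {
    return; // duplicate delivery, already handled
  }
  await applyBusinessLogic(message); // assumed application-specific handler
}
In practice the dedupe record and the business write belong in the same transaction; this sketch only shows the duplicate-detection idea.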
17. What is the outbox pattern?
Answer: The outbox pattern ensures reliable message publishing with database transactions:
// Instead of:
await db.transaction(async (tx) => {
  await tx.insert('orders', order);
  await messageQueue.publish('order.created', order); // Can fail or succeed independently of the transaction!
});
// Use the outbox pattern: write the event in the same transaction as the data
await db.transaction(async (tx) => {
  await tx.insert('orders', order);
  await tx.insert('outbox', {
    event_type: 'order.created',
    payload: JSON.stringify(order),
    created_at: new Date()
  });
});
// A separate process polls the outbox and publishes
async function processOutbox() {
  const events = await db.query(
    'SELECT * FROM outbox WHERE processed = false LIMIT 100'
  );
  for (const event of events) {
    await messageQueue.publish(event.event_type, event.payload);
    // If this update fails after publishing, the event is published again later,
    // so downstream consumers must be idempotent (at-least-once delivery)
    await db.update('outbox', { id: event.id }, { processed: true });
  }
}
Observability
18. What is distributed tracing?
Answer: Distributed tracing tracks requests across service boundaries:
// OpenTelemetry example
const { trace, context, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('my-service');
async function handleRequest(req) {
  const span = tracer.startSpan('handleRequest');
  try {
    span.setAttribute('user.id', req.userId);
    // Child span for the database call; parenting is done via context
    // in the current OpenTelemetry JS API
    const ctx = trace.setSpan(context.active(), span);
    const dbSpan = tracer.startSpan('database.query', {}, ctx);
    const result = await db.query('...');
    dbSpan.end();
    return result;
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    span.end();
  }
}
// Trace context propagation
// W3C Trace Context header: traceparent
// Format: version-trace_id-span_id-flags
// Example: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
19. Explain the three pillars of observability
Answer:
- Logs: Discrete events with context
- Metrics: Numeric measurements over time
- Traces: Request flow across services
Logs: What happened?
- Structured JSON logs (see the example after the three pillars)
- Correlation IDs
- Log levels (debug, info, warn, error)
Metrics: How is the system performing?
- Counters: request_count
- Gauges: active_connections
- Histograms: request_duration
Traces: Where did the request go?
- Spans with timing
- Parent-child relationships
- Cross-service context
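For the logging pillar, a structured JSON event carrying a correlation/trace ID is what makes cross-service debugging practical; the field names below are a common convention, not a standard:
// One event, one JSON object: easy to parse, filter, and join on trace_id
console.log(JSON.stringify({
  timestamp: new Date().toISOString(),
  level: 'error',
  service: 'checkout',
  trace_id: '0af7651916cd43dd8448eb211c80319c', // same ID shows up in the trace
  message: 'payment provider timeout',
  duration_ms: 3021
}));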
Security
20. How do you secure service-to-service communication?
Answer:
- mTLS: Mutual TLS authentication
- Service mesh: Automatic mTLS (Istio, Linkerd)
- API keys: Simple but less secure
- JWT: Stateless authentication
# Istio PeerAuthentication for mTLS
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: my-namespace
spec:
  mtls:
    mode: STRICT
System Design Questions
21. Design a distributed rate limiter
Answer:
// Token bucket with Redis
class DistributedRateLimiter {
  constructor(redis, options) {
    this.redis = redis;
    this.capacity = options.capacity;
    this.refillRate = options.refillRate; // tokens per second
  }
  async isAllowed(key) {
    const now = Date.now();
    // The Lua script keeps the read-refill-consume sequence atomic on the Redis side
    const script = `
      local key = KEYS[1]
      local capacity = tonumber(ARGV[1])
      local refillRate = tonumber(ARGV[2])
      local now = tonumber(ARGV[3])
      local bucket = redis.call('HMGET', key, 'tokens', 'lastRefill')
      local tokens = tonumber(bucket[1]) or capacity
      local lastRefill = tonumber(bucket[2]) or now
      -- Refill tokens based on elapsed time
      local elapsed = (now - lastRefill) / 1000
      tokens = math.min(capacity, tokens + elapsed * refillRate)
      if tokens >= 1 then
        tokens = tokens - 1
        redis.call('HMSET', key, 'tokens', tokens, 'lastRefill', now)
        redis.call('EXPIRE', key, 60)
        return 1
      else
        return 0
      end
    `;
    const allowed = await this.redis.eval(script, 1, key,
      this.capacity, this.refillRate, now);
    return allowed === 1;
  }
}
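Usage sketch, for example as Express-style middleware keyed by client IP (the capacity and refill rate are arbitrary):
const limiter = new DistributedRateLimiter(redis, { capacity: 10, refillRate: 5 });
async function rateLimitMiddleware(req, res, next) {
  const allowed = await limiter.isAllowed(`rate:${req.ip}`);
  if (!allowed) {
    return res.status(429).send('Too Many Requests');
  }
  next();
}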
22. How would you design a distributed cache?
Answer: Key considerations:
- Partitioning: Consistent hashing across nodes
- Replication: Primary-replica for fault tolerance
- Eviction: LRU, LFU, or TTL-based
- Consistency: Write-through, write-behind, or cache-aside
Cache-aside pattern (sketched in code after the three patterns):
1. Check cache
2. If miss, read from database
3. Update cache
4. Return data
Write-through pattern:
1. Write to cache
2. Cache writes to database
3. Ensures consistency but adds latency
Write-behind pattern:
1. Write to cache
2. Cache async writes to database
3. Better performance, eventual consistency
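To ground the cache-aside pattern above, a minimal sketch with Redis (the key scheme, TTL, and query are illustrative):
// Cache-aside: the application owns both the cache lookup and the backfill
async function getUser(redis, db, userId) {
  const cacheKey = `user:${userId}`;
  const cached = await redis.get(cacheKey);
  if (cached) {
    return JSON.parse(cached); // 1. cache hit
  }
  // 2. cache miss -> read from the database
  const user = await db.query('SELECT * FROM users WHERE id = ?', [userId]);
  // 3. populate the cache with a TTL so stale entries eventually expire
  await redis.set(cacheKey, JSON.stringify(user), 'EX', 300);
  return user; // 4. return the data
}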
Conclusion
Distributed systems are complex but understanding these core concepts—consensus, fault tolerance, networking, and observability—will prepare you for senior engineering interviews. Focus on trade-offs: there's rarely a perfect solution, only trade-offs appropriate for specific requirements.