Top Distributed Networking Interview Questions: CAP, Consensus, and System Design (2026)

Distributed networking is fundamental to modern systems—from microservices to blockchain to global CDNs. Understanding these concepts is crucial for senior engineering roles. This guide covers the most common interview questions with detailed answers and practical examples.

Fundamentals

1. What is a distributed system?

Answer: A distributed system is a collection of independent computers that appear to users as a single coherent system. Key characteristics:

  • Concurrency: Components execute simultaneously
  • No global clock: Nodes have independent clocks
  • Independent failures: Components fail independently, so part of the system can be down while the rest keeps running
  • Message passing: Communication via network messages

Examples: Google Search, Netflix streaming, Bitcoin network.

2. Explain the CAP theorem

Answer: The CAP theorem states that a distributed system can only guarantee two of three properties:

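  • Consistency: Every read sees the most recent write, as if there were a single copy of the data (linearizability)
  • Availability: Every request to a non-failing node receives a non-error response
  • Partition tolerance: The system keeps operating when the network drops or delays messages between nodes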

In practice, network partitions cannot be ruled out, so partition tolerance is required. The real choice is what to do during a partition: stay consistent (CP) or stay available (AP):

CP Systems: MongoDB, HBase, etcd, ZooKeeper
- Sacrifice availability during partitions
- Strong consistency guarantees

AP Systems: Cassandra, DynamoDB, CouchDB
- Remain available during partitions
- Eventually consistent

3. What is eventual consistency?

Answer: Eventual consistency guarantees that if no new updates are made, all replicas will eventually converge to the same value. It's a weaker guarantee than strong consistency but enables higher availability.

// Example: DNS propagation
// Update takes time to propagate globally
// Different users may see different values temporarily
// Eventually, all DNS servers have the same record

// Conflict resolution strategies:
// 1. Last-write-wins (LWW) - timestamp-based
// 2. Vector clocks - track causality
// 3. CRDTs - mathematically guaranteed convergence
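
To make the CRDT strategy concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. The class and method names are illustrative assumptions, not taken from any particular library. Each replica only increments its own slot, and merging takes the per-replica maximum, which is why replicas converge regardless of the order in which they exchange state.

```javascript
// Grow-only counter (G-Counter) CRDT: one slot per replica.
class GCounter {
  constructor(replicaId) {
    this.replicaId = replicaId;
    this.counts = {}; // replicaId -> increments observed from that replica
  }

  // A replica only ever increments its own slot.
  increment(amount = 1) {
    this.counts[this.replicaId] = (this.counts[this.replicaId] || 0) + amount;
  }

  // The counter value is the sum over all replicas.
  value() {
    return Object.values(this.counts).reduce((sum, n) => sum + n, 0);
  }

  // Merge takes the per-replica maximum, so it is commutative,
  // associative, and idempotent.
  merge(other) {
    for (const [id, n] of Object.entries(other.counts)) {
      this.counts[id] = Math.max(this.counts[id] || 0, n);
    }
  }
}

const a = new GCounter('a');
const b = new GCounter('b');
a.increment();
a.increment();
b.increment();

a.merge(b);
b.merge(a);
b.merge(a); // merging the same state again changes nothing
console.log(a.value(), b.value()); // both replicas report 3
```

Last-write-wins is simpler but silently discards one of two concurrent updates. A CRDT like this keeps both, at the cost of restricting which operations the data type supports.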

4. Explain the difference between horizontal and vertical scaling

Answer:

Aspect      | Vertical Scaling              | Horizontal Scaling
------------|-------------------------------|----------------------------
Method      | Add resources to one machine  | Add more machines
Cost        | Expensive at scale            | Commodity hardware
Limit       | Hardware ceiling              | Theoretically unlimited
Complexity  | Simple                        | Requires distribution logic
Downtime    | Usually required              | Zero downtime possible

Consensus and Coordination

5. What is the Raft consensus algorithm?

Answer: Raft is a consensus algorithm for managing a replicated log, designed to be easier to understand than Paxos. Key components:

  • Leader election: One node is elected leader, handles all client requests
  • Log replication: Leader replicates entries to followers
  • Safety: Only nodes with up-to-date logs can become leader

Raft states:
1. Follower - Default state, responds to leader
2. Candidate - Requesting votes for leadership
3. Leader - Handles all client requests

Election process:
1. Follower timeout expires
2. Becomes candidate, increments term
3. Requests votes from peers
4. Majority votes = new leader
5. Sends heartbeats to maintain leadership
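
The voting step of the election can be sketched as follows. This is a simplified, in-memory illustration of how a node might answer a vote request; the function name, the shape of the `node` and `request` objects, and the field names are assumptions made for this example, and persistence, timers, and networking are left out.

```javascript
// Decide whether a node grants its vote to a candidate (simplified RequestVote).
function handleRequestVote(node, request) {
  // Reject candidates from an older term.
  if (request.term < node.currentTerm) {
    return { term: node.currentTerm, voteGranted: false };
  }

  // A newer term resets this node to follower with no vote cast yet.
  if (request.term > node.currentTerm) {
    node.currentTerm = request.term;
    node.votedFor = null;
    node.state = 'follower';
  }

  // The candidate's log must be at least as up-to-date as ours:
  // compare the last log term first, then the last log index.
  const logOk =
    request.lastLogTerm > node.lastLogTerm ||
    (request.lastLogTerm === node.lastLogTerm &&
      request.lastLogIndex >= node.lastLogIndex);

  // Each node votes for at most one candidate per term.
  const canVote = node.votedFor === null || node.votedFor === request.candidateId;

  if (canVote && logOk) {
    node.votedFor = request.candidateId;
    return { term: node.currentTerm, voteGranted: true };
  }
  return { term: node.currentTerm, voteGranted: false };
}

const follower = {
  state: 'follower', currentTerm: 3, votedFor: null,
  lastLogTerm: 3, lastLogIndex: 7,
};

const first = handleRequestVote(follower,
  { term: 4, candidateId: 'n2', lastLogTerm: 3, lastLogIndex: 7 });
const second = handleRequestVote(follower,
  { term: 4, candidateId: 'n3', lastLogTerm: 3, lastLogIndex: 9 });
const stale = handleRequestVote(follower,
  { term: 2, candidateId: 'n4', lastLogTerm: 2, lastLogIndex: 1 });

console.log(first.voteGranted, second.voteGranted, stale.voteGranted);
```

The one-vote-per-term rule combined with the majority requirement is what prevents two leaders from being elected in the same term.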

6. What is a distributed lock?

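Answer: A distributed lock ensures only one process across multiple nodes can access a resource at a time. The main implementation challenges are releasing only a lock you still own and expiring locks held by crashed clients. The code below handles both with a random token and a TTL:

// Single-instance Redis lock (SET NX PX). Redlock extends the same idea
// to a majority of several independent Redis nodes.
const crypto = require('crypto');
const Redis = require('ioredis');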

async function acquireLock(redis, key, ttl) {
  const token = crypto.randomUUID();
  const result = await redis.set(key, token, 'NX', 'PX', ttl);
  return result === 'OK' ? token : null;
}

async function releaseLock(redis, key, token) {
  // Lua script for atomic check-and-delete
  const script = `
    if redis.call("get", KEYS[1]) == ARGV[1] then
      return redis.call("del", KEYS[1])
    else
      return 0
    end
  `;
  return redis.eval(script, 1, key, token);
}

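// Usage
async function main() {
  const redis = new Redis(); // assumes Redis on localhost:6379
  try {
    const token = await acquireLock(redis, 'my-resource', 30000);
    if (!token) return; // someone else holds the lock
    try {
      // Critical section (must finish before the 30 s TTL expires)
    } finally {
      await releaseLock(redis, 'my-resource', token);
    }
  } finally {
    redis.disconnect();
  }
}

main().catch(console.error);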

7. Explain vector clocks

Answer: Vector clocks track causality between events in distributed systems. Each node maintains a vector of logical timestamps:

// Vector clock example with 3 nodes
// Initial: [0, 0, 0]

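// Node A sends a message (ticks its own entry): [1, 0, 0]
// Node B receives it (merges, then ticks):      [1, 1, 0]
// Node B sends a message to C (ticks):          [1, 2, 0]
// Node C receives it (merges, then ticks):      [1, 2, 1]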

// Comparison rules:
// V1 < V2 if all V1[i] <= V2[i] and at least one V1[i] < V2[i]
// V1 || V2 (concurrent) if neither V1 < V2 nor V2 < V1

class VectorClock {
  constructor(nodeId, numNodes) {
    this.nodeId = nodeId;
    this.clock = new Array(numNodes).fill(0);
  }

  increment() {
    this.clock[this.nodeId]++;
    return [...this.clock];
  }

  update(received) {
    for (let i = 0; i < this.clock.length; i++) {
      this.clock[i] = Math.max(this.clock[i], received[i]);
    }
    this.clock[this.nodeId]++;
  }

  compare(other) {
    let less = false, greater = false;
    for (let i = 0; i < this.clock.length; i++) {
      if (this.clock[i] < other[i]) less = true;
      if (this.clock[i] > other[i]) greater = true;
    }
    if (less && !greater) return -1; // this happened before
    if (greater && !less) return 1;  // other happened before
    return 0; // concurrent (or identical clocks)
  }
}
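
The comparison rules are easiest to see on concrete vectors. The standalone helper below is a sketch written for this example (the name `compareClocks` and the string results are assumptions); it applies the same rules and additionally distinguishes identical clocks from concurrent ones.

```javascript
// Compare two vector clocks of equal length.
function compareClocks(v1, v2) {
  let less = false, greater = false;
  for (let i = 0; i < v1.length; i++) {
    if (v1[i] < v2[i]) less = true;
    if (v1[i] > v2[i]) greater = true;
  }
  if (less && !greater) return 'before';   // v1 happened before v2
  if (greater && !less) return 'after';    // v2 happened before v1
  return less ? 'concurrent' : 'equal';    // both flags set, or neither
}

const results = [
  compareClocks([1, 0, 0], [1, 1, 0]), // every entry <=, one strictly smaller
  compareClocks([2, 0, 0], [1, 1, 0]), // larger in one entry, smaller in another
  compareClocks([1, 2, 1], [1, 1, 0]),
  compareClocks([1, 1, 0], [1, 1, 0]),
];
console.log(results); // [ 'before', 'concurrent', 'after', 'equal' ]
```

Concurrent results are the cases where a system must fall back on a conflict resolution strategy, because neither update causally supersedes the other.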

Networking Protocols

8. Compare TCP vs UDP for distributed systems

Answer:

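Feature      | TCP                                      | UDP
-------------|------------------------------------------|------------------------
Connection   | Connection-oriented                      | Connectionless
Reliability  | Reliable (acknowledged, retransmitted)   | Best effort
Ordering     | Ordered                                  | No ordering
Speed        | Slower (handshake)                       | Faster
Use cases    | HTTP, databases                          | DNS, streaming, gaming

// When to use each: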
// TCP: When you need reliability
// - Database connections
// - File transfers
// - API calls

// UDP: When speed matters more than reliability
// - Real-time gaming
// - Video streaming
// - DNS queries
// - Health checks

9. What is gRPC and when would you use it?

Answer: gRPC is a high-performance RPC framework using Protocol Buffers and HTTP/2:

  • Binary protocol: Smaller payloads than JSON
  • Streaming: Bidirectional streaming support
  • Code generation: Type-safe clients/servers
  • HTTP/2: Multiplexing, header compression

// user.proto
syntax = "proto3";

service UserService {
  rpc GetUser(GetUserRequest) returns (User);
  rpc ListUsers(ListUsersRequest) returns (stream User);
  rpc CreateUser(User) returns (User);
}

message User {
  string id = 1;
  string name = 2;
  string email = 3;
}

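message GetUserRequest {
  string id = 1;
}

// Referenced by ListUsers above; filter and paging fields omitted here
message ListUsersRequest {}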

10. Explain service mesh architecture

Answer: A service mesh is an infrastructure layer for service-to-service communication. Components:

  • Data plane: Sidecar proxies (Envoy) handle traffic
  • Control plane: Manages proxy configuration

# Istio example - traffic splitting
# (subsets v1 and v2 must be defined in a matching DestinationRule)
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
  - my-service
  http:
  - route:
    - destination:
        host: my-service
        subset: v1
      weight: 90
    - destination:
        host: my-service
        subset: v2
      weight: 10

Benefits: mTLS, observability, traffic management, retries without code changes.

Fault Tolerance

11. What is the circuit breaker pattern?

Answer: Circuit breaker prevents cascading failures by failing fast when a service is unhealthy:

class CircuitBreaker {
  constructor(options) {
    this.failureThreshold = options.failureThreshold || 5;
    this.resetTimeout = options.resetTimeout || 30000;
    this.state = 'CLOSED';
    this.failures = 0;
    this.lastFailure = null;
  }

  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailure > this.resetTimeout) {
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit breaker is OPEN');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    if (this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
    }
  }
}

12. Explain the bulkhead pattern

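Answer: The bulkhead pattern isolates components into separate resource pools so that a failure or overload in one cannot exhaust the resources of another, much like watertight compartments in a ship's hull:

// Thread pool bulkhead (ThreadPool is a hypothetical pool class, shown for illustration)
const criticalPool = new ThreadPool({ size: 10 });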
const nonCriticalPool = new ThreadPool({ size: 5 });

// If non-critical service exhausts its pool,
// critical operations still have dedicated resources

// Semaphore bulkhead
class Bulkhead {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.current = 0; // number of calls currently in flight
  }

  async execute(fn) {
    if (this.current >= this.maxConcurrent) {
      throw new Error('Bulkhead full');
    }

    this.current++;
    try {
      return await fn();
    } finally {
      this.current--;
    }
  }
}

13. What is Byzantine fault tolerance?

Answer: BFT handles nodes that behave arbitrarily (maliciously or due to bugs). The Byzantine Generals Problem:

  • Requires 3f+1 nodes to tolerate f Byzantine faults
  • Used in blockchain (PBFT, Tendermint)
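  • More expensive than crash fault tolerance, which needs only 2f+1 nodes

PBFT phases:
1. Pre-prepare: Leader (primary) assigns a sequence number and proposes the request
2. Prepare: Replicas broadcast prepare messages; a replica is "prepared" once it holds the pre-prepare plus 2f matching prepares
3. Commit: Replicas broadcast commit messages and execute the request after collecting 2f+1 matching commits
4. Reply: Replicas send the result to the client, which waits for f+1 matching replies

The prepare and commit phases each rely on a quorum of 2f+1 replicas (counting the replica itself)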
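
The arithmetic behind these numbers can be sketched in a few lines. The helper name `bftSizing` and its return fields are assumptions for this example. It shows why a quorum of 2f+1 out of 3f+1 nodes works: any two quorums overlap in at least f+1 nodes, so at least one node in the overlap is honest.

```javascript
// Cluster sizing for tolerating f faulty nodes.
function bftSizing(f) {
  const nodes = 3 * f + 1;   // minimum cluster size for Byzantine faults
  const quorum = 2 * f + 1;  // matching messages needed per phase
  return {
    nodes,
    quorum,
    // Two quorums of size q drawn from n nodes share at least 2q - n nodes.
    minOverlap: 2 * quorum - nodes,
    // For comparison: crash-only fault tolerance needs 2f + 1 nodes.
    crashTolerantNodes: 2 * f + 1,
  };
}

const one = bftSizing(1);
const two = bftSizing(2);
console.log(one); // nodes: 4, quorum: 3, minOverlap: 2, crashTolerantNodes: 3
console.log(two); // nodes: 7, quorum: 5, minOverlap: 3, crashTolerantNodes: 5
```

Since minOverlap is always f+1 and at most f nodes are faulty, two conflicting decisions can never both gather a quorum.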

Load Balancing and Routing

14. Compare load balancing algorithms

Answer:

// Round Robin - Simple rotation
class RoundRobin {
  constructor(servers) {
    this.servers = servers;
    this.current = 0;
  }

  next() {
    const server = this.servers[this.current];
    this.current = (this.current + 1) % this.servers.length;
    return server;
  }
}

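// Weighted Round Robin - Based on capacity
class WeightedRoundRobin {
  constructor(servers) {
    // servers: [{ host: 'a', weight: 3 }, { host: 'b', weight: 1 }]
    // Each host appears in the rotation once per unit of weight
    this.servers = [];
    for (const s of servers) {
      for (let i = 0; i < s.weight; i++) {
        this.servers.push(s.host);
      }
    }
    this.current = 0;
  }

  next() {
    const host = this.servers[this.current];
    this.current = (this.current + 1) % this.servers.length;
    return host;
  }
}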

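// Least Connections - Route to least busy
class LeastConnections {
  constructor(servers) {
    this.connections = new Map(servers.map(s => [s, 0]));
  }

  // Pick the server with the fewest active connections and count the new one
  next() {
    let min = Infinity, selected;
    for (const [server, count] of this.connections) {
      if (count < min) {
        min = count;
        selected = server;
      }
    }
    if (selected !== undefined) this.connections.set(selected, min + 1);
    return selected;
  }

  // Call when a request to `server` completes
  release(server) {
    if (!this.connections.has(server)) return;
    this.connections.set(server, Math.max(0, this.connections.get(server) - 1));
  }
}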

// Consistent Hashing - For caches/sharding
// Minimizes redistribution when nodes change

15. Explain consistent hashing

Answer: Consistent hashing distributes data across nodes while minimizing redistribution when nodes are added/removed:

const crypto = require('crypto');

class ConsistentHash {
  constructor(replicas = 100) {
    this.replicas = replicas;
    this.ring = new Map();
    this.sortedKeys = [];
  }

  hash(key) {
    return crypto.createHash('md5')
      .update(key)
      .digest('hex')
      .substring(0, 8);
  }

  addNode(node) {
    for (let i = 0; i < this.replicas; i++) {
      const hash = this.hash(`${node}:${i}`);
      this.ring.set(hash, node);
      this.sortedKeys.push(hash);
    }
    this.sortedKeys.sort();
  }

  removeNode(node) {
    for (let i = 0; i < this.replicas; i++) {
      const hash = this.hash(`${node}:${i}`);
      this.ring.delete(hash);
      this.sortedKeys = this.sortedKeys.filter(k => k !== hash);
    }
  }

  getNode(key) {
    const hash = this.hash(key);
    for (const nodeHash of this.sortedKeys) {
      if (hash <= nodeHash) {
        return this.ring.get(nodeHash);
      }
    }
    return this.ring.get(this.sortedKeys[0]);
  }
}
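
To see what consistent hashing is protecting against, it helps to measure the naive alternative. The sketch below (function names and key format are assumptions for this example) assigns keys with `hash % nodeCount` and counts how many keys change owner when a cluster grows from 4 to 5 nodes.

```javascript
const { createHash } = require('crypto');

// First 8 hex characters of the MD5 digest, read as a 32-bit unsigned integer.
function hashToInt(key) {
  return parseInt(createHash('md5').update(key).digest('hex').substring(0, 8), 16);
}

// Fraction of keys whose owner changes under modulo placement.
function fractionMoved(numKeys, oldNodes, newNodes) {
  let moved = 0;
  for (let i = 0; i < numKeys; i++) {
    const h = hashToInt(`key-${i}`);
    if (h % oldNodes !== h % newNodes) moved++;
  }
  return moved / numKeys;
}

const moved = fractionMoved(10000, 4, 5);
console.log(`modulo hashing moved ${(moved * 100).toFixed(1)}% of keys`);
```

With modulo placement, roughly four out of five keys move in this scenario. On a hash ring, only the keys taken over by the new node move, which is about one fifth in the ideal case.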

Messaging and Queues

16. Compare message queue patterns

Answer:

  • Point-to-point: One producer, one consumer (task queues)
  • Pub/sub: One producer, multiple consumers (event broadcasting)
  • Request/reply: Synchronous messaging pattern

Delivery guarantees:
- At most once: May lose messages (fastest)
- At least once: May duplicate (requires idempotency)
- Exactly once: Most complex, often needs transactions
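
At-least-once delivery is the most common choice in practice, so the idempotency requirement deserves an illustration. The following is a minimal sketch of a consumer that deduplicates by message ID. The class name, the `id` field, and the in-memory `Set` are assumptions for this example; a real system would keep the processed IDs in a durable store.

```javascript
// Consumer that ignores messages it has already processed.
class IdempotentConsumer {
  constructor(handler) {
    this.handler = handler;
    this.seen = new Set(); // processed message IDs (durable store in production)
  }

  // Returns true if the message was processed, false if it was a duplicate.
  receive(message) {
    if (this.seen.has(message.id)) return false;
    this.handler(message);
    // Mark as seen only after the handler succeeds, so a failed
    // message is retried on redelivery.
    this.seen.add(message.id);
    return true;
  }
}

let balance = 0;
const consumer = new IdempotentConsumer((msg) => { balance += msg.amount; });

consumer.receive({ id: 'm1', amount: 50 });
consumer.receive({ id: 'm1', amount: 50 }); // redelivered duplicate, ignored
consumer.receive({ id: 'm2', amount: 25 });
console.log(balance); // 75
```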

Technologies:
- RabbitMQ: Traditional message broker, AMQP
- Kafka: Distributed log, high throughput
- Redis Streams: Simple, built into Redis
- AWS SQS: Managed, scalable

17. What is the outbox pattern?

Answer: The outbox pattern ensures reliable message publishing with database transactions:

// Instead of:
await db.transaction(async (tx) => {
  await tx.insert('orders', order);
  await messageQueue.publish('order.created', order); // Can fail!
});

// Use outbox pattern:
await db.transaction(async (tx) => {
  await tx.insert('orders', order);
  await tx.insert('outbox', {
    event_type: 'order.created',
    payload: JSON.stringify(order),
    created_at: new Date()
  });
});

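// Separate process polls the outbox and publishes in insertion order
// (assumes the processed column defaults to false).
// A crash between publish and update republishes the event, so delivery
// is at-least-once and consumers must be idempotent.
async function processOutbox() {
  const events = await db.query(
    'SELECT * FROM outbox WHERE processed = false ORDER BY created_at LIMIT 100'
  );

  for (const event of events) {
    await messageQueue.publish(event.event_type, event.payload);
    await db.update('outbox', { id: event.id }, { processed: true });
  }
}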

Observability

18. What is distributed tracing?

Answer: Distributed tracing tracks requests across service boundaries:

// OpenTelemetry example
const { trace, context, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('my-service');

async function handleRequest(req) {
  const span = tracer.startSpan('handleRequest');

  try {
    span.setAttribute('user.id', req.userId);

    // Child span for database call: the parent is passed via context
    const ctx = trace.setSpan(context.active(), span);
    const dbSpan = tracer.startSpan('database.query', {}, ctx);
    try {
      return await db.query('...');
    } finally {
      dbSpan.end(); // end the child span even if the query throws
    }
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    span.end();
  }
}

// Trace context propagation
// W3C Trace Context header: traceparent
// Format: version-trace_id-span_id-flags
// Example: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
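
As an illustration of the header format, the sketch below parses a `traceparent` value into its four fields. The function name and return shape are assumptions for this example, and it only handles the four-field layout shown above; in a real service the OpenTelemetry propagators do this for you.

```javascript
// Parse a W3C traceparent header: version-trace_id-span_id-flags
function parseTraceparent(header) {
  const match =
    /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;

  const [, version, traceId, parentId, flags] = match;

  // Version ff and all-zero trace or span IDs are invalid.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) {
    return null;
  }

  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 0x01) === 1, // lowest bit is the sampled flag
  };
}

const parsed = parseTraceparent(
  '00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01'
);
console.log(parsed.traceId, parsed.parentId, parsed.sampled);
```

A downstream service keeps the trace ID, records the incoming span ID as its parent, and generates a new span ID for its own work.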

19. Explain the three pillars of observability

Answer:

  • Logs: Discrete events with context
  • Metrics: Numeric measurements over time
  • Traces: Request flow across services

Logs: What happened?
- Structured JSON logs
- Correlation IDs
- Log levels (debug, info, warn, error)

Metrics: How is the system performing?
- Counters: request_count
- Gauges: active_connections
- Histograms: request_duration

Traces: Where did the request go?
- Spans with timing
- Parent-child relationships
- Cross-service context
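
The logging pillar can be sketched as follows: structured JSON lines that carry a correlation ID, so entries from different services can be joined for one request. The `createLogger` helper, the field names, and the service name are assumptions for this example rather than the API of a specific logging library.

```javascript
// Minimal structured logger: one JSON object per line.
function createLogger(service, write = (line) => console.log(line)) {
  const log = (level, message, fields = {}) => {
    write(JSON.stringify({
      timestamp: new Date().toISOString(),
      level,
      service,
      message,
      ...fields,
    }));
  };
  return {
    info: (message, fields) => log('info', message, fields),
    error: (message, fields) => log('error', message, fields),
  };
}

// Collect lines in memory here; by default they go to stdout.
const lines = [];
const logger = createLogger('checkout', (line) => lines.push(line));

logger.info('order placed', { correlationId: 'req-123', orderId: 42 });
logger.error('payment failed', { correlationId: 'req-123', reason: 'card_declined' });

console.log(lines.join('\n'));
```

Filtering on `correlationId` then returns every log line for a single request, across all services that propagate the ID.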

Security

20. How do you secure service-to-service communication?

Answer:

  • mTLS: Mutual TLS authentication
  • Service mesh: Automatic mTLS (Istio, Linkerd)
  • API keys: Simple but less secure
  • JWT: Stateless authentication

# Istio PeerAuthentication for mTLS
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: my-namespace
spec:
  mtls:
    mode: STRICT

System Design Questions

21. Design a distributed rate limiter

Answer:

// Token bucket with Redis
class DistributedRateLimiter {
  constructor(redis, options) {
    this.redis = redis;
    this.capacity = options.capacity;
    this.refillRate = options.refillRate; // tokens per second
  }

  async isAllowed(key) {
    const now = Date.now();
    const script = `
      local key = KEYS[1]
      local capacity = tonumber(ARGV[1])
      local refillRate = tonumber(ARGV[2])
      local now = tonumber(ARGV[3])

      local bucket = redis.call('HMGET', key, 'tokens', 'lastRefill')
      local tokens = tonumber(bucket[1]) or capacity
      local lastRefill = tonumber(bucket[2]) or now

      -- Refill tokens
      local elapsed = (now - lastRefill) / 1000
      tokens = math.min(capacity, tokens + elapsed * refillRate)

      if tokens >= 1 then
        tokens = tokens - 1
        redis.call('HSET', key, 'tokens', tokens, 'lastRefill', now)
        redis.call('EXPIRE', key, 60)
        return 1
      else
        return 0
      end
    `;

    const allowed = await this.redis.eval(script, 1, key,
      this.capacity, this.refillRate, now);
    return allowed === 1; // the script returns 1 (allowed) or 0 (rejected)
  }
}
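
The refill arithmetic inside the Lua script is easier to follow in isolation. Below is a single-process sketch of the same token bucket with the clock passed in explicitly, so the behaviour is deterministic. The class and method names are assumptions for this example, and it is not a replacement for the Redis version, which is what makes the limit shared across nodes.

```javascript
// In-memory token bucket; `now` is a timestamp in milliseconds.
class TokenBucket {
  constructor(capacity, refillRate, now = 0) {
    this.capacity = capacity;     // maximum burst size
    this.refillRate = refillRate; // tokens added per second
    this.tokens = capacity;       // start full
    this.lastRefill = now;
  }

  tryConsume(now) {
    // Add the tokens earned since the last call, capped at capacity.
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillRate);
    this.lastRefill = now;

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

const bucket = new TokenBucket(2, 1, 0); // burst of 2, refills 1 token per second
const results = [
  bucket.tryConsume(0),    // allowed: 2 -> 1 tokens
  bucket.tryConsume(0),    // allowed: 1 -> 0 tokens
  bucket.tryConsume(500),  // rejected: only 0.5 tokens refilled
  bucket.tryConsume(1000), // allowed: a full token has accumulated
];
console.log(results); // [ true, true, false, true ]
```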

22. How would you design a distributed cache?

Answer: Key considerations:

  • Partitioning: Consistent hashing across nodes
  • Replication: Primary-replica for fault tolerance
  • Eviction: LRU, LFU, or TTL-based
  • Consistency: Write-through, write-behind, or cache-aside

Cache-aside pattern:
1. Check cache
2. If miss, read from database
3. Update cache
4. Return data

Write-through pattern:
1. Write to cache
2. Cache writes to database
3. Ensures consistency but adds latency

Write-behind pattern:
1. Write to cache
2. Cache async writes to database
3. Better performance, eventual consistency
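
The cache-aside steps can be sketched as a small wrapper around a loader function. Everything here (the `CacheAside` class, the `fakeDb` loader, the TTL value) is a hypothetical single-process illustration; a distributed cache would replace the `Map` with a client for the cache cluster.

```javascript
// Cache-aside: the application checks the cache and loads from the database on a miss.
class CacheAside {
  constructor(loadFromDb, ttlMs) {
    this.loadFromDb = loadFromDb; // async (key) => value
    this.ttlMs = ttlMs;
    this.entries = new Map();     // key -> { value, expiresAt }
  }

  async get(key, now = Date.now()) {
    const entry = this.entries.get(key);
    if (entry && entry.expiresAt > now) return entry.value;         // 1. cache hit
    const value = await this.loadFromDb(key);                       // 2. miss: read from database
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });  // 3. update cache
    return value;                                                   // 4. return data
  }

  // On writes, cache-aside typically invalidates the entry instead of updating it.
  invalidate(key) {
    this.entries.delete(key);
  }
}

let dbReads = 0;
const fakeDb = async (key) => { dbReads++; return `value-for-${key}`; };
const cache = new CacheAside(fakeDb, 1000);

async function demo() {
  await cache.get('user:1', 0);    // miss, reads the database
  await cache.get('user:1', 500);  // hit, served from cache
  await cache.get('user:1', 1500); // entry expired, reads the database again
  return dbReads;
}

demo().then((reads) => console.log(`database reads: ${reads}`));
```

Invalidating on write instead of updating avoids one class of races where a stale value overwrites a newer one in the cache.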

Conclusion

Distributed systems are complex, but understanding these core concepts (consensus, fault tolerance, networking, and observability) will prepare you for senior engineering interviews. Focus on trade-offs: there is rarely a perfect solution, only one that fits the specific requirements.