System Design
Welcome to the Master Guide for System Design. These questions are curated for high-level technical rounds at companies like Amazon, Google, and Adobe, as well as startups. Master the art of scalability, availability, and reliability.
1. Scaling & Load Balancing
Q1: What is the difference between Vertical and Horizontal Scaling? Easy +
Vertical Scaling: Also known as "Scaling Up," it means adding more power (CPU, RAM, disk) to a single existing server.
Horizontal Scaling: Also known as "Scaling Out," it involves adding more servers to your pool to share the workload.
[Image of Vertical vs Horizontal Scaling]
Pro-Tip: Horizontal scaling is preferred for modern distributed systems because it offers High Availability and avoids a "Single Point of Failure."
Q2: How does a Round Robin Load Balancer work? Medium +
A Round Robin load balancer forwards each incoming request to the next server in a fixed, rotating order: request 1 goes to Server A, request 2 to Server B, request 3 to Server C, then back to Server A.
Limitations: It assumes all servers have equal processing power. For servers with different capacities, we use Weighted Round Robin, which sends proportionally more requests to the stronger servers.
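The rotation above can be sketched in a few lines of Java. This is a minimal in-memory selector to illustrate the idea, not a production load balancer; the class and server names are illustrative:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal round-robin selector: each call returns the next server in order,
// wrapping around at the end. AtomicInteger keeps the counter thread-safe.
public class RoundRobinBalancer {
    private final List<String> servers;
    private final AtomicInteger counter = new AtomicInteger(0);

    public RoundRobinBalancer(List<String> servers) {
        this.servers = List.copyOf(servers);
    }

    public String nextServer() {
        // floorMod guards against the int counter overflowing into negatives
        int index = Math.floorMod(counter.getAndIncrement(), servers.size());
        return servers.get(index);
    }

    public static void main(String[] args) {
        RoundRobinBalancer lb = new RoundRobinBalancer(List.of("s1", "s2", "s3"));
        for (int i = 0; i < 4; i++) {
            System.out.println(lb.nextServer()); // s1, s2, s3, s1
        }
    }
}
```

A Weighted Round Robin variant would simply repeat stronger servers in the list (or track per-server weights) so they receive more turns.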
Q3: What is "Sticky Session" in Load Balancing? Medium +
A "Sticky Session" (session affinity) means the load balancer routes all requests from a given user to the same server, typically via a cookie or by hashing a client identifier. This is useful when the server stores user session data locally, but it can lead to unbalanced loads if one user performs heavy tasks.
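One common way to implement stickiness (besides cookies) is to hash a stable session identifier so the same session always lands on the same backend. A minimal sketch, with illustrative names:

```java
import java.util.List;

// Sticky routing sketch: hashing the session ID always maps the same
// session to the same backend, so locally stored session data is found.
public class StickyRouter {
    private final List<String> servers;

    public StickyRouter(List<String> servers) {
        this.servers = List.copyOf(servers);
    }

    public String route(String sessionId) {
        // Math.floorMod handles negative hashCode values
        return servers.get(Math.floorMod(sessionId.hashCode(), servers.size()));
    }

    public static void main(String[] args) {
        StickyRouter router = new StickyRouter(List.of("s1", "s2", "s3"));
        // The same session ID is always routed to the same server
        System.out.println(router.route("session-42").equals(router.route("session-42")));
    }
}
```

Note the trade-off mentioned above: if one "hot" session does heavy work, its assigned server carries the load alone.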
How do you make a Singleton class thread-safe in Java? Hard +
A naive lazy-initialized Singleton is not thread-safe because two threads can pass the if (instance == null) check simultaneously and each create its own instance.
1. Double-Checked Locking (Recommended for Performance)
This is the most efficient way. We use the volatile keyword to ensure visibility across threads and a synchronized block to handle the lock.
public class DatabaseConnection {
private static volatile DatabaseConnection instance;
private DatabaseConnection() {} // Private constructor
public static DatabaseConnection getInstance() {
if (instance == null) { // First check
synchronized (DatabaseConnection.class) {
if (instance == null) { // Second check
instance = new DatabaseConnection();
}
}
}
return instance;
}
}
2. Bill Pugh Singleton (The Industry Standard)
This uses a "Static Inner Helper Class." The JVM only loads the inner class into memory when getInstance() is called, making it thread-safe and lazy-loaded without needing synchronized blocks.
public class Singleton {
private Singleton() {}
private static class SingletonHelper {
private static final Singleton INSTANCE = new Singleton();
}
public static Singleton getInstance() {
return SingletonHelper.INSTANCE;
}
}
3. Enum Singleton (The Safest Way)
As per Joshua Bloch (Effective Java) , an Enum is the best way to implement a Singleton. It provides 100% thread safety and prevents "Reflection" or "Serialization" from creating a second instance.
public enum Logger {
INSTANCE;
public void log(String msg) {
System.out.println(msg);
}
}
Interview Pro-Tip: Always mention the volatile keyword in Double-Checked Locking. Without it, a thread might see a half-initialized object due to "Instruction Reordering" by the JVM.
How do you process a 10GB file or millions of database records in Java without crashing the RAM? Hard +
1. Reading Large Files (Files.lines)
Never use Files.readAllLines() for large files. Instead, use Files.lines(), which returns a Stream and reads the file line-by-line.
try (Stream<String> lines = Files.lines(Paths.get("large_log_file.txt"))) {
lines.filter(line -> line.contains("ERROR"))
.map(String::toUpperCase)
.forEach(System.out::println);
} catch (IOException e) {
e.printStackTrace();
}
2. Database Streaming (Spring Data JPA)
When fetching millions of rows, avoid List<Entity>. Use the Stream return type in your Repository. This uses a Database Cursor to fetch rows one by one.
@Query("select u from User u")
Stream<User> getAllUsersStream();
// Inside Service (Must be @Transactional)
try (Stream<User> userStream = userRepository.getAllUsersStream()) {
userStream.forEach(user -> process(user));
}
3. Parallel Streams for Performance
If the task is CPU-intensive (like calculating a hash for every record), use .parallelStream(). This splits the data across multiple CPU cores using the ForkJoinPool.
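As a sketch of that idea (the fakeHash workload below is a stand-in for real CPU-intensive work per record): because summation is associative, the parallel result matches the sequential one.

```java
import java.util.List;
import java.util.stream.IntStream;

public class ParallelHashing {
    // Stand-in for CPU-intensive work per record (e.g., computing a hash)
    static int fakeHash(int record) {
        int h = record;
        for (int i = 0; i < 1000; i++) {
            h = h * 31 + i;
        }
        return h;
    }

    public static void main(String[] args) {
        List<Integer> records = IntStream.rangeClosed(1, 100_000).boxed().toList();

        // parallelStream() splits the work across CPU cores
        // using the common ForkJoinPool
        long sum = records.parallelStream()
                          .mapToLong(ParallelHashing::fakeHash)
                          .sum();
        System.out.println("Checksum: " + sum);
    }
}
```

Caveat: parallel streams help only for CPU-bound work on large datasets; for I/O-bound tasks (file or database streaming, as above) they mostly add overhead.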
Why this is the "Senior" Approach:
- Memory Efficiency: Only 1 record is in memory at a time.
- Pipelining: Multiple operations (filter, map, sort) are combined into a single pass over the data.
- Short-circuiting: Operations like findFirst() stop the processing immediately once the result is found, saving CPU cycles.
⚠️ Crucial Warning: Always use a try-with-resources block when streaming from files or databases to ensure the underlying resources (file handles/connections) are closed.
How do you design a 100% Immutable Class in Java? Hard +
The 5 Strict Rules:
- Declare the class as final: This prevents other classes from extending it and overriding its methods.
- Make all fields private and final: private ensures encapsulation, and final ensures they are initialized only once.
- No setter methods: Do not provide any methods that can change the state of the fields.
- Initialize all fields via the constructor: Perform a deep copy for mutable objects (like Lists or Dates) during initialization.
- Return deep copies in getters: Never return the actual reference of a mutable object; return a copy instead.
import java.util.ArrayList;
import java.util.List;
public final class Student {
private final String name;
private final List<String> courses;
public Student(String name, List<String> courses) {
this.name = name;
// Deep Copy: Don't just do this.courses = courses;
this.courses = new ArrayList<>(courses);
}
public String getName() {
return name;
}
public List<String> getCourses() {
// Return a copy to prevent the caller from modifying the list
return new ArrayList<>(courses);
}
}
Why use Immutability in System Design?
- Thread Safety: Since the state never changes, multiple threads can access it without synchronization.
- Caching: Immutable objects are perfect keys for HashMap because their hashCode never changes.
- Reliability: It prevents "Side Effects" where one part of the code accidentally changes data used by another part.
Pro-Tip for Java 17: You can use Records to create immutable data carriers instantly: public record Student(String name, List<String> courses) {}. However, note that a Record's canonical constructor stores the list reference as-is (no defensive copy), so you still need to copy mutable lists yourself for 100% immutability.
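A sketch of that fix, assuming Java 16+: a compact constructor replaces the incoming list with an unmodifiable copy, so later changes to the caller's list cannot leak in.

```java
import java.util.ArrayList;
import java.util.List;

// Record with a compact constructor that defensively copies the mutable list.
// List.copyOf returns an unmodifiable copy, so later changes to the caller's
// list (or attempts to mutate via the accessor) cannot affect the record.
record Student(String name, List<String> courses) {
    Student {
        courses = List.copyOf(courses);
    }
}

class RecordDemo {
    public static void main(String[] args) {
        List<String> input = new ArrayList<>(List.of("Math"));
        Student s = new Student("Alice", input);
        input.add("Physics");            // does not affect the record
        System.out.println(s.courses()); // [Math]
    }
}
```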
Scenario: Users are reporting 500 Errors in Production. What is your step-by-step debugging process? Hard +
Step 1: Detection & Verification (The "What")
- Check Monitoring Tools (Prometheus, Grafana, Datadog) to see the scale. Is it 1% of users or 100%?
- Identify the Impact Area: Is it a specific region (Indore/Mumbai) or a specific feature (Payment/Login)?
Step 2: Log Analysis (The "Where")
- Use the ELK Stack (Elasticsearch, Logstash, Kibana) or Splunk to search the error logs.
- Correlation IDs: Trace a single failed request across multiple Microservices using tools like Zipkin or Jaeger.
- Recent Changes: Check the CI/CD pipeline. Was a new deployment made in the last 30 minutes? (If yes, Rollback immediately).
Step 3: Mitigation (The "Stop the Bleeding")
Do not try to "fix the code" while the site is down. Restore service first:
- Rollback: Revert to the last stable version.
- Restart/Scale: If it's a memory leak or high CPU, restart the pods or add more nodes.
- Circuit Breaking: If a 3rd party API (like a Payment Gateway) is down, enable the circuit breaker to show a "Service Unavailable" message instead of crashing the app.
Step 4: Root Cause Analysis (The "Why")
Once the site is stable, perform a deep dive:
// Example: Checking a heap dump for a memory leak
jmap -dump:live,format=b,file=heap.bin [pid]
// Analyze heap.bin in Eclipse MAT to find the leaking object.
Step 5: Post-Mortem (The "Never Again")
- Update Unit Tests to catch this specific bug in the future.
- Add Alerting Rules so you get notified before the users do.
Interview Tip: Mention that "Rollback is always the first choice" if a recent deployment happened. Don't try to "fix-forward" in the middle of a major outage.
Q: Your Java service is slow, but CPU usage is under 10%. What do you investigate first? Hard +
1. Database Connection Pool Exhaustion
This is the #1 cause. If your connection pool (e.g., HikariCP) is sized at 10 but you have 100 concurrent users, 90 requests will wait for a connection to become free. The CPU stays idle because those threads are just sitting there.
- Check: Hikari metrics or the ActiveConnections count.
- Fix: Increase the pool size or optimize slow SQL queries.
2. Lock Contention
Multiple threads might be fighting for the same synchronized block or a ReentrantLock. One thread holds the lock for too long, and the others form a queue.
- Check: Take a Thread Dump using jstack [pid]. Look for threads in BLOCKED status.
3. Slow Downstream Services (Network I/O)
If your service calls a third-party API (like a Payment Gateway or SMS service) and that service is slow, your Java thread stays open waiting for the HTTP response. Since it's waiting on the network, it uses 0% CPU.
- Check: Distributed Tracing (Zipkin/Jaeger) to see which downstream service is slow.
- Fix: Implement Timeouts and Circuit Breakers (Resilience4j).
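Resilience4j is the full-featured option; as a dependency-free sketch of the same idea, CompletableFuture.orTimeout (Java 9+) fails fast and falls back instead of blocking the caller forever. The delays, the 500 ms timeout, and the fallback string are all illustrative:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class TimeoutDemo {
    // Simulated third-party call that takes delayMillis to respond
    static CompletableFuture<String> callSlowService(long delayMillis) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                TimeUnit.MILLISECONDS.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "real response";
        });
    }

    // Fail fast after 500 ms and serve a fallback instead of blocking the caller
    public static String callWithTimeout(long delayMillis) throws Exception {
        return callSlowService(delayMillis)
                .orTimeout(500, TimeUnit.MILLISECONDS)
                .exceptionally(ex -> "fallback response")
                .get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(callWithTimeout(10));    // fast downstream -> real response
        System.out.println(callWithTimeout(3000));  // slow downstream -> fallback response
    }
}
```

A real circuit breaker additionally tracks the failure rate and stops calling the downstream service entirely while it is "open"; the timeout-plus-fallback above is only the first half of that pattern.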
4. Stop-The-World Garbage Collection
In some cases, specific GC phases (like a Full GC of the Old Generation) pause all application threads. While the threads are paused, they don't use CPU for application logic.
- Check: GC logs using -XX:+PrintGCDetails or tools like GCEasy.io.
5. GC Thrashing (Heap Pressure)
If the Heap is almost full, the JVM spends all its time trying to find a tiny bit of space to allocate new objects. This constant "GC Thrashing" slows down the app without heavy CPU computation.
Pro-Tip: Use VisualVM or JConsole to see the Live Thread Count. If you see many threads in a "Yellow" (Waiting) state, you have found your bottleneck.
Q: A Thread Pool is configured correctly (e.g., 20 core threads), but tasks are still delayed. Why? Hard +
1. The "Unbounded Queue" Problem
If you use a LinkedBlockingQueue without a capacity limit, tasks will sit in the queue indefinitely if they arrive faster than they can be processed. The threads are working fine, but the Queue Wait Time is killing your performance.
- Check: Monitor getQueue().size().
- Fix: Use a bounded queue and implement a RejectedExecutionHandler.
2. Context-Switching Overhead
If your "correct configuration" actually exceeds the number of available CPU cores (e.g., 200 threads on an 8-core machine), the OS spends more time switching between threads than actually executing code.
- Check: Use vmstat or top to look for a high "cs" (context switches) count.
- Rule: For CPU-bound tasks, the pool size should be N + 1 (where N = number of cores).
3. Lock Contention on Shared Resources
Your 20 threads might all be trying to access the same synchronized block, database connection, or shared file handle: 1 thread works while 19 threads wait in a BLOCKED state.
- Check: Thread Dump (jstack). Look for "waiting to lock <0x000...>".
4. Stop-The-World GC Pauses
If the JVM is performing a Full GC, it triggers a "Stop-The-World" event. Every single thread in your "correctly configured" pool will be paused. To the user it looks like a delay, even though the pool is technically fine.
- Check: GC logs or jstat -gcutil.
5. Noisy Neighbors (CPU/IO Starvation)
If other processes on the same server (like a heavy backup script or another service) are consuming all the CPU/IO, your threads won't get enough time slices from the OS scheduler.
The "Senior" Troubleshooting Formula:
Task Latency = Queue Wait Time + Execution Time + Blocked Time
If Execution Time is low, focus 100% on Queue Wait Time and Blocked Time.
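The formula can be instrumented directly: record the submit time and compare it with the moment the task actually starts running. A toy measurement, where a 300 ms blocker simulates a busy pool:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class QueueWaitDemo {
    public static long measureQueueWaitMillis() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        // Occupy the single worker so the next task must wait in the queue
        pool.submit(() -> sleep(300));
        long submitted = System.nanoTime();
        // The task reports how long it sat in the queue before starting
        Future<Long> waited = pool.submit(() -> (System.nanoTime() - submitted) / 1_000_000);
        long queueWaitMillis = waited.get();
        pool.shutdown();
        return queueWaitMillis;
    }

    static void sleep(long ms) {
        try {
            TimeUnit.MILLISECONDS.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws Exception {
        // Roughly 300 ms here: the task's latency is almost pure queue wait
        System.out.println("Queue wait: " + measureQueueWaitMillis() + " ms");
    }
}
```

In production you would capture the same two timestamps around your executor (or use a metrics library) rather than a hand-rolled demo.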
Q: Your application throws OutOfMemoryError, but the Heap is only 40% full. What is happening? Hard +
java.lang.OutOfMemoryError does not always mean the Heap is full. It means the JVM cannot allocate memory somewhere.
1. Metaspace Exhaustion (java.lang.OutOfMemoryError: Metaspace)
The Metaspace stores class metadata (method names, field types, etc.). If your app uses heavy Reflection or dynamic class loading (like Spring, Hibernate, or CGLIB), the Metaspace can fill up.
- Check: jstat -gc [pid] or the Metaspace charts in VisualVM.
- Fix: Increase -XX:MaxMetaspaceSize.
2. Native Thread Exhaustion (unable to create new native thread)
Every time you start a new Thread(), the JVM requests memory from the OS for the thread stack (usually 1MB per thread). If the OS runs out of RAM or hits the "max user processes" limit (ulimit), the JVM throws OOM even if the Heap is nearly empty.
- Check: ulimit -u on Linux, or look for "native thread" in the error message.
- Fix: Use a Thread Pool instead of creating new threads manually.
3. Direct Buffer Memory (off-heap)
High-performance libraries like Netty or NIO use "Direct Buffers," which sit outside the JVM Heap to avoid copying data. If these aren't released, you get a "Direct buffer memory" OOM.
- Check: Use -XX:NativeMemoryTracking=detail and jcmd to track native allocations.
4. GC Overhead Limit Exceeded
This happens when the JVM spends more than 98% of its time doing Garbage Collection but recovers less than 2% of the Heap. Even if the Heap isn't technically "full," the JVM gives up because it can't make progress.
- Fix: Identify the "GC Thrashing" cause using a Heap Dump.
5. Compressed Class Space
In 64-bit JVMs, a special part of Metaspace called the "Compressed Class Space" has a default limit of 1GB. If you load too many classes, this can fail before the main Metaspace is full.
The "Senior" Troubleshooting Step:
Always check the exact message following the OutOfMemoryError: is it Java heap space, Metaspace, Direct buffer memory, or unable to create new native thread?