When a telecom switch fails, calls can’t just drop. When a banking system crashes, transactions can’t disappear. When a game server dies, players can’t lose their progress. These industries have solved fault tolerance—here’s how they do it.
The secret is supervision trees: hierarchical structures in which parent processes monitor their children and automatically restart them when they fail. This architecture underpins Erlang/OTP systems reported to reach “nine nines” (99.9999999%) availability, which works out to less than 32 milliseconds of downtime per year.
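The downtime figure is simple arithmetic, sketched below. The module name DowntimeBudget and the choice of a 365.25-day year are assumptions made for this illustration only.

```elixir
defmodule DowntimeBudget do
  # Seconds in an average year of 365.25 days (an assumption of this sketch).
  @seconds_per_year 365.25 * 24 * 60 * 60

  # Milliseconds of allowed downtime per year for `nines` nines of availability.
  # Nine nines means an unavailability of 10^-9.
  def ms_per_year(nines) when is_integer(nines) and nines > 0 do
    unavailability = :math.pow(10, -nines)
    @seconds_per_year * unavailability * 1000
  end
end

IO.puts("nine nines: #{Float.round(DowntimeBudget.ms_per_year(9), 1)} ms/year")
```

The same function gives roughly five and a quarter minutes per year for the more common "five nines" target.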
The Philosophy: Let It Crash
Unlike traditional error handling where you try to anticipate and handle every possible error, OTP embraces a different philosophy: let processes crash, but make sure they restart correctly.
This seems counterintuitive, but it’s incredibly powerful:
- You don’t need defensive code for every edge case
- Fresh process state eliminates corrupted state issues
- Supervisors isolate failures to prevent cascade effects
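A minimal sketch of these points, assuming a hypothetical LetItCrash.Counter server that does no input validation at all. A bad request crashes it, the supervisor starts a fresh copy, and the next caller sees clean state. The polling helper is only there to make the demo deterministic.

```elixir
defmodule LetItCrash.Counter do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, 0, name: __MODULE__)
  def increment(by), do: GenServer.call(__MODULE__, {:increment, by})

  @impl true
  def init(count), do: {:ok, count}

  @impl true
  def handle_call({:increment, by}, _from, count) do
    # No defensive checks: a non-numeric `by` raises ArithmeticError and
    # the process crashes, discarding its state.
    new_count = count + by
    {:reply, new_count, new_count}
  end
end

defmodule LetItCrash.Demo do
  def run do
    {:ok, sup} = Supervisor.start_link([LetItCrash.Counter], strategy: :one_for_one)
    5 = LetItCrash.Counter.increment(5)
    old_pid = Process.whereis(LetItCrash.Counter)

    # The bad call kills the server; the caller sees an exit it can catch.
    try do
      LetItCrash.Counter.increment(:oops)
    catch
      :exit, _reason -> :crashed
    end

    await_restart(old_pid)

    # Fresh process, fresh state: counting starts from 0 again.
    result = LetItCrash.Counter.increment(1)
    Supervisor.stop(sup)
    result
  end

  # Polls until the supervisor has registered a new process under the name.
  defp await_restart(old_pid) do
    case Process.whereis(LetItCrash.Counter) do
      pid when is_pid(pid) and pid != old_pid ->
        :ok

      _ ->
        Process.sleep(10)
        await_restart(old_pid)
    end
  end
end

IO.inspect(LetItCrash.Demo.run(), label: "count after restart")
```

The crash is logged, but nothing else in the system has to know about it. Whether losing the count is acceptable depends on the application; state that must survive a restart belongs in a database or another process.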
Understanding Supervision Strategies
:one_for_one
If a child process terminates, only that process is restarted. Other children are unaffected.
| defmodule MyApp.Supervisor do
    use Supervisor

    def start_link(init_arg) do
      Supervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
    end

    @impl true
    def init(_init_arg) do
      children = [
        {MyApp.Cache, []},
        {MyApp.Worker, []},
        {MyApp.Notifier, []}
      ]

      Supervisor.init(children, strategy: :one_for_one)
    end
  end
|
Use when: Children are independent of each other.
:one_for_all
If any child terminates, all children are terminated and restarted.
| Supervisor.init(children, strategy: :one_for_all)
|
Use when: All children depend on each other and must be in sync.
:rest_for_one
If a child terminates, the children started after it are terminated, and then the failed child and those later children are restarted. Children started before it are unaffected.
| children = [
    {MyApp.Database, []},  # If this crashes, all below restart
    {MyApp.Cache, []},     # If this crashes, Processor restarts too
    {MyApp.Processor, []}  # If this crashes, only this restarts
  ]

  Supervisor.init(children, strategy: :rest_for_one)
|
Use when: Children have a dependency chain.
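The three strategies can be compared side by side with a small experiment. This is a sketch, not production code: StrategyDemo and its child names are hypothetical, the children are throwaway Agents, and the polling helper exists only so the demo waits for the supervisor to finish restarting.

```elixir
defmodule StrategyDemo do
  @names [:database, :cache, :processor]

  # Starts three throwaway Agents in order, kills `victim`, and returns
  # the names of the children that ended up with a new pid.
  def restarted_after_crash(strategy, victim) do
    children =
      for name <- @names do
        %{id: name, start: {Agent, :start_link, [fn -> name end]}}
      end

    {:ok, sup} = Supervisor.start_link(children, strategy: strategy)
    before = pids(sup)

    Process.exit(before[victim], :kill)
    afterwards = await_restart(sup, victim, before[victim])
    Supervisor.stop(sup)

    for name <- @names, before[name] != afterwards[name], do: name
  end

  defp pids(sup) do
    Map.new(Supervisor.which_children(sup), fn {id, pid, _type, _mods} -> {id, pid} end)
  end

  # Polls until the supervisor reports a new pid for the victim.
  defp await_restart(sup, victim, old_pid) do
    current = pids(sup)

    case current[victim] do
      pid when is_pid(pid) and pid != old_pid ->
        current

      _ ->
        Process.sleep(10)
        await_restart(sup, victim, old_pid)
    end
  end
end

for strategy <- [:one_for_one, :one_for_all, :rest_for_one] do
  IO.inspect(StrategyDemo.restarted_after_crash(strategy, :cache), label: strategy)
end
```

Killing the middle child makes the difference visible: only the cache is replaced under :one_for_one, all three children under :one_for_all, and the cache plus the processor under :rest_for_one.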
Real-World Supervision Tree
Let’s design a supervision tree for an order processing system:
| defmodule OrderSystem.Application do
    use Application

    @impl true
    def start(_type, _args) do
      children = [
        # Database connections - critical, so they start first.
        # Note: under :one_for_one, a Repo crash restarts only the Repo.
        OrderSystem.Repo,
        # Registry used to name order processors by order ID
        {Registry, keys: :unique, name: OrderSystem.Registry},
        # Core services supervisor
        {OrderSystem.ServicesSupervisor, []},
        # Worker pool for order processing
        {OrderSystem.OrderProcessorSupervisor, []},
        # Web endpoint
        OrderSystemWeb.Endpoint
      ]

      opts = [strategy: :one_for_one, name: OrderSystem.Supervisor]
      Supervisor.start_link(children, opts)
    end
  end

  defmodule OrderSystem.ServicesSupervisor do
    use Supervisor

    def start_link(arg) do
      Supervisor.start_link(__MODULE__, arg, name: __MODULE__)
    end

    @impl true
    def init(_arg) do
      children = [
        {OrderSystem.InventoryService, []},
        {OrderSystem.PaymentService, []},
        {OrderSystem.NotificationService, []}
      ]

      # If inventory fails, payment and notification should restart
      # to ensure consistent state
      Supervisor.init(children, strategy: :rest_for_one)
    end
  end

  defmodule OrderSystem.OrderProcessorSupervisor do
    use DynamicSupervisor

    def start_link(arg) do
      DynamicSupervisor.start_link(__MODULE__, arg, name: __MODULE__)
    end

    @impl true
    def init(_arg) do
      DynamicSupervisor.init(strategy: :one_for_one)
    end

    # Starts one processor per order at runtime
    def start_processor(order_id) do
      spec = {OrderSystem.OrderProcessor, order_id}
      DynamicSupervisor.start_child(__MODULE__, spec)
    end
  end
|
DynamicSupervisor for Runtime Children
When you need to start children dynamically (like one worker per order):
| defmodule OrderSystem.OrderProcessor do
    # :temporary means the supervisor never restarts a processor, even after a crash
    use GenServer, restart: :temporary

    def start_link(order_id) do
      GenServer.start_link(__MODULE__, order_id,
        name: via_tuple(order_id)
      )
    end

    # Requires a unique-key Registry named OrderSystem.Registry to be running
    defp via_tuple(order_id) do
      {:via, Registry, {OrderSystem.Registry, order_id}}
    end

    @impl true
    def init(order_id) do
      # Load order and start processing
      order = OrderSystem.Orders.get!(order_id)
      send(self(), :process)
      {:ok, %{order: order, step: :started}}
    end

    @impl true
    def handle_info(:process, %{order: order, step: :started} = state) do
      case process_order(order) do
        {:ok, _} ->
          {:stop, :normal, state}

        {:error, reason} ->
          {:stop, reason, state}
      end
    end

    # validate_inventory/1, process_payment/1 and update_order_status/2 are
    # application-specific and not shown; each returns {:ok, _} or {:error, reason}
    defp process_order(order) do
      with {:ok, _} <- validate_inventory(order),
           {:ok, _} <- process_payment(order),
           {:ok, _} <- update_order_status(order, :completed) do
        {:ok, order}
      end
    end
  end
|
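The combination of a DynamicSupervisor and a unique Registry can be tried in isolation. The sketch below uses hypothetical names (DynDemo, DynDemo.Worker, DynDemo.Registry, DynDemo.Workers) and a worker that does nothing, so it only illustrates the naming and start-up mechanics, not order processing itself.

```elixir
defmodule DynDemo.Worker do
  use GenServer, restart: :temporary

  def start_link(order_id) do
    GenServer.start_link(__MODULE__, order_id, name: via(order_id))
  end

  # Names the process by order ID through the Registry.
  def via(order_id), do: {:via, Registry, {DynDemo.Registry, order_id}}

  @impl true
  def init(order_id), do: {:ok, %{order_id: order_id}}
end

defmodule DynDemo do
  def run do
    children = [
      {Registry, keys: :unique, name: DynDemo.Registry},
      {DynamicSupervisor, name: DynDemo.Workers, strategy: :one_for_one}
    ]

    {:ok, sup} = Supervisor.start_link(children, strategy: :one_for_one)

    {:ok, pid} = DynamicSupervisor.start_child(DynDemo.Workers, {DynDemo.Worker, 42})

    # A second worker for the same order is rejected by the unique Registry.
    {:error, {:already_started, ^pid}} =
      DynamicSupervisor.start_child(DynDemo.Workers, {DynDemo.Worker, 42})

    # The worker can be found by order ID without tracking its pid.
    [{^pid, _value}] = Registry.lookup(DynDemo.Registry, 42)

    active = DynamicSupervisor.count_children(DynDemo.Workers).active
    Supervisor.stop(sup)
    active
  end
end

IO.inspect(DynDemo.run(), label: "active workers")
```

Because the Registry enforces uniqueness, callers can treat an :already_started error as "this order is already being processed" instead of keeping their own bookkeeping.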
Restart Intensity and Period
Control how many restarts are allowed before giving up:
| Supervisor.init(children,
    strategy: :one_for_one,
    max_restarts: 3, # Max restarts allowed
    max_seconds: 5   # Within this time period
  )
|
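If more than 3 restarts occur within 5 seconds, the supervisor gives up: it terminates all of its children and then exits itself, leaving its parent supervisor to handle the situation. These values (3 restarts in 5 seconds) are also the defaults.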
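The give-up behaviour can be observed with a child that always crashes right after starting. This is a sketch under assumptions: IntensityDemo and IntensityDemo.Flaky are hypothetical names, and the limit is lowered to 2 restarts so the demo finishes quickly.

```elixir
defmodule IntensityDemo.Flaky do
  use GenServer

  def start_link(report_to), do: GenServer.start_link(__MODULE__, report_to)

  @impl true
  def init(report_to) do
    send(report_to, :started)
    # Start successfully, then crash immediately afterwards.
    {:ok, report_to, {:continue, :crash}}
  end

  @impl true
  def handle_continue(:crash, state), do: {:stop, :boom, state}
end

defmodule IntensityDemo do
  def run do
    # Trap exits so the supervisor's death arrives as a message instead
    # of taking this (linked) process down with it.
    was_trapping = Process.flag(:trap_exit, true)

    {:ok, sup} =
      Supervisor.start_link([{IntensityDemo.Flaky, self()}],
        strategy: :one_for_one,
        max_restarts: 2,
        max_seconds: 5
      )

    reason =
      receive do
        {:EXIT, ^sup, reason} -> reason
      after
        2_000 -> :still_running
      end

    Process.flag(:trap_exit, was_trapping)
    {count_starts(0), reason}
  end

  # Counts the :started messages the children sent before the supervisor gave up.
  defp count_starts(n) do
    receive do
      :started -> count_starts(n + 1)
    after
      100 -> n
    end
  end
end

IO.inspect(IntensityDemo.run(), label: "{starts, supervisor exit reason}")
```

With max_restarts: 2 the child runs three times in total (the initial start plus two restarts); the third crash exceeds the limit and the supervisor exits with reason :shutdown.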
Child Restart Values
Control individual child restart behavior:
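| # Always restart (default for GenServer)
  Supervisor.child_spec({MyWorker, []}, restart: :permanent)

  # Restart only on abnormal exit (any reason other than
  # :normal, :shutdown, or {:shutdown, term})
  Supervisor.child_spec({MyWorker, []}, restart: :transient)

  # Never restart
  Supervisor.child_spec({MyWorker, []}, restart: :temporary)
|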
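The effect of each restart value can be checked with a worker that stops on request. The sketch below is illustrative only: RestartDemo and its nested Worker are hypothetical, and the polling helper simply waits until the supervisor has processed the exit.

```elixir
defmodule RestartDemo do
  defmodule Worker do
    use GenServer

    def start_link(_arg), do: GenServer.start_link(__MODULE__, nil)

    @impl true
    def init(nil), do: {:ok, nil}

    @impl true
    def handle_cast({:stop, reason}, state), do: {:stop, reason, state}
  end

  # Returns true if a worker with the given `restart` value is restarted
  # after exiting with `exit_reason`.
  def restarted?(restart, exit_reason) do
    spec = Supervisor.child_spec({Worker, []}, id: :worker, restart: restart)
    {:ok, sup} = Supervisor.start_link([spec], strategy: :one_for_one)
    [{:worker, old_pid, _type, _mods}] = Supervisor.which_children(sup)

    GenServer.cast(old_pid, {:stop, exit_reason})
    children = await_exit_handled(sup, old_pid)
    Supervisor.stop(sup)

    # A restarted child shows up with a live pid; otherwise the entry is
    # :undefined (transient) or gone entirely (temporary).
    match?([{:worker, pid, _, _}] when is_pid(pid), children)
  end

  # Polls until the supervisor no longer lists the old pid.
  defp await_exit_handled(sup, old_pid) do
    children = Supervisor.which_children(sup)

    if Enum.any?(children, fn {_id, pid, _type, _mods} -> pid == old_pid end) do
      Process.sleep(10)
      await_exit_handled(sup, old_pid)
    else
      children
    end
  end
end

for {restart, reason} <- [permanent: :normal, transient: :normal, transient: :boom, temporary: :boom] do
  IO.puts("#{restart} after #{inspect(reason)}: #{RestartDemo.restarted?(restart, reason)}")
end
```

A :permanent worker comes back even after a normal exit, a :transient one only after an abnormal exit, and a :temporary one never.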
Graceful Shutdown
Handle termination properly for cleanup:
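| defmodule MyApp.Worker do
    use GenServer
    require Logger

    def start_link(state \\ %{}) do
      GenServer.start_link(__MODULE__, state)
    end

    @impl true
    def init(state) do
      # Trap exits so terminate/2 runs when the supervisor shuts this process down
      Process.flag(:trap_exit, true)
      {:ok, state}
    end

    @impl true
    def terminate(reason, state) do
      # Clean up resources (save_state_to_disk/1 and close_connections/1
      # are application-specific helpers, not shown)
      save_state_to_disk(state)
      close_connections(state)
      Logger.info("Worker terminated: #{inspect(reason)}")
      :ok
    end
  end
|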
Set shutdown timeout in the supervisor:
| children = [
    %{
      id: MyApp.Worker,
      start: {MyApp.Worker, :start_link, []},
      shutdown: 30_000 # 30 seconds to clean up
    }
  ]
|
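To confirm that cleanup actually runs, the sketch below stops a supervisor and checks that the child's terminate/2 was called. ShutdownDemo and ShutdownDemo.Worker are hypothetical names, and the worker reports to the calling process instead of doing real cleanup.

```elixir
defmodule ShutdownDemo.Worker do
  use GenServer

  def start_link(notify_pid), do: GenServer.start_link(__MODULE__, notify_pid)

  @impl true
  def init(notify_pid) do
    # Without trapping exits, the shutdown signal would kill the process
    # before terminate/2 had a chance to run.
    Process.flag(:trap_exit, true)
    {:ok, notify_pid}
  end

  @impl true
  def terminate(reason, notify_pid) do
    # Stand-in for real cleanup work.
    send(notify_pid, {:cleaned_up, reason})
    :ok
  end
end

defmodule ShutdownDemo do
  def run do
    # Override only the shutdown timeout of the default child spec.
    spec = Supervisor.child_spec({ShutdownDemo.Worker, self()}, shutdown: 30_000)
    {:ok, sup} = Supervisor.start_link([spec], strategy: :one_for_one)

    :ok = Supervisor.stop(sup)

    receive do
      {:cleaned_up, reason} -> reason
    after
      1_000 -> :no_cleanup
    end
  end
end

IO.inspect(ShutdownDemo.run(), label: "terminate/2 reason")
```

The reason passed to terminate/2 here is :shutdown. If cleanup takes longer than the configured timeout, the supervisor kills the child unconditionally, so the timeout should cover the slowest realistic cleanup.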
Visualizing Your Supervision Tree
Use Observer to see your tree in action:
| # In IEx
:observer.start()
|
Or programmatically:
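| def print_tree(supervisor) do
    children = Supervisor.which_children(supervisor)

    Enum.each(children, fn {id, pid, type, _modules} ->
      # id may be any term, so inspect it rather than interpolating it directly
      IO.puts("#{inspect(id)} (#{inspect(pid)}) - #{type}")

      # pid is :restarting or :undefined when the child is not running
      if type == :supervisor and is_pid(pid) do
        print_tree(pid)
      end
    end)
  end
|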
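In tests it can be more convenient to get the tree back as data than to print it. The variant below is a sketch with hypothetical names (TreeDump, :inner, :leaf); it returns child ids and represents each running supervisor as an {id, children} tuple.

```elixir
defmodule TreeDump do
  # Returns a list of child ids; running supervisors become {id, children}.
  def tree(supervisor) do
    for {id, pid, type, _modules} <- Supervisor.which_children(supervisor) do
      if type == :supervisor and is_pid(pid), do: {id, tree(pid)}, else: id
    end
  end

  # Builds a tiny two-level tree and dumps it.
  def demo do
    leaf = %{id: :leaf, start: {Agent, :start_link, [fn -> 0 end]}}

    inner = %{
      id: :inner,
      type: :supervisor,
      start: {Supervisor, :start_link, [[leaf], [strategy: :one_for_one]]}
    }

    {:ok, top} = Supervisor.start_link([inner], strategy: :one_for_one)
    result = tree(top)
    Supervisor.stop(top)
    result
  end
end

IO.inspect(TreeDump.demo(), label: "tree")
```

For this two-level tree the result is a single :inner entry containing the :leaf child, which is easy to assert against in a test.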
Conclusion
Supervision trees are the backbone of Elixir’s fault tolerance. By designing thoughtful supervision strategies, you can build systems that automatically recover from failures without human intervention.
Key takeaways:
- Choose the right strategy for your dependency model
- Use DynamicSupervisor for runtime-spawned workers
- Configure restart intensity to prevent restart loops
- Handle graceful shutdown for resource cleanup
At Sajima Solutions, we architect systems using these patterns to deliver reliable services across the Gulf region. Contact us to learn how we can help build your fault-tolerant infrastructure.