When a telecom switch fails, calls can’t just drop. When a banking system crashes, transactions can’t disappear. When a game server dies, players can’t lose their progress. These industries have solved fault tolerance—here’s how they do it.

The secret is supervision trees: hierarchical structures where parent processes monitor children, automatically restarting them when they fail. This architecture powers systems with “nine nines” (99.9999999%) availability—less than 32 milliseconds of downtime per year.

The Philosophy: Let It Crash

Unlike traditional error handling, where you try to anticipate and handle every possible error, OTP (Erlang's battle-tested application framework, which Elixir builds on) embraces a different philosophy: let processes crash, but make sure they restart correctly.

This seems counterintuitive, but it’s incredibly powerful:

  • You don’t need defensive code for every edge case
  • Fresh process state eliminates corrupted state issues
  • Supervisors isolate failures to prevent cascade effects
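A minimal sketch of the idea (module names are illustrative): the worker below does no defensive input checking. A bad message simply crashes the process, and its supervisor restarts it with clean state.

```elixir
defmodule Demo.Counter do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, 0, name: __MODULE__)

  @impl true
  def init(count), do: {:ok, count}

  @impl true
  def handle_call({:add, n}, _from, count) do
    # No guard against non-numeric n: bad input crashes the process
    # instead of silently corrupting the counter.
    {:reply, count + n, count + n}
  end
end

# Under a supervisor, a crash is contained and the counter restarts at 0:
# Supervisor.start_link([Demo.Counter], strategy: :one_for_one)
```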

Understanding Supervision Strategies

:one_for_one

If a child process terminates, only that process is restarted. Other children are unaffected.

defmodule MyApp.Supervisor do
  use Supervisor

  def start_link(init_arg) do
    Supervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
  end

  @impl true
  def init(_init_arg) do
    children = [
      {MyApp.Cache, []},
      {MyApp.Worker, []},
      {MyApp.Notifier, []}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end

Use when: Children are independent of each other.

:one_for_all

If any child terminates, all children are terminated and restarted.

Supervisor.init(children, strategy: :one_for_all)

Use when: All children depend on each other and must be in sync.

:rest_for_one

If a child terminates, all children started after it are terminated and restarted.

children = [
  {MyApp.Database, []},      # If this crashes, it and everything below restart
  {MyApp.Cache, []},         # If this crashes, Processor restarts too
  {MyApp.Processor, []}      # If this crashes, only this restarts
]

Supervisor.init(children, strategy: :rest_for_one)

Use when: Children have a dependency chain.

Real-World Supervision Tree

Let’s design a supervision tree for an order processing system:

defmodule OrderSystem.Application do
  use Application

  def start(_type, _args) do
    children = [
      # Database connection pool (started first)
      OrderSystem.Repo,
      
      # Core services supervisor
      {OrderSystem.ServicesSupervisor, []},
      
      # Worker pool for order processing
      {OrderSystem.OrderProcessorSupervisor, []},
      
      # Web endpoint
      OrderSystemWeb.Endpoint
    ]

    opts = [strategy: :one_for_one, name: OrderSystem.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

defmodule OrderSystem.ServicesSupervisor do
  use Supervisor

  def start_link(arg) do
    Supervisor.start_link(__MODULE__, arg, name: __MODULE__)
  end

  @impl true
  def init(_arg) do
    children = [
      {OrderSystem.InventoryService, []},
      {OrderSystem.PaymentService, []},
      {OrderSystem.NotificationService, []}
    ]

    # If inventory fails, payment and notification should restart
    # to ensure consistent state
    Supervisor.init(children, strategy: :rest_for_one)
  end
end

defmodule OrderSystem.OrderProcessorSupervisor do
  use DynamicSupervisor

  def start_link(arg) do
    DynamicSupervisor.start_link(__MODULE__, arg, name: __MODULE__)
  end

  @impl true
  def init(_arg) do
    DynamicSupervisor.init(strategy: :one_for_one)
  end

  def start_processor(order_id) do
    spec = {OrderSystem.OrderProcessor, order_id}
    DynamicSupervisor.start_child(__MODULE__, spec)
  end
end

DynamicSupervisor for Runtime Children

When you need to start children dynamically (like one worker per order):

defmodule OrderSystem.OrderProcessor do
  use GenServer, restart: :temporary

  def start_link(order_id) do
    GenServer.start_link(__MODULE__, order_id, 
      name: via_tuple(order_id)
    )
  end

  defp via_tuple(order_id) do
    {:via, Registry, {OrderSystem.Registry, order_id}}
  end

  @impl true
  def init(order_id) do
    # Load order and start processing
    order = OrderSystem.Orders.get!(order_id)
    send(self(), :process)
    {:ok, %{order: order, step: :started}}
  end

  @impl true
  def handle_info(:process, %{order: order, step: :started} = state) do
    case process_order(order) do
      {:ok, _} -> 
        {:stop, :normal, state}
      {:error, reason} ->
        {:stop, reason, state}
    end
  end

  defp process_order(order) do
    with {:ok, _} <- validate_inventory(order),
         {:ok, _} <- process_payment(order),
         {:ok, _} <- update_order_status(order, :completed) do
      {:ok, order}
    end
  end
end
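The via tuple above assumes a unique-keys Registry named OrderSystem.Registry is already running. One way to provide it is to add it to the application's child list, ahead of the processor supervisor (a sketch; the order ID is illustrative):

```elixir
# In OrderSystem.Application's children, before OrderSystem.OrderProcessorSupervisor:
{Registry, keys: :unique, name: OrderSystem.Registry}

# Then spawn one processor per order at runtime:
OrderSystem.OrderProcessorSupervisor.start_processor("order-123")
# Returns {:ok, pid} on success. Because the processor declares
# restart: :temporary, one that crashes mid-order is not blindly
# restarted with the same (possibly poisonous) order.
```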

Restart Intensity and Period

Control how many restarts are allowed before giving up:

Supervisor.init(children, 
  strategy: :one_for_one,
  max_restarts: 3,    # Max restarts allowed
  max_seconds: 5      # Within this time period
)

If the supervisor sees more than 3 restarts in 5 seconds, it gives up and crashes itself—allowing its parent supervisor to handle the situation.

Child Restart Values

Control individual child restart behavior:

# Always restart (the default for use GenServer)
Supervisor.child_spec({MyWorker, []}, restart: :permanent)

# Only restart on abnormal exit
Supervisor.child_spec({MyWorker, []}, restart: :transient)

# Never restart
Supervisor.child_spec({MyWorker, []}, restart: :temporary)
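The restart value can also be baked into the child module itself, which is how OrderSystem.OrderProcessor marks itself :temporary above. A sketch with a hypothetical worker:

```elixir
defmodule MyWorker do
  # Restart only on abnormal exit; stopping with :normal counts as done.
  use GenServer, restart: :transient

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg)

  @impl true
  def init(arg), do: {:ok, arg}
end
```

This keeps the restart policy next to the code it governs, so every supervisor that starts the worker gets the same behavior.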

Graceful Shutdown

Handle termination properly for cleanup:

defmodule MyApp.Worker do
  use GenServer
  require Logger

  @impl true
  def init(state) do
    Process.flag(:trap_exit, true)
    {:ok, state}
  end

  @impl true
  def terminate(reason, state) do
    # Clean up resources
    save_state_to_disk(state)
    close_connections(state)
    
    Logger.info("Worker terminated: #{inspect(reason)}")
    :ok
  end
end

Set shutdown timeout in the supervisor:

children = [
  %{
    id: MyApp.Worker,
    start: {MyApp.Worker, :start_link, []},
    shutdown: 30_000  # 30 seconds to clean up
  }
]
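Two other standard shutdown values are worth knowing (module names here are illustrative):

```elixir
# :brutal_kill - the child is killed immediately; terminate/2 never runs
%{
  id: MyApp.FastWorker,
  start: {MyApp.FastWorker, :start_link, []},
  shutdown: :brutal_kill
}

# :infinity - wait as long as needed; the default for :supervisor
# children, so a subtree can shut down its own children first
%{
  id: MyApp.SubSupervisor,
  start: {MyApp.SubSupervisor, :start_link, []},
  type: :supervisor,
  shutdown: :infinity
}
```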

Visualizing Your Supervision Tree

Use Observer to see your tree in action:

# In IEx
:observer.start()

Or programmatically:

def print_tree(supervisor) do
  children = Supervisor.which_children(supervisor)
  
  Enum.each(children, fn {id, pid, type, _modules} ->
    IO.puts("#{id} (#{inspect(pid)}) - #{type}")
    
    # pid may be :undefined or :restarting for a child that is down
    if type == :supervisor and is_pid(pid) do
      print_tree(pid)
    end
  end)
end
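For a quick summary without walking the tree, Supervisor.count_children/1 returns per-type counts (the values shown assume the three-worker MyApp.Supervisor from earlier is running):

```elixir
Supervisor.count_children(MyApp.Supervisor)
# => %{active: 3, specs: 3, supervisors: 0, workers: 3}
```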

Conclusion

Supervision trees are the backbone of Elixir’s fault tolerance. By designing thoughtful supervision strategies, you can build systems that automatically recover from failures without human intervention.

Key takeaways:

  • Choose the right strategy for your dependency model
  • Use DynamicSupervisor for runtime-spawned workers
  • Configure restart intensity to prevent restart loops
  • Handle graceful shutdown for resource cleanup

At Sajima Solutions, we architect systems using these patterns to deliver reliable services across the Gulf region. Contact us to learn how we can help build your fault-tolerant infrastructure.