rustybeds/wiki/04-ipl.md
gramps e8fdb39ea2 Promote service modules to services/ directory; add AmqpConnection + async IPL
- Flat src/amqp.rs, src/mongo.rs, src/mariadb.rs promoted to src/services/{amqp,mongo,mariadb}/
- services/amqp/connection.rs: AmqpConnection struct with connect() and declare_exchange()
- services/amqp/error.rs: AmqpError type (thiserror, wraps lapin::Error)
- ipl() made async; #[tokio::main] added to main()
- IPL step 3b: authenticate to RabbitMQ + declare beds.events topic exchange (durable)
- Added lapin = "2" and tokio = { version = "1", features = ["full"] } to Cargo.toml
- 12 unit tests pass
- Docs: README, CLAUDE.md, wiki/04-ipl.md, wiki/06-queue-topology.md updated

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 16:52:18 -07:00

# IPL — Initial Program Load
## What Is IPL?
IPL (Initial Program Load) is the BEDS bootstrap sequence. The term comes from IBM mainframe terminology — the process of loading the operating system from disk into memory and starting it. BEDS borrows the term because the concept is identical: a strict, ordered sequence of steps that must all succeed before the node is considered operational.
`ipl()` is the first function called from `main()`. If IPL completes successfully, the node is green and enters its operational state. If any required step fails, IPL aborts and the process exits with a console error report.
## Why Order Matters
The IPL sequence is not arbitrary. Each step depends on the previous one:
1. **Configuration must load first** — every subsequent step reads from it
2. **Logging must initialize second** — every subsequent step may emit log events
3. **RabbitMQ must be reachable third** — it is the transport for everything, including log event routing to admin
4. **MongoDB must be reachable fourth** — it is the log persistence store on the admin node, and the primary document store on appServer
5. **MariaDB must be reachable fifth** — it is the relational store; non-critical in dev but required in production
You cannot initialize logging before loading config because the log destination (syslog vs console, mirror settings) is in the config. You cannot validate RabbitMQ before initializing logging because you need logging to report the result. The order is a dependency chain, not a preference.
## The IPL Sequence
### Step 1: Load Configuration
```rust
let cfg = config::load().map_err(|e| format!("Failed to load config: {}", e))?;
```
Loads `config/beds.toml` as the base, then merges `config/env_{BEDS_ENV}.toml` on top. The `?` operator short-circuits on failure — if the config cannot be loaded, nothing else runs. This is the only step that is always fatal in every environment, including development. A node without a valid config cannot make any correct decision about anything.
**Why fatal everywhere:** A missing config is not a recoverable error. It means the node cannot know what it is, where its services are, or how to behave. Continuing would produce undefined behaviour. Fail fast, fail loudly.
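The base-plus-overlay merge can be sketched with plain maps. This is illustrative only — `merge_config` is a hypothetical name, and the real loader operates on parsed TOML tables rather than flat string maps — but the precedence rule is the same: any key present in the env file wins, everything else falls through to the base.

```rust
use std::collections::HashMap;

/// Sketch of the config overlay rule: keys in the env-specific overlay
/// replace keys in the base; base keys without an override survive.
fn merge_config(
    base: HashMap<String, String>,
    overlay: HashMap<String, String>,
) -> HashMap<String, String> {
    let mut merged = base;
    for (key, value) in overlay {
        merged.insert(key, value); // overlay value replaces the base value
    }
    merged
}

fn main() {
    let base = HashMap::from([
        ("syslog".to_string(), "true".to_string()),
        ("broker_host".to_string(), "localhost".to_string()),
    ]);
    let dev = HashMap::from([
        ("broker_host".to_string(), "dev-broker".to_string()),
    ]);
    let cfg = merge_config(base, dev);
    assert_eq!(cfg["broker_host"], "dev-broker"); // env file wins
    assert_eq!(cfg["syslog"], "true");            // base survives
    println!("merged config: {:?}", cfg);
}
```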
### Step 2: Initialize Logging
```rust
logging::init_from_config(cfg.syslog, cfg.syslog_mirror_console);
```
Initializes the `tracing` subscriber with journald and/or console output based on config flags. This must happen before any `tracing::info!` / `tracing::warn!` / `tracing::error!` calls — the tracing macros are no-ops until a subscriber is registered.
**Why second:** Config is loaded. Logging destination is known. Every step from here on can emit structured log output.
**Note on log routing:** At this point, log output goes to the local console and/or journald. Log events are not yet routed to the admin node's MongoDB `msLogs` collection — that requires RabbitMQ to be up (Step 3). Local logging is the fallback that covers the gap between process start and AMQP connectivity.
### Step 3: Validate RabbitMQ Reachability (TCP)
```rust
match services::amqp::validate(&cfg.broker_services) { ... }
```
Opens a TCP connection to the configured RabbitMQ broker host and port. Does not authenticate or open an AMQP channel — reachability only. The connection is immediately closed. This is a fast pre-flight check before the more expensive authentication step.
**Why RabbitMQ first among services:** RabbitMQ is the transport for all inter-node communication, including log event routing. If RabbitMQ is unreachable, the node cannot communicate with the rest of the cluster at all. It cannot send logs to admin, receive work events, or return results. Validating it before other services establishes that the backbone is up.
**Environment-aware failure handling:**
- `production`: unreachable broker is fatal — the node cannot function
- all other environments: unreachable broker is a warning — IPL continues so developers can work on other components without a running broker
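A reachability-only pre-flight can be sketched with the standard library alone. This is not the actual `services::amqp::validate` implementation — `tcp_reachable` is a hypothetical helper — but it shows the shape: resolve, connect with a short timeout, drop the stream immediately, no AMQP handshake.

```rust
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Duration;

/// Reachability-only check: can we open (and immediately close) a TCP
/// connection to this host:port within the timeout?
fn tcp_reachable(addr: &str, timeout: Duration) -> bool {
    let Ok(mut addrs) = addr.to_socket_addrs() else {
        return false; // hostname did not resolve
    };
    let Some(addr) = addrs.next() else {
        return false; // resolved to no addresses
    };
    match TcpStream::connect_timeout(&addr, timeout) {
        Ok(stream) => {
            drop(stream); // close immediately; reachability is all we wanted
            true
        }
        Err(_) => false,
    }
}

fn main() {
    // An unresolvable host (.invalid is a reserved TLD) can never be reachable.
    let up = tcp_reachable("broker.invalid:5672", Duration::from_millis(200));
    println!("broker reachable: {up}");
    assert!(!up);
}
```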
### Step 3b: Authenticate to RabbitMQ + Declare Exchange
```rust
let amqp_conn = match services::amqp::AmqpConnection::connect(&cfg.broker_services).await { ... }
```
Opens a full AMQP session — credentials, vhost, and channel. Then asserts the `beds.events` topic exchange as durable. The exchange declaration is idempotent: if the exchange already exists with matching parameters, RabbitMQ returns success; if it exists with conflicting parameters, it returns an error.
`AmqpConnection` holds the live `lapin::Connection` and `lapin::Channel` for the session. The connection is kept as a field to prevent early drop — if the connection is dropped while the channel is live, the channel closes.
**Why declare the exchange at IPL?** The exchange is the single shared routing infrastructure for the entire cluster. Every node that publishes events depends on it. Declaring it idempotently at startup ensures it always exists before any broker task tries to publish. The first node to start creates it; every subsequent node confirms it.
**Queue declaration is not IPL's job.** Queues are declared by broker tasks when they start — not here. A queue's presence signals that the broker handling it is alive and ready to consume. IPL only asserts the exchange.
**Environment-aware failure handling:**
- `production`: authentication failure is fatal
- all other environments: failure is a warning — `amqp_conn` is `None`, IPL continues
When `amqp_conn` is `None`, any later code that would publish events must tolerate the missing session rather than assume it exists.
### Step 4: Validate MongoDB
```rust
match services::mongo::validate_all(&cfg.rec_services) { ... }
```
Opens a TCP connection to each configured MongoDB node. One entry per BEDS service role (app_server, admin, segundo, tercero) that has a `rec_services` config entry.
**Why MongoDB before MariaDB:** MongoDB is the log persistence store. On the admin node, it is where `msLogs` lives. On appServer, it is the primary document store for high-throughput collections. It is typically more critical to the core data path than MariaDB, which tends to hold relational reference data.
**Environment-aware failure handling:** Same pattern as RabbitMQ — fatal in production, warning in development.
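The `validate_all` pattern — one check per configured role, with all results collected so every failure can be reported rather than stopping at the first — can be sketched generically. The signature below is illustrative, not the real one; the real function works against `rec_services` config entries, and here the check is injected as a closure so the sketch stays self-contained.

```rust
use std::collections::BTreeMap;

/// Run one reachability check per configured service role and collect
/// per-role results, so the caller can report every failure at once.
fn validate_all<F>(services: &BTreeMap<String, String>, check: F) -> Vec<(String, bool)>
where
    F: Fn(&str) -> bool,
{
    services
        .iter()
        .map(|(role, addr)| (role.clone(), check(addr)))
        .collect()
}

fn main() {
    let services = BTreeMap::from([
        ("admin".to_string(), "admin-host:27017".to_string()),
        ("app_server".to_string(), "app-host:27017".to_string()),
    ]);
    // Stub check: pretend only the admin node is reachable.
    let results = validate_all(&services, |addr| addr.starts_with("admin"));
    for (role, ok) in &results {
        println!("{role}: {}", if *ok { "reachable" } else { "unreachable" });
    }
}
```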
### Step 5: Validate MariaDB
```rust
match services::mariadb::validate_all(&cfg.rel_services) { ... }
```
Opens a TCP connection to the master instance of each configured MariaDB node. The secondary (read replica) is also checked, but secondary failure is always non-fatal — BEDS logs a warning and operates in master-only mode.
**Why secondary failure is always non-fatal:** A missing or unreachable read replica is a degraded state, not a broken state. The node can still serve all operations through the master. Failing hard on a missing replica would cause unnecessary outages during replica maintenance windows.
**Environment-aware failure handling:** Master failure is fatal in production, warning in development. Secondary failure is a warning in all environments.
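The master/secondary decision table is small enough to model directly. The enum and `classify` function below are hypothetical names, a sketch of the policy rather than the actual code: secondary failure degrades, master failure is the only case the environment rules escalate to fatal.

```rust
/// How master vs secondary reachability maps to the node's relational mode.
#[derive(Debug, PartialEq)]
enum RelationalMode {
    Full,       // master and secondary both reachable
    MasterOnly, // secondary down: degraded but operational
    Down,       // master down: fatal in production, warning elsewhere
}

fn classify(master_ok: bool, secondary_ok: bool) -> RelationalMode {
    match (master_ok, secondary_ok) {
        (true, true) => RelationalMode::Full,
        (true, false) => RelationalMode::MasterOnly, // warn, keep going
        (false, _) => RelationalMode::Down,          // env decides severity
    }
}

fn main() {
    // Replica maintenance window: master up, secondary down.
    let mode = classify(true, false);
    assert_eq!(mode, RelationalMode::MasterOnly);
    println!("secondary down -> {mode:?}");
}
```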
### Step N (not yet implemented): Node Self-Identification
The node writes its identity record — role, capabilities, env, timestamp — to the `msNodes` collection. This enables topology visibility for operations tooling. It is not a dependency for the core data path.
### Final Step: Node Green
```rust
tracing::info!("BEDS IPL complete — node green");
```
All required services are reachable. The node enters its operational state and begins processing AMQP events.
## IPL Failure Handling
### In Production
Any required service failure is fatal:
```
[BEDS] [FATAL] [IPL] RabbitMQ unreachable at broker_services.app_server (localhost:5672): connection refused
```
Process exits with code 1. The supervisor (systemd, Docker, whatever manages the process) should restart with backoff.
### In Development
Service failures are warnings. IPL completes regardless:
```
WARN RabbitMQ unreachable (non-fatal in development): connection refused
```
This allows a developer to work on, say, the MariaDB adapter without needing a running RabbitMQ instance. The tradeoff is that a dev node may start in a degraded state — the developer is expected to notice the warnings.
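The fatal-in-production / warn-in-dev rule can be expressed as a small wrapper around each validation result. `env_gate` is a hypothetical helper, not the actual IPL code, but it captures the policy: in production the error propagates (and `ipl()` aborts); everywhere else it is downgraded to a warning and IPL continues.

```rust
/// Wrap a validation result: failures are fatal in production,
/// downgraded to warnings in every other environment.
fn env_gate(env: &str, step: &str, result: Result<(), String>) -> Result<(), String> {
    match result {
        Ok(()) => Ok(()),
        Err(e) if env == "production" => {
            Err(format!("{step} failed: {e}")) // bubbles up; ipl() aborts
        }
        Err(e) => {
            eprintln!("WARN {step} failed (non-fatal in {env}): {e}");
            Ok(()) // IPL continues in a degraded state
        }
    }
}

fn main() {
    // Dev: downgraded to a warning, IPL would continue.
    assert!(env_gate("development", "RabbitMQ", Err("connection refused".into())).is_ok());
    // Production: the error propagates and IPL aborts.
    assert!(env_gate("production", "RabbitMQ", Err("connection refused".into())).is_err());
}
```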
## The `ipl()` Function
`ipl()` lives in `src/main.rs`. It is `async` — required because the AMQP authentication step (`lapin::Connection::connect`) is an async operation. The Tokio runtime is started by the `#[tokio::main]` attribute on `main()`.
`ipl()` returns `Result<(), String>`. Errors are plain strings — the IPL failure message is written directly to stderr with `eprintln!` before `process::exit(1)`, because at the point of a fatal IPL failure, the logging system may not be fully operational.
`main()` is intentionally minimal:
```rust
#[tokio::main]
async fn main() {
    if let Err(e) = ipl().await {
        eprintln!("[BEDS] [FATAL] [IPL] {}", e);
        std::process::exit(1);
    }
}
```
All logic is in `ipl()`. `main()` exists only to start the runtime and handle the fatal exit path.
## Future IPL Steps
As BEDS matures, the IPL sequence will grow. Expected additions in order:
1. Node role determination (which services are `is_local`)
2. Broker pool startup (spawn Tokio tasks per broker type)
3. Queue declaration (each broker task declares its own queue on start)
4. Node self-identification (write identity record to MongoDB)
5. Signal handler registration (SIGTERM, SIGINT for graceful shutdown)
6. Node green — begin processing events