IPL — Initial Program Load
What Is IPL?
IPL (Initial Program Load) is the BEDS bootstrap sequence. The term comes from IBM mainframe terminology — the process of loading the operating system from disk into memory and starting it. BEDS borrows the term because the concept is identical: a strict, ordered sequence of steps that must all succeed before the node is considered operational.
ipl() is the first function called from main(). If IPL completes successfully, the node is green and enters its operational state. If any required step fails, IPL aborts and the process exits with a console error report.
Current Runtime State (2026-04)
The current Rust runtime behavior has advanced beyond the original POC notes on this page:
- main() is now resident after IPL completes.
- Dispatcher workers are started as a pool and the process remains active waiting for shutdown.
- A broker shutdown command now triggers coordinated process shutdown.
- IPL writes structured startup milestones to logger storage (msLogs) after logger init.
- Development config can optionally purge selected admin/logger collections on IPL.
- Optional trace-style method-entry logging is available through config (trace_on) for deep backend diagnostics.
Treat the sequence below as conceptual IPL ordering; the operational lifecycle now includes a post-IPL resident runtime and coordinated shutdown. A sketch of that resident phase follows.
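As a rough illustration of the resident phase, the sketch below assumes a tokio broadcast channel that a dispatcher signals when the broker shutdown command arrives; run_resident, the pool size parameter, and the channel wiring are illustrative names, not the actual BEDS symbols.

```rust
use tokio::sync::broadcast;

// Illustrative resident phase: spawn a dispatcher pool and stay alive until
// a coordinated shutdown is broadcast (for example after a broker `shutdown`
// command is consumed by one of the workers).
async fn run_resident(pool_size: usize, shutdown_rx: broadcast::Receiver<()>) {
    let mut workers = Vec::with_capacity(pool_size);
    for id in 0..pool_size {
        let mut rx = shutdown_rx.resubscribe();
        workers.push(tokio::spawn(async move {
            // A real worker would consume AMQP events here; this sketch only
            // waits for the shutdown broadcast and exits cleanly.
            let _ = rx.recv().await;
            tracing::info!(worker = id, "dispatcher worker stopping");
        }));
    }

    // The process remains resident until every worker has drained and exited.
    for w in workers {
        let _ = w.await;
    }
}
```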
Why Order Matters
The IPL sequence is not arbitrary. Each step depends on the previous one:
- Configuration must load first — every subsequent step reads from it
- Logging must initialize second — every subsequent step may emit log events
- Runtime templates must validate third — startup must reject invalid template manifests before service checks
- RabbitMQ must be reachable fourth — it is the transport for everything, including log event routing to admin
- MongoDB must be reachable fifth — it is the log persistence store on the admin node, and the primary document store on appServer
- MariaDB must be reachable sixth — it is the relational store; non-critical in dev but required in production
You cannot initialize logging before loading config because the log destination (syslog vs console, mirror settings) is in the config. You cannot validate RabbitMQ before initializing logging because you need logging to report the result. The order is a dependency chain, not a preference.
The IPL Sequence
Step 1: Load Configuration
let cfg = config::load().map_err(|e| format!("Failed to load config: {}", e))?;
Loads config/beds.toml as the base, then merges config/env_{BEDS_ENV}.toml on top. The ? operator short-circuits on failure — if the config cannot be loaded, nothing else runs. This is the only step that is always fatal in every environment, including development. A node without a valid config cannot make any correct decision about anything.
Why fatal everywhere: A missing config is not a recoverable error. It means the node cannot know what it is, where its services are, or how to behave. Continuing would produce undefined behaviour. Fail fast, fail loudly.
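A minimal sketch of the base-plus-overlay load described above, assuming the toml and serde crates; the Config fields shown are a small subset, and the shallow top-level merge and optional overlay are simplifications rather than the actual config::load behavior.

```rust
use serde::Deserialize;

// Illustrative shape only; the real BEDS config has many more fields.
#[derive(Debug, Deserialize, Default)]
#[serde(default)]
struct Config {
    syslog: bool,
    syslog_mirror_console: bool,
}

fn load() -> Result<Config, String> {
    // Base config is mandatory; a missing file is a hard error.
    let base = std::fs::read_to_string("config/beds.toml")
        .map_err(|e| format!("cannot read config/beds.toml: {e}"))?;

    // Environment overlay is merged on top of the base document.
    // Missing overlay is treated as empty here; the real loader may require it.
    let env = std::env::var("BEDS_ENV").unwrap_or_else(|_| "development".into());
    let overlay = std::fs::read_to_string(format!("config/env_{env}.toml")).unwrap_or_default();

    // Simplistic merge: parse both documents and let overlay keys win at the top level.
    let mut doc: toml::Table = base.parse().map_err(|e| format!("bad base TOML: {e}"))?;
    let over: toml::Table = overlay.parse().map_err(|e| format!("bad env TOML: {e}"))?;
    for (k, v) in over {
        doc.insert(k, v);
    }
    toml::Value::Table(doc)
        .try_into()
        .map_err(|e| format!("config deserialize failed: {e}"))
}
```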
Step 2: Initialize Logging
logging::init_from_config(cfg.syslog, cfg.syslog_mirror_console);
Initializes the tracing subscriber with journald and/or console output based on config flags. This must happen before any tracing::info! / tracing::warn! / tracing::error! calls — the tracing macros are no-ops until a subscriber is registered.
Why second: Config is loaded. Logging destination is known. Every step from here on can emit structured log output.
Note on log routing: At this point, log output goes to the local console and/or journald. Log events are not yet routed to the admin node's MongoDB msLogs collection — that requires RabbitMQ to be up (Step 4). Local logging is the fallback that covers the gap between process start and AMQP connectivity.
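A rough sketch of what the subscriber wiring can look like, assuming the tracing-subscriber and tracing-journald crates; the exact layer choices inside init_from_config are an assumption.

```rust
use tracing_subscriber::{fmt, prelude::*, registry, EnvFilter};

// Illustrative subscriber setup: console and/or journald output based on the
// two config flags. Until this runs, the tracing macros are no-ops.
fn init_from_config(syslog: bool, syslog_mirror_console: bool) {
    // Console layer when syslog is off, or when mirroring is requested.
    let console_layer = (!syslog || syslog_mirror_console).then(|| fmt::layer());

    // journald layer only when syslog output is requested and journald is reachable.
    let journald_layer = if syslog {
        tracing_journald::layer().ok()
    } else {
        None
    };

    registry()
        .with(EnvFilter::from_default_env())
        .with(console_layer)
        .with(journald_layer)
        .init();
}
```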
Step 2b: Load and Validate REC Templates
let runtime_templates = template_registry::load_runtime_rec_templates("templates")
Loads every TOML template under templates/, selects schema = "rec", and validates internal consistency before network service validation begins.
Current validation scope includes:
- Every protected field must exist in fields.
- Every declared index/cache/regex/partial index field must exist in fields.
- Every compound_indexes name must appear in index_name_list.
Why here: this catches template drift and invalid declarations early, before brokers or adapters process traffic.
Environment-aware failure handling:
- production: invalid template registry is fatal
- all other environments: invalid registry is warning-only so local POC workflows can continue
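A sketch of the three consistency rules above, using a simplified template struct; the field names here (protected, indexed, index_name_list, compound_indexes) are stand-ins for the actual parsed TOML shape.

```rust
// Hypothetical shape of one parsed REC template; not the exact BEDS structs.
struct RecTemplate {
    name: String,
    fields: Vec<String>,
    protected: Vec<String>,
    indexed: Vec<String>,          // index/cache/regex/partial index fields
    index_name_list: Vec<String>,
    compound_indexes: Vec<String>, // names of declared compound indexes
}

// Validate internal consistency of a single template, mirroring the three
// rules listed above. Collects every violation instead of stopping at the first.
fn validate_template(t: &RecTemplate) -> Vec<String> {
    let mut errors = Vec::new();

    for p in &t.protected {
        if !t.fields.contains(p) {
            errors.push(format!("{}: protected field '{p}' not in fields", t.name));
        }
    }
    for i in &t.indexed {
        if !t.fields.contains(i) {
            errors.push(format!("{}: index/cache/regex field '{i}' not in fields", t.name));
        }
    }
    for c in &t.compound_indexes {
        if !t.index_name_list.contains(c) {
            errors.push(format!("{}: compound index '{c}' not in index_name_list", t.name));
        }
    }
    errors
}
```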
Step 4: Validate RabbitMQ Reachability (TCP)
match services::amqp::validate(&cfg.broker_services) { ... }
Opens a TCP connection to the configured RabbitMQ broker host and port. Does not authenticate or open an AMQP channel — reachability only. The connection is immediately closed. This is a fast pre-flight check before the more expensive authentication step.
Why RabbitMQ first among services: RabbitMQ is the transport for all inter-node communication, including log event routing. If RabbitMQ is unreachable, the node cannot communicate with the rest of the cluster at all. It cannot send logs to admin, receive work events, or return results. Validating it before other services establishes that the backbone is up.
Environment-aware failure handling:
- production: unreachable broker is fatal — the node cannot function
- all other environments: unreachable broker is a warning — IPL continues so developers can work on other components without a running broker
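The reachability probe can be as small as the sketch below, using only the standard library; validate_tcp and the timeout value are illustrative, not the actual services::amqp::validate signature.

```rust
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Duration;

// Illustrative pre-flight check: resolve host:port, open a TCP connection,
// and drop it immediately. No AMQP handshake is attempted here.
fn validate_tcp(host: &str, port: u16, timeout: Duration) -> Result<(), String> {
    let addr = format!("{host}:{port}")
        .to_socket_addrs()
        .map_err(|e| format!("cannot resolve {host}:{port}: {e}"))?
        .next()
        .ok_or_else(|| format!("no address for {host}:{port}"))?;

    TcpStream::connect_timeout(&addr, timeout)
        .map(|_stream| ()) // stream dropped here, closing the connection
        .map_err(|e| format!("{host}:{port} unreachable: {e}"))
}
```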
Step 4b: Authenticate to RabbitMQ + Declare Exchange
let amqp_conn = match services::amqp::AmqpConnection::connect(&cfg.broker_services).await { ... }
Opens a full AMQP session — credentials, vhost, and channel. Then asserts the beds.events topic exchange as durable. The exchange declaration is idempotent: if the exchange already exists with matching parameters, RabbitMQ returns success; if it exists with conflicting parameters, it returns an error.
AmqpConnection holds the live lapin::Connection and lapin::Channel for the session. The connection is kept as a field to prevent early drop — if the connection is dropped while the channel is live, the channel closes.
Why declare the exchange at IPL? The exchange is the single shared routing infrastructure for the entire cluster. Every node that publishes events depends on it. Declaring it idempotently at startup ensures it always exists before any broker task tries to publish. The first node to start creates it; every subsequent node confirms it.
Queue declaration is not IPL's job. Queues are declared by broker tasks when they start — not here. A queue's presence signals that the broker handling it is alive and ready to consume. IPL only asserts the exchange.
Environment-aware failure handling:
- production: authentication failure is fatal
- all other environments: failure is a warning — amqp_conn is None, IPL continues
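A condensed sketch of the session setup with the lapin crate; the URI-based connect, the struct shape, and the field names are assumptions rather than the exact AmqpConnection implementation.

```rust
use lapin::{
    options::ExchangeDeclareOptions, types::FieldTable, Channel, Connection,
    ConnectionProperties, ExchangeKind,
};

// Illustrative AMQP session holder: the connection is kept as a field so it is
// not dropped while the channel is still live.
struct AmqpConnection {
    _connection: Connection,
    channel: Channel,
}

async fn connect(uri: &str) -> Result<AmqpConnection, lapin::Error> {
    // Full AMQP handshake: credentials and vhost come from the URI.
    let connection = Connection::connect(uri, ConnectionProperties::default()).await?;
    let channel = connection.create_channel().await?;

    // Idempotent declare of the shared durable topic exchange.
    channel
        .exchange_declare(
            "beds.events",
            ExchangeKind::Topic,
            ExchangeDeclareOptions { durable: true, ..Default::default() },
            FieldTable::default(),
        )
        .await?;

    Ok(AmqpConnection { _connection: connection, channel })
}
```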
Step 5: Validate MongoDB
match mongo::validate_all(&cfg.rec_services) { ... }
Opens a TCP connection to each configured MongoDB node, one per BEDS service role (app_server, admin, segundo, tercero) that has a rec_services config entry.
Why MongoDB before MariaDB: MongoDB is the log persistence store. On the admin node, it is where msLogs lives. On appServer, it is the primary document store for high-throughput collections. It is typically more critical to the core data path than MariaDB, which tends to hold relational reference data.
Environment-aware failure handling: Same pattern as RabbitMQ — fatal in production, warning in development.
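A sketch of the per-role loop, reusing the validate_tcp helper from the Step 4 sketch above; the role-to-address map is an assumed simplification of the rec_services config shape.

```rust
use std::collections::BTreeMap;
use std::time::Duration;

// Illustrative validate_all: one reachability check per configured service role,
// collecting every failure so the report names all unreachable nodes at once.
fn validate_all(rec_services: &BTreeMap<String, (String, u16)>) -> Result<(), String> {
    let mut failures = Vec::new();
    for (role, (host, port)) in rec_services {
        if let Err(e) = validate_tcp(host, *port, Duration::from_secs(2)) {
            failures.push(format!("{role}: {e}"));
        }
    }
    if failures.is_empty() {
        Ok(())
    } else {
        Err(failures.join("; "))
    }
}
```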
Step 6: Validate MariaDB
match mariadb::validate_all(&cfg.rel_services) { ... }
Opens a TCP connection to the master instance of each configured MariaDB node. The secondary (read replica) is also checked, but secondary failure is always non-fatal — BEDS logs a warning and operates in master-only mode.
Why secondary failure is always non-fatal: A missing or unreachable read replica is a degraded state, not a broken state. The node can still serve all operations through the master. Failing hard on a missing replica would cause unnecessary outages during replica maintenance windows.
Environment-aware failure handling: Master failure is fatal in production, warning in development. Secondary failure is a warning in all environments.
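The severity split described above could be expressed as in the sketch below; env, the helper name, and the message text are illustrative rather than the actual mariadb::validate_all internals.

```rust
// Illustrative severity handling for the MariaDB check: master failure is
// environment-dependent, secondary failure is always a warning.
fn report_mariadb(
    env: &str,
    master: Result<(), String>,
    secondary: Result<(), String>,
) -> Result<(), String> {
    if let Err(e) = secondary {
        // Degraded state: continue in master-only mode rather than failing IPL.
        tracing::warn!("MariaDB secondary unreachable, master-only mode: {e}");
    }
    match master {
        Ok(()) => Ok(()),
        Err(e) if env == "production" => Err(format!("MariaDB master unreachable: {e}")),
        Err(e) => {
            tracing::warn!("MariaDB master unreachable (non-fatal in {env}): {e}");
            Ok(())
        }
    }
}
```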
Step N (not yet implemented): Node Self-Identification
The node writes its identity record — role, capabilities, env, timestamp — to the msNodes collection. This enables topology visibility for operations tooling. It is not a dependency for the core data path.
Final Step: Node Green
tracing::info!("BEDS IPL complete — node green");
All required services are reachable. The node enters its operational state and begins processing AMQP events.
IPL Failure Handling
In Production
Any required service failure is fatal:
[BEDS] [FATAL] [IPL] RabbitMQ unreachable at broker_services.app_server (localhost:5672): connection refused
Process exits with code 1. The supervisor (systemd, Docker, whatever manages the process) should restart with backoff.
In Development
Service failures are warnings. IPL completes regardless:
WARN RabbitMQ unreachable (non-fatal in development): connection refused
This allows a developer to work on, say, the MariaDB adapter without needing a running RabbitMQ instance. The tradeoff is that a dev node may start in a degraded state — the developer is expected to notice the warnings.
The ipl() Function
ipl() lives in src/main.rs. It is async — required because the AMQP authentication step (lapin::Connection::connect) is an async operation. The Tokio runtime is started by the #[tokio::main] attribute on main().
ipl() returns Result<(), String>. Errors are plain strings — the IPL failure message is written directly to stderr with eprintln! before process::exit(1), because at the point of a fatal IPL failure, the logging system may not be fully operational.
main() is intentionally minimal:
#[tokio::main]
async fn main() {
    if let Err(e) = ipl().await {
        eprintln!("[BEDS] [FATAL] [IPL] {}", e);
        std::process::exit(1);
    }
}
All logic is in ipl(). main() exists only to start the runtime and handle the fatal exit path.
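Putting the pieces together, a condensed skeleton of ipl() under the same assumptions as the sketches above; the cfg.env field and the string-keyed environment check are illustrative, and the real function also runs the template validation and AMQP authentication steps.

```rust
// Condensed ipl() skeleton showing the ordering and the environment-aware
// severity pattern; not the full BEDS implementation.
async fn ipl() -> Result<(), String> {
    // Step 1: config is always fatal on failure, in every environment.
    let cfg = config::load().map_err(|e| format!("Failed to load config: {}", e))?;

    // Step 2: logging next, so every later step can report its result.
    logging::init_from_config(cfg.syslog, cfg.syslog_mirror_console);

    let production = cfg.env == "production";

    // Steps 4-6: each service check is fatal only in production.
    for (name, result) in [
        ("RabbitMQ", services::amqp::validate(&cfg.broker_services)),
        ("MongoDB", mongo::validate_all(&cfg.rec_services)),
        ("MariaDB", mariadb::validate_all(&cfg.rel_services)),
    ] {
        match result {
            Ok(()) => tracing::info!("{name} reachable"),
            Err(e) if production => return Err(format!("{name} unreachable: {e}")),
            Err(e) => tracing::warn!("{name} unreachable (non-fatal): {e}"),
        }
    }

    tracing::info!("BEDS IPL complete — node green");
    Ok(())
}
```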
Future IPL Steps
As BEDS matures, the IPL sequence will grow. Expected additions in order:
- Node role determination (which services are is_local)
- Broker pool startup (spawn Tokio tasks per broker type)
- Queue declaration (each broker task declares its own queue on start)
- Node self-identification (write identity record to MongoDB)
- Signal handler registration (SIGTERM, SIGINT for graceful shutdown)
- Node green — begin processing events
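For the signal handler step listed above, a Unix-only sketch with tokio::signal could look like the following; the broadcast channel wiring matches the resident-runtime sketch earlier on this page and is an assumption.

```rust
use tokio::signal::unix::{signal, SignalKind};
use tokio::sync::broadcast;

// Illustrative signal handler registration for the future graceful-shutdown
// step: SIGTERM or SIGINT broadcasts the same shutdown notice the broker
// `shutdown` command would use.
async fn watch_signals(shutdown_tx: broadcast::Sender<()>) -> std::io::Result<()> {
    let mut sigterm = signal(SignalKind::terminate())?;
    let mut sigint = signal(SignalKind::interrupt())?;

    tokio::select! {
        _ = sigterm.recv() => tracing::info!("SIGTERM received"),
        _ = sigint.recv() => tracing::info!("SIGINT received"),
    }

    // Ignore the error case: no receivers simply means nothing is left to stop.
    let _ = shutdown_tx.send(());
    Ok(())
}
```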