A modern hyperscale data center draws 10–200 MW of electricity — enough to power tens of thousands of homes. Utility power arrives at high voltage (typically 138 kV), is stepped down through on-site transformers, and then flows through a carefully engineered chain: UPS systems (online double-conversion, now predominantly lithium-ion) buffer the 10–15 seconds required for diesel generators to reach full speed. PDUs (Power Distribution Units) deliver metered, dual-feed power to every rack. The entire chain is designed around N+1 or 2N redundancy — no single point of failure. Power efficiency is measured by PUE (Power Usage Effectiveness); world-class facilities hit 1.1–1.2, meaning cooling, power conversion, and other overhead add only 10–20% on top of the IT load. For investors, the acute grid interconnection backlog (3–5 years in most US markets) is driving demand for behind-the-meter generation: onsite solar, battery storage, hydrogen fuel cells, and microgrids are all active investment themes.
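The PUE arithmetic above can be made concrete with a small sketch. The load figures here are illustrative assumptions for a hypothetical 100 MW IT load, not measured data:

```python
# Sketch: PUE arithmetic for a hypothetical facility.
# All numbers are illustrative assumptions, not measured figures.

def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Power Usage Effectiveness = total facility power / IT load power."""
    return total_facility_kw / it_load_kw

it_load = 100_000   # kW delivered to servers, storage, network (assumed)
overhead = 15_000   # kW for cooling, UPS losses, lighting (assumed)
total = it_load + overhead

print(f"PUE = {pue(total, it_load):.2f}")                   # 1.15
print(f"Overhead share of total = {overhead / total:.1%}")  # ~13%
```

Note the subtlety: a PUE of 1.15 means overhead is 15% of the IT load, but only ~13% of total facility power — a distinction that matters when comparing operator claims.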
Cooling is the defining infrastructure challenge of the AI era. Traditional air cooling tops out at roughly 15–25 kW per rack. A rack of eight NVIDIA DGX H100 systems already draws ~82 kW, and GB200 NVL72 racks hit 120 kW — four to eight times what air can handle. The industry response is liquid cooling: CDUs (Coolant Distribution Units) circulate water or dielectric fluid through cold plates attached directly to GPUs, extracting heat at the source. At the facility level, cooling towers and chillers reject heat to the atmosphere, consuming 1–3 million gallons of water per day per 100 MW of compute. European regulations increasingly require waste heat reuse — piping excess heat to district heating networks. For investors, liquid cooling infrastructure (cold plates, CDUs, immersion tanks, manifolds) represents the highest-urgency capex category, with retrofit demand from existing air-cooled facilities adding urgency beyond new builds.
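The CDU sizing problem reduces to a basic energy balance: heat removed equals mass flow times specific heat times temperature rise. A rough sketch for a 120 kW rack, assuming water coolant and an illustrative 10 K temperature rise across the cold plates:

```python
# Sketch: water flow required to remove rack heat via direct liquid cooling.
# Energy balance: Q = m_dot * c_p * delta_T  =>  m_dot = Q / (c_p * delta_T).
# The 10 K coolant rise is an assumed design point, not a vendor spec.

C_P_WATER = 4186.0  # J/(kg*K), specific heat of water

def coolant_flow_kg_s(heat_w: float, delta_t_k: float) -> float:
    """Mass flow of water (kg/s) needed to absorb heat_w at a given temp rise."""
    return heat_w / (C_P_WATER * delta_t_k)

flow = coolant_flow_kg_s(120_000, 10.0)  # kg/s (~= L/s for water)
print(f"{flow:.2f} kg/s  ~= {flow * 60:.0f} L/min")  # ~2.87 kg/s, ~172 L/min
```

Under these assumptions a GB200-class rack needs on the order of 170 L/min of water — modest plumbing per rack, but multiplied across thousands of racks it drives the facility-scale water figures cited above.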
Data center networking operates at two distinct layers. The front-end Ethernet fabric connects servers to the outside world using a spine-leaf topology: a small number of high-radix spine switches (64 ports × 400G = 25.6 Tb/s per switch) interconnect a larger tier of leaf (ToR) switches. This design provides predictable, low-latency paths and easy horizontal scaling. The back-end GPU fabric is entirely different — AI training requires all-to-all collective communication (AllReduce, AllGather) at extreme bandwidth. InfiniBand NDR/XDR (400–800 Gb/s, ~600 ns latency, zero-copy RDMA) dominates today, while RoCE (RDMA over Converged Ethernet) provides a lower-cost alternative. The emerging frontier is silicon photonics and co-packaged optics (CPO): replacing copper links with optical interconnects directly integrated into switch ASICs, potentially reducing networking power by 5–10×. Lightmatter, Ayar Labs, and Intel Silicon Photonics are key companies to watch.
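Spine-leaf sizing comes down to the oversubscription ratio: server-facing bandwidth on each leaf versus its uplink bandwidth to the spines. A minimal sketch using the 400G switch radix from the text; the leaf port counts are hypothetical:

```python
# Sketch: spine-leaf oversubscription. A 1:1 ratio means the fabric is
# non-blocking; GPU back-end fabrics typically target 1:1, while front-end
# Ethernet tolerates mild oversubscription. Port counts below are assumptions.

def oversubscription(server_ports: int, uplink_ports: int,
                     server_gbps: float, uplink_gbps: float) -> float:
    """Ratio of downstream (server-facing) to upstream (spine-facing) bandwidth."""
    return (server_ports * server_gbps) / (uplink_ports * uplink_gbps)

# Hypothetical leaf: 48 x 100G server ports, 8 x 400G uplinks to spines.
ratio = oversubscription(48, 8, 100, 400)
print(f"Oversubscription = {ratio:.1f}:1")  # 1.5:1
```

This is why GPU fabrics are so much more expensive per port than front-end networks: achieving 1:1 for all-to-all collectives means provisioning as much uplink as server bandwidth at every tier.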
Modern AI workloads have bifurcated the compute market. Training demands massive, tightly-coupled clusters: NVIDIA's DGX H100 (8 GPUs, 640 GB HBM3, NVLink 900 GB/s, 10,200W) is today's gold standard, while the GB200 NVL72 (72 GPUs per rack, 120 kW) defines the next generation. Inference at scale favors efficient, purpose-built hardware: custom ASICs from Google (TPU), Meta (MTIA), and Microsoft (Maia) deliver far better performance-per-watt than general-purpose GPUs for specific model architectures. CPU servers remain essential for general-purpose compute, orchestration, and data preprocessing — a typical cluster is roughly 1 GPU node per 4 CPU nodes. The capital investment is staggering: a single DGX H100 costs ~$400K, so a 1,000-GPU cluster (125 systems) runs roughly $50M in GPU hardware alone, before networking, power, and cooling. GPU utilization optimization (Run:ai, CoreWeave's scheduling software) can reduce effective compute cost by 40–60% by moving typical utilization from 30% to 70%+ — a software-defined capital efficiency play with no hardware cost.
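The utilization claim is worth working through, since it's the core of the capital-efficiency argument. Amortizing the same capex over more useful hours directly cuts the effective cost per GPU-hour; the depreciation period below is an assumption:

```python
# Sketch: effective cost per useful GPU-hour at different utilization levels.
# Capex per GPU is derived from the text (~$400K DGX H100 / 8 GPUs);
# the 4-year depreciation window is an assumption.

def effective_cost_per_gpu_hour(capex_per_gpu: float, lifetime_hours: float,
                                utilization: float) -> float:
    """Amortized capex divided by the hours the GPU does useful work."""
    return capex_per_gpu / (lifetime_hours * utilization)

CAPEX = 50_000        # $/GPU: ~$400K system / 8 GPUs
LIFETIME = 4 * 8760   # 4-year depreciation, in hours

low  = effective_cost_per_gpu_hour(CAPEX, LIFETIME, 0.30)
high = effective_cost_per_gpu_hour(CAPEX, LIFETIME, 0.70)
print(f"30% util: ${low:.2f}/h -> 70% util: ${high:.2f}/h, "
      f"saving {1 - high/low:.0%}")  # saving ~57%
```

Moving from 30% to 70% utilization cuts effective cost by 1 − 30/70 ≈ 57%, squarely in the 40–60% range the schedulers claim — and the saving is independent of the absolute capex figure.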
AI storage requirements are defined by two competing demands: capacity (storing petabytes of training data, model checkpoints, and inference logs) and throughput (feeding GPU clusters fast enough that compute is never starved). Traditional SANs and NAS are insufficient — AI training at scale requires parallel file systems (Lustre, GPFS, BeeGFS, Weka) that stripe data across hundreds of NVMe drives and deliver aggregate throughput of hundreds of GB/s. For datasets and checkpoints, dense object storage (S3-compatible) at commodity cost sits behind the high-performance tier. The architecture is typically three-tier: ultra-fast NVMe cache for active training data → parallel file system for warm data → object store for cold data and long-term retention. VAST Data ($9B valuation) validated the market for unified high-performance AI storage. NVMe drives have replaced HDDs for performance tiers: a single NVMe drive delivers ~7 GB/s of sequential reads vs. ~200 MB/s for spinning disk, at declining cost per GB as 3D NAND scales.
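The "never starve compute" requirement can be sized with simple arithmetic. Per-GPU ingest rates are highly workload-dependent, so the 1 GB/s figure and the headroom factor below are illustrative assumptions:

```python
# Sketch: sizing the parallel-file-system tier so data loading never
# starves the GPUs. The per-GPU ingest rate and headroom factor are
# assumptions; real values depend heavily on the model and data pipeline.

def required_throughput_gbs(num_gpus: int, gbs_per_gpu: float,
                            headroom: float = 1.5) -> float:
    """Sustained GB/s the storage tier must deliver, with burst headroom."""
    return num_gpus * gbs_per_gpu * headroom

need = required_throughput_gbs(1000, 1.0)
drives = need / 7.0  # each NVMe sustains ~7 GB/s (figure from the text)
print(f"Need ~{need:.0f} GB/s -> stripe across at least {drives:.0f} NVMe drives")
```

This is why parallel file systems stripe across hundreds of drives: no single controller or drive comes close to the aggregate rate a 1,000-GPU cluster demands, and checkpoint writes add bursty load on top of steady-state reads.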
The physical floor plan of a data center is an engineered airflow system. The dominant pattern is the hot aisle / cold aisle arrangement: racks face each other front-to-front (cold aisle, where cool air enters) and back-to-back (hot aisle, where exhaust exits). Cold air is supplied through perforated tiles in a raised floor plenum and hot exhaust is collected at ceiling level and returned to CRAC/CRAH units. At high GPU densities (>25 kW/rack) this approach fails — hot spots form, and air mixing degrades efficiency. Alternatives include in-row cooling (CRACs between rack rows), rear-door heat exchangers, and for the highest densities, direct liquid cooling that eliminates the air loop entirely. Modular designs — prefabricated data center modules (PDCMs) deployed as self-contained units — allow capacity to be added in 2–4 MW increments without full facility builds, dramatically reducing time-to-capacity from 24–36 months to 6–9 months. Site selection increasingly factors in proximity to renewable power, water availability (for cooling towers), seismic risk, and geopolitical stability.
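The ">25 kW/rack" air-cooling ceiling falls out of sensible-heat arithmetic: the airflow needed to carry heat away scales linearly with power. A sketch using the standard sea-level approximation (the 20 °F temperature rise is an assumed design point):

```python
# Sketch: airflow needed to remove rack heat, showing why air cooling fails
# at GPU densities. Uses the common sea-level approximation
# CFM ~= 3.16 * watts / delta_T_F (sensible heat only).

def airflow_cfm(heat_watts: float, delta_t_f: float) -> float:
    """Cubic feet per minute of air needed to absorb heat at a given temp rise."""
    return 3.16 * heat_watts / delta_t_f

for kw in (10, 25, 120):  # typical rack, air-cooling ceiling, GB200 NVL72 rack
    print(f"{kw:>4} kW @ 20F rise -> {airflow_cfm(kw * 1000, 20):,.0f} CFM")
```

Under these assumptions a 120 kW rack would need roughly 19,000 CFM — a gale through a single rack footprint — which is why direct liquid cooling, rear-door heat exchangers, or in-row units take over at the top of the density range.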
A 100 MW data center is a real-time physical system with tens of thousands of sensors generating continuous streams of temperature, power, humidity, and vibration data. DCIM (Data Center Infrastructure Management) software aggregates this telemetry into a unified operational picture — PUE dashboards, capacity planning heat maps, predictive maintenance alerts, and automated remediation workflows. The next frontier is AI-driven operations: DeepMind's reinforcement-learning control system reduced the energy Google spent on cooling by up to 40%; similar approaches from startups (Vigilent, nOps, Arcadia) target power optimization, workload placement, and failure prediction. Carbon-aware scheduling — shifting compute to times and locations with lower grid carbon intensity — can reduce Scope 2 emissions by 20–30% with no hardware change, using software alone. This is perhaps the highest-leverage, lowest-cost decarbonization lever in the data center stack and represents a significant investment opportunity where pure-software margins meet a regulatory tailwind.
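The core of carbon-aware scheduling is simple: given a carbon-intensity forecast, defer flexible work to the greenest window. A minimal sketch — the forecast values are made-up illustrations, and real systems would use a grid-data provider's API:

```python
# Sketch: carbon-aware scheduling as described above -- defer a flexible
# batch job to the hour with the lowest forecast grid carbon intensity.
# The forecast values (gCO2/kWh per hour) are fabricated for illustration.

def pick_greenest_hour(forecast: dict[int, float], window: range) -> int:
    """Return the hour in the window with the lowest carbon intensity."""
    return min(window, key=lambda h: forecast[h])

# Toy forecast: intensity dips midday as solar generation peaks.
forecast = {h: 450 - 200 * (10 <= h <= 16) for h in range(24)}
best = pick_greenest_hour(forecast, range(24))
print(f"Schedule job at hour {best} ({forecast[best]} gCO2/kWh)")
```

Production schedulers layer in deadlines, cross-region placement, and price signals, but the mechanism — and the reason it needs no hardware change — is exactly this: the same joules, drawn at a cleaner time.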