Infrastructure Intelligence · Vectors Capital
Data Center Architecture
Visual guide to every subsystem · investment signals per layer · 21 custom illustrations
$500B+ · 2025 Global DC Spend
40% · AI Workload Growth / yr
120kW · GB200 NVL72 Rack Power
3–5yr · Grid Queue Backlog
1.2–1.5 · Target PUE
99.995% · Tier IV Uptime SLA
⚡ Power Infrastructure

A modern hyperscale data center draws 10–200 MW of electricity — enough to power tens of thousands of homes. Utility power arrives at high voltage (typically 138 kV), is stepped down through on-site transformers, and then flows through a carefully engineered chain: UPS systems (online double-conversion, now predominantly lithium-ion) buffer the 10–15 seconds required for diesel generators to reach full speed. PDUs (Power Distribution Units) deliver metered, dual-feed power to every rack. The entire chain is designed around N+1 or 2N redundancy — no single point of failure. Power efficiency is measured by PUE (Power Usage Effectiveness); world-class facilities hit 1.1–1.2, meaning only 10–20% of power is lost to overhead. For investors, the acute grid interconnection backlog (3–5 years in most US markets) is driving demand for behind-the-meter generation: onsite solar, battery storage, hydrogen fuel cells, and microgrids are all active investment themes.
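The arithmetic behind those figures is worth internalizing. A minimal Python sketch (function and variable names are ours; all inputs come from the ranges above) of how PUE, the conversion chain, and the UPS bridge interact:

```python
# Back-of-envelope power math. All inputs come from the figures in
# this section; the function and variable names are ours.

def chain_efficiency(stages):
    """Compound per-stage efficiencies (transformer x UPS x PDU)."""
    eff = 1.0
    for stage in stages:
        eff *= stage
    return eff

# ~96% transformer x 96% double-conversion UPS x 98% PDU
eff = chain_efficiency([0.96, 0.96, 0.98])
print(f"Conversion chain efficiency: {eff:.1%}")          # ~90.3%

# PUE = total facility power / IT power, so the overhead share of the
# utility feed at a world-class PUE of 1.15 is:
pue = 1.15
print(f"Overhead share of feed: {1 - 1 / pue:.1%}")       # ~13%

# UPS bridge: energy the batteries must carry while generators start.
it_load_mw, bridge_s = 100.0, 15
bridge_kwh = it_load_mw * 1_000 * bridge_s / 3_600
print(f"Battery energy for a {bridge_s}s bridge at {it_load_mw:.0f} MW: "
      f"{bridge_kwh:,.0f} kWh")
```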

UPS Battery Bank
Online double-conversion · Li-ion · <2ms transfer
Diesel Backup Generator
3–5 MW · starts <15 sec · 72hr fuel tank
PDU & Power Delivery
Per-outlet metering · dual A+B feeds
Grid Voltage: 138 kV → 480 V, stepped down on site
UPS Type: Online double-conversion, Li-ion
UPS Bridging: 10–15 sec until generators start
Generator: 3–5 MW diesel, <15 sec start, 72 hr fuel
PDU: Per-outlet metering, dual A+B path
Efficiency: ~96% (transformer) × 96% (UPS) × 98% (PDU)
💡 Grid interconnection queues are 3–5 years in most US markets — behind-the-meter solar + BESS is the key investment unlock.
❄️ Cooling Systems

Cooling is the defining infrastructure challenge of the AI era. Traditional air cooling tops out at roughly 15–25 kW per rack. A rack of NVIDIA DGX H100 systems already draws ~82 kW, and GB200 NVL72 racks hit 120 kW — four to eight times what air can handle. The industry response is liquid cooling: CDUs (Coolant Distribution Units) circulate water or dielectric fluid through cold plates attached directly to GPUs, extracting heat at the source. At the facility level, cooling towers and chillers reject heat to the atmosphere, consuming 1–3 million gallons of water per day per 100 MW of compute. European regulations increasingly require waste heat reuse — piping excess heat to district heating networks. For investors, liquid cooling infrastructure (cold plates, CDUs, immersion tanks, manifolds) represents the highest-urgency capex category, with retrofit demand from existing air-cooled facilities stacking on top of new builds.
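To make the air-vs-liquid gap concrete, a back-of-envelope sketch using Q = ṁ·c·ΔT with the supply/return temperatures quoted in the spec table below (illustrative physics, not a design calculation):

```python
# Why liquid wins: heat removal Q = m_dot * c_p * dT for one 120 kW
# GB200 NVL72 rack, using the supply/return temperatures quoted in
# this section. Illustrative physics, not a design calculation.

Q = 120 * 1_000                           # rack heat load, watts

# Water loop: 20 C supply -> 44 C return (dT = 24 K)
C_P_WATER = 4186.0                        # J/(kg*K)
m_dot = Q / (C_P_WATER * 24)              # kg/s; ~1 kg ~= 1 L for water
print(f"Water flow: {m_dot * 60:.0f} L/min")              # ~72 L/min

# Air loop: 18 C supply -> 45 C return (dT = 27 K)
C_P_AIR, RHO_AIR = 1005.0, 1.2            # J/(kg*K), kg/m^3
vol = Q / (C_P_AIR * RHO_AIR * 27)        # m^3/s
print(f"Air flow: {vol:.1f} m^3/s (~{vol * 2119:.0f} CFM)")
# Roughly 7,800 CFM through a single rack: far beyond what hot/cold
# aisle containment can deliver, hence cold plates at the chip.
```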

Evaporative Cooling Tower
Final heat rejection · 1–3M gal/day at 100MW
Direct Liquid Cooling — CDU
GPU cold plates · handles 100kW+ racks
Hot/Cold Aisle Containment
18°C supply · 45°C return · raises efficiency 20%
Air Cooling Limit: ~25 kW/rack — exceeded by any AI GPU rack
H100 Rack Power: 82 kW/rack · GB200 NVL72 hits 120 kW
CDU Coolant: 20°C supply / 44°C return, water or dielectric
Chiller COP: 5.2 (5.2 W of cooling per 1 W of electricity)
Cooling Tower: 1,200 gal/min · evaporative · final heat sink
Water Use: 1–3M gallons/day per 100 MW facility
💡 Liquid/immersion cooling is the single most urgent infrastructure shift — air cooling cannot keep up with 100 kW+ AI GPU racks.
🔗 Network Fabric

Data center networking operates at two distinct layers. The front-end Ethernet fabric connects servers to the outside world using a spine-leaf topology: a small number of high-radix spine switches (64 ports × 400G = 25.6 Tb/s per switch) interconnect a larger tier of leaf (ToR) switches. This design provides predictable, low-latency paths and easy horizontal scaling. The back-end GPU fabric is entirely different — AI training requires all-to-all collective communication (AllReduce, AllGather) at extreme bandwidth. InfiniBand HDR/NDR (400–800 Gb/s, 600 ns latency, zero-copy RDMA) dominates today, while RoCE (RDMA over Converged Ethernet) provides a lower-cost alternative. The emerging frontier is silicon photonics and co-packaged optics (CPO): replacing copper links with optical interconnects directly integrated into switch ASICs, potentially reducing networking power by 5–10×. Lightmatter, Ayar Labs, and Intel Silicon Photonics are key companies to watch.
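A rough sizing sketch using the port counts quoted here (simplified two-tier math; real fabrics add pods and super-spines to reach 100k+ servers):

```python
# Sketch of two-tier spine-leaf sizing with the port counts quoted in
# this section; simplified (ignores breakouts and multi-pod designs).

# Front-end leaf (ToR): 48x25G down to servers, 2x100G up to spines.
down_gbps = 48 * 25                     # 1,200 Gb/s toward servers
up_gbps = 2 * 100                       # 200 Gb/s toward spines
print(f"ToR oversubscription: {down_gbps // up_gbps}:1")   # 6:1

# In a full-mesh two-tier Clos each leaf takes one uplink per spine,
# so a 64-port 400G spine layer caps one pod at 64 leaves.
spine_ports, servers_per_leaf = 64, 48
print(f"Max servers in one two-tier pod: {spine_ports * servers_per_leaf:,}")

# The back-end GPU fabric budgets bandwidth per GPU instead:
# 400 Gb/s InfiniBand per GPU, 8 GPUs per node.
gpus_per_node, nic_gbps = 8, 400
print(f"Per-node GPU fabric: {gpus_per_node * nic_gbps // 8} GB/s")
```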

Spine-Leaf Clos Topology
BGP/ECMP · non-blocking · scales to 100k+ servers
InfiniBand GPU Fabric
400Gb/s · 600ns latency · RDMA zero-copy
Top-of-Rack Switch
48×25G downlinks · 2×100G uplinks · 1 per rack
Topology: Spine-leaf Clos · non-blocking · BGP/ECMP
Spine Switch: 64-port 400G · 25.6 Tb/s per switch
GPU Interconnect: InfiniBand 400Gb/s · RDMA · 600ns latency
NVLink (intra-server): 900 GB/s per GPU across an 8× H100 node
Next Wave: Silicon photonics / CPO — 5–10× power reduction
Protocol: RoCEv2 or InfiniBand for GPU-to-GPU RDMA
💡 Silicon photonics (Lightmatter, Ayar Labs) is the deep tech bet — co-packaged optics cut I/O power 5–10× vs copper.
🖥️ Compute Hardware

Modern AI workloads have bifurcated the compute market. Training demands massive, tightly coupled clusters: NVIDIA's DGX H100 (8 GPUs, 640 GB HBM3, 900 GB/s NVLink per GPU, 10,200W) is today's gold standard, while the GB200 NVL72 (72 GPUs per rack, 120 kW) defines the next generation. Inference at scale favors efficient, purpose-built hardware: custom ASICs from Google (TPU), Meta (MTIA), and Microsoft (Maia) deliver far better performance-per-watt than general-purpose GPUs for specific model architectures. CPU servers remain essential for general-purpose compute, orchestration, and data preprocessing — a typical cluster runs roughly 1 GPU node per 4 CPU nodes. The capital investment is staggering: a single DGX H100 costs ~$400K, so a 1,000-GPU cluster approaches $50M in servers alone, before networking, power, and cooling. GPU utilization optimization (RunAI, CoreWeave's scheduling software) can cut effective compute cost by 40–60% by moving typical utilization from 30% to 70%+ — a software-defined capital efficiency play with no hardware cost.
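The utilization claim is simple arithmetic. A sketch assuming the ~$400K DGX price above and a 3-year amortization window (the amortization period is our assumption):

```python
# Effective cost per useful GPU-hour vs. utilization, using the ~$400K
# DGX H100 price from this section; 3-year straight-line amortization
# is our assumption, and opex is ignored for simplicity.

DGX_PRICE_USD = 400_000
GPUS_PER_DGX = 8
HOURS_3YR = 3 * 365 * 24                     # 26,280 h

capex_per_gpu_hour = DGX_PRICE_USD / GPUS_PER_DGX / HOURS_3YR

for utilization in (0.30, 0.70):
    effective = capex_per_gpu_hour / utilization
    print(f"{utilization:.0%} utilization -> ${effective:.2f}/useful GPU-hour")

# 30% -> ~$6.34, 70% -> ~$2.72: a ~57% cut, squarely in the quoted
# 40-60% range, with zero additional hardware.
```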

NVIDIA DGX H100 — 8× GPU Server
10,200W · NVLink 4.0 · dense racks demand liquid cooling
CPU Server Rows
Dual-socket · 1U/2U · 80+ Titanium PSUs
Custom AI ASIC Tray
TPU / Trainium / Maia — 3–10× perf-per-watt vs GPU
H100 DGX: 8 GPUs · 10,200W · ~$400K/unit · NVLink 900 GB/s per GPU
GB200 NVL72: 72 GPUs · 120 kW/rack · requires immersion or DLC
Custom ASICs: Google TPU v5, AWS Trainium2, Microsoft Maia
GPU Utilization: Industry avg ~30% — RunAI/Volcano push to 70%+
DPU / SmartNIC: Offloads networking/security from CPU — NVIDIA BlueField
Form Factor: 1U/2U CPU · 4U–8U GPU trays · OCP designs
💡 GPU utilization optimization (RunAI, acquired by NVIDIA for ~$700M) — improving utilization from 30% to 70% more than halves effective compute cost.
💾 Storage Systems

AI storage requirements are defined by two competing demands: capacity (storing petabytes of training data, model checkpoints, and inference logs) and throughput (feeding GPU clusters fast enough that compute is never starved). Traditional SANs and NAS are insufficient — AI training at scale requires parallel file systems (Lustre, GPFS, BeeGFS, Weka) that stripe data across hundreds of NVMe drives and deliver aggregate throughput of hundreds of GB/s. For datasets and checkpoints, dense object storage (S3-compatible) at commodity cost sits behind the high-performance tier. The architecture is typically three-tier: ultra-fast NVMe cache for active training data → parallel file system for warm data → object store for cold data and long-term retention. VAST Data ($9B valuation) validated the market for unified high-performance AI storage. NVMe drives have replaced HDDs for performance tiers: a single NVMe drive delivers 7 GB/s vs. ~300 MB/s for spinning disk, at a cost per GB that keeps falling as 3D NAND scales.
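A quick sizing sketch of the hot tier and the per-tier economics, using the throughput and $/GB figures from the spec table below (illustrative only):

```python
import math

# Hot-tier sizing and per-tier economics from this section's figures:
# 7 GB/s per NVMe drive, a 5 TB/s parallel-file-system target, and the
# $/GB prices in the spec table below.

TARGET_TBPS = 5.0
NVME_GBPS = 7.0
drives = math.ceil(TARGET_TBPS * 1_000 / NVME_GBPS)
print(f"Drives to sustain {TARGET_TBPS} TB/s: {drives}")          # 715

# Cost per petabyte across the three tiers
for tier, usd_per_gb in (("NVMe", 0.10), ("HDD", 0.015), ("Object", 0.005)):
    print(f"{tier:>6}: ${usd_per_gb * 1e6:>9,.0f} per PB")
```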

HDD vs NVMe Side-by-Side
HDD: $0.015/GB · NVMe: $0.10/GB · ~350× lower latency
High-Density Storage Array
60× 20TB HDDs in 4U · 1.2 PB per shelf
Parallel File System (WekaIO/VAST)
Feeds GPU clusters at 5+ TB/s aggregate
NVMe SSD: 7 GB/s · 10–30µs · $0.10/GB — GPU scratch
HDD (NL-SAS): 300 MB/s · 5ms · $0.015/GB — bulk storage
Object Store: $0.005/GB — model checkpoints, datasets
Parallel FS: WekaIO / VAST Data — 5+ TB/s to GPU clusters
VAST Data: $9B valuation — validated AI storage market
CXL Memory Pooling: Next wave — disaggregated shared DRAM across servers
💡 VAST Data ($9B) proved hyperscale AI storage is a real market. CXL memory pooling is the next architectural shift.
🏗️ Floor Layout

The physical floor plan of a data center is an engineered airflow system. The dominant pattern is the hot aisle / cold aisle arrangement: racks face each other front-to-front (cold aisle, where cool air enters) and back-to-back (hot aisle, where exhaust exits). Cold air is supplied through perforated tiles in a raised floor plenum and hot exhaust is collected at ceiling level and returned to CRAC/CRAH units. At high GPU densities (>25 kW/rack) this approach fails — hot spots form, and air mixing degrades efficiency. Alternatives include in-row cooling (CRACs between rack rows), rear-door heat exchangers, and for the highest densities, direct liquid cooling that eliminates the air loop entirely. Modular designs — prefabricated data center modules (PDCMs) deployed as self-contained units — allow capacity to be added in 2–4 MW increments without full facility builds, dramatically reducing time-to-capacity from 24–36 months to 6–9 months. Site selection increasingly factors in proximity to renewable power, water availability (for cooling towers), seismic risk, and geopolitical stability.
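A rough floor-plan calculation under stated assumptions (the 2–4 MW module size comes from above; the 20-racks-per-row figure and the row/aisle counts are ours, for illustration):

```python
# Floor-plan arithmetic for a modular buildout. Module size comes from
# this section (2-4 MW); the 20-racks-per-row figure is our assumption.

FACILITY_MW = 100
MODULE_MW = 3
RACKS_PER_ROW = 20

modules = -(-FACILITY_MW // MODULE_MW)           # ceiling division
print(f"{FACILITY_MW} MW ~= {modules} x {MODULE_MW} MW modules")

for rack_kw in (15, 25, 120):                    # air floor, air limit, NVL72
    racks = MODULE_MW * 1_000 // rack_kw
    rows = -(-racks // RACKS_PER_ROW)
    print(f"{rack_kw:>4} kW/rack: {racks:>3} racks -> {rows} rows, "
          f"{rows + 1} alternating hot/cold aisles")
```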

Hot/Cold Aisle — Top View
3 rows · 2 aisles · overhead cable tray
Server Racks — Cold Aisle View
18°C intake · perforated floor tiles · containment curtains
Modular DC Container
250–500kW per module · 6-week deploy time
Layout: Hot/cold aisle alternating rows — standard design
Floor Type: 18" raised floor — perforated tiles supply cold air
Cold Aisle Temp: 18°C supply air from underfloor plenum
Hot Aisle Temp: 45°C return air → CRAC units → chilled again
Containment: Curtains/doors seal aisles — lowers PUE by 0.1–0.2
Cable Management: Overhead fiber trays + power busway on separate pathways
💡 Modular containerized DCs (Vertiv, Schneider) cut time-to-capacity from 24–36 months to 6–9 months, with individual modules deployable in ~6 weeks — critical for AI buildout speed.
📊 Operations & DCIM

A 100 MW data center is a real-time physical system with tens of thousands of sensors generating continuous streams of temperature, power, humidity, and vibration data. DCIM (Data Center Infrastructure Management) software aggregates this telemetry into a unified operational picture — PUE dashboards, capacity planning heat maps, predictive maintenance alerts, and automated remediation workflows. The next frontier is AI-driven operations: DeepMind's reinforcement learning system cut the energy used for cooling at Google's data centers by up to 40%; similar approaches from startups (Vigilant, nOps, Arcadia) target power optimization, workload placement, and failure prediction. Carbon-aware scheduling — shifting compute to times and locations with lower grid carbon intensity — can reduce Scope 2 emissions by 20–30% using software alone, with no hardware change. This is perhaps the highest-leverage, lowest-cost decarbonization lever in the data center stack and represents a significant investment opportunity where pure-software margins meet a regulatory tailwind.
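A toy sketch of carbon-aware scheduling: given an hourly grid-intensity forecast (the numbers here are hypothetical; production systems pull forecasts from providers such as WattTime or Electricity Maps), defer a batch job into the cleanest window:

```python
# Toy carbon-aware scheduler: shift a deferrable batch job into the
# lowest-carbon window of a grid-intensity forecast. Forecast values
# are made up for illustration.

# gCO2/kWh forecast for the next 8 hours (hypothetical)
forecast = [420, 410, 380, 210, 190, 230, 390, 430]
JOB_HOURS = 3                          # deferrable training job

def best_window(forecast, hours):
    """Return the start index minimizing average intensity over the job."""
    scores = [sum(forecast[i:i + hours])
              for i in range(len(forecast) - hours + 1)]
    return scores.index(min(scores))

start = best_window(forecast, JOB_HOURS)
naive = sum(forecast[:JOB_HOURS]) / JOB_HOURS            # run immediately
smart = sum(forecast[start:start + JOB_HOURS]) / JOB_HOURS
print(f"Run at t+{start}h: {smart:.0f} vs {naive:.0f} gCO2/kWh "
      f"({1 - smart / naive:.0%} lower)")
```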

DCIM Dashboard — Live Monitoring
10,000+ sensors · 1-sec polling · carbon-aware scheduling
Temperature Heatmap
Hot/cold aisle visible · predictive maintenance alerts
Power & UPS Monitoring
Per-outlet metering · battery health · ATS status
DCIM Platform: Schneider EcoStruxure / Vertiv Environet
Sensor Count: 10,000–50,000+ sensors · 1-second poll frequency
Carbon-Aware: Shift batch AI jobs to low-carbon grid windows
Scope 2 Savings: 20–30% reduction with scheduling software alone
Predictive Maint.: ML on sensor streams → catch failures before they happen
Uptime Target: Tier IV 99.995% = <26 min downtime/year
💡 Carbon-aware compute is Vectors' highest-fit opportunity — 20–30% Scope 2 reduction, zero hardware investment, direct fit with corporate decarbonization mandates.