Configuring NVLink on a VPS with Multi-GPU Setup

NVLink, NVIDIA’s high-bandwidth GPU interconnect, revolutionizes multi-GPU workloads by enabling rapid memory sharing and data transfer, far surpassing PCIe limitations. For data scientists, ML engineers, or HPC users, NVLink can accelerate tasks like deep learning model training or large-scale simulations. For example, you could use NVLink with dual A100 GPUs to train a large language model faster by pooling GPU memory. However, configuring NVLink in a virtualized environment is complex and requires specific hardware and setup. This guide explains NVLink, its limitations in a VPS, and how to configure it for optimal performance.

Limitations and Caveats

  • Not all VPS providers support NVLink setups.

  • NVLink only works on bare-metal VPS instances or dedicated GPU virtual machines with direct PCIe passthrough.

  • Containerized environments such as Docker provide no NVLink support of their own; the container must run directly on a host where both GPUs and their NVLink connection are already exposed.

What Is NVLink?

NVLink allows two or more compatible NVIDIA GPUs to:

  • Share memory across GPUs for large datasets

  • Exchange data at up to 600 GB/s of total bidirectional bandwidth (third-generation NVLink on the A100; consumer cards such as the RTX 3090 offer considerably less)

  • Perform faster multi-GPU training without CPU involvement

Supported on GPUs like:

  • NVIDIA A100, V100, RTX 3090, RTX A6000, and similar GPUs (note that the consumer RTX 40-series, including the 4090, dropped NVLink support)

  • Usually requires a physical NVLink bridge
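
If the NVIDIA driver is already installed on a machine, you can quickly check whether a GPU exposes NVLink links and what they support. This is a minimal sketch; the exact output depends on the GPU model and driver version.

# List the GPUs visible to the driver
nvidia-smi -L

# Show per-link NVLink capabilities (peer-to-peer, system memory access, etc.) for GPU 0
nvidia-smi nvlink --capabilities -i 0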

NVLink in VPS: Prerequisites

Before attempting to configure NVLink on a VPS, ensure the following:

Host Hardware

  • The physical server must have:

    • At least two NVLink-compatible GPUs

    • NVLink bridge(s) installed

    • BIOS and firmware that supports NVLink

  • Common compatible setups include dual A100s or dual RTX 3090s joined by an NVLink bridge.

VPS Configuration

  • The VPS must be provisioned on a GPU passthrough-enabled hypervisor, like:

    • KVM/QEMU with VFIO (PCI passthrough)

    • VMware ESXi with DirectPath I/O

    • Proxmox VE with GPU passthrough

⚠️ Note: NVLink does not work across virtualized devices unless both GPUs are passed through as full PCIe devices to the same VM.
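
Before provisioning, it is worth confirming on the host that the IOMMU (Intel VT-d or AMD-Vi) is active and noting the PCI addresses and vendor/device IDs of both GPUs. A minimal sketch of that check:

# Confirm the IOMMU is enabled on the host
dmesg | grep -iE 'dmar|iommu|amd-vi'

# Find the PCI addresses and vendor:device IDs of the NVIDIA GPUs
lspci -nn | grep -i nvidia

# List IOMMU groups; both GPUs must be passed through to the same VM
find /sys/kernel/iommu_groups/ -type l | sort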

Step-by-Step: How to Configure NVLink on a VPS

Step 1: Ensure Passthrough of GPUs

The host needs to pass both physical GPUs directly to your VPS.

For KVM/QEMU with VFIO:

# Example for assigning two GPUs to vfio-pci.
# Unbind each GPU from its current driver by writing its PCI address:
echo "0000:65:00.0" > /sys/bus/pci/devices/0000:65:00.0/driver/unbind
echo "0000:66:00.0" > /sys/bus/pci/devices/0000:66:00.0/driver/unbind
# Then register the GPU's vendor/device ID pair (from lspci -nn) with vfio-pci:
echo "vendor_id device_id" > /sys/bus/pci/drivers/vfio-pci/new_id

Then update the libvirt domain XML (or the QEMU command line) so that both GPUs are passed through to the same VM, as in the sketch below.
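
A minimal sketch of the libvirt side, assuming a guest named gpu-vps (a placeholder) and the example PCI address 0000:65:00.0 used above; repeat it with bus 0x66 for the second GPU:

# Define a <hostdev> entry for the first GPU and attach it persistently
cat > gpu0.xml <<'EOF'
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x65' slot='0x00' function='0x0'/>
  </source>
</hostdev>
EOF
virsh attach-device gpu-vps gpu0.xml --config

Alternatively, run virsh edit gpu-vps and add both <hostdev> blocks under <devices>.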

Step 2: Install NVIDIA Drivers

Inside the VPS (guest OS), install the latest NVIDIA driver:

sudo apt update
sudo apt install -y nvidia-driver-535

Reboot after installation.
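
After the reboot, confirm that the driver loaded and that both passed-through GPUs are visible inside the guest:

# Both GPUs should appear on the guest's PCI bus and in nvidia-smi
lspci | grep -i nvidia
nvidia-smi -L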

Step 3: Verify NVLink Topology

Once inside the guest OS:

nvidia-smi topo -m

You should see:

        GPU0    GPU1    CPU Affinity
GPU0     X      NV1     0-15
GPU1    NV1      X      0-15

Where NV1 means the two GPUs are connected by a single NVLink link (NV2, NV4, and so on indicate multiple bonded links). If you see PHB or SYS instead, traffic between the GPUs is going over PCIe rather than NVLink.
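
You can also inspect each individual link, which quickly shows whether all expected NVLink lanes are up and at what speed:

# Show the state and per-link speed of every NVLink link
nvidia-smi nvlink --status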

Step 4: Enable Peer-to-Peer Access (Optional but Recommended)

nvidia-smi topo -p2p r

The matrix should show OK between GPU0 and GPU1, meaning peer-to-peer reads are supported (repeat with w to check writes). Applications then enable peer access at runtime, for example via cudaDeviceEnablePeerAccess in CUDA.
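
For an end-to-end check, NVIDIA's cuda-samples repository includes a p2pBandwidthLatencyTest sample that measures actual GPU-to-GPU bandwidth. This is a sketch that assumes the CUDA toolkit and build tools are installed in the guest; the sample's location and build system vary between repository releases:

git clone https://github.com/NVIDIA/cuda-samples.git
cd "$(find cuda-samples -type d -name p2pBandwidthLatencyTest | head -n 1)"
make            # older releases build per-sample with make; newer ones use CMake
./p2pBandwidthLatencyTest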

Security Considerations

  • Isolated access: Ensure your VPS is not oversubscribed or co-hosted with others when using full GPU passthrough.

  • No shared memory leakage: NVLink creates a shared memory space—limit access to trusted environments.

  • Audit access to the /dev/nvidia* device nodes (see the sketch below).
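
A minimal sketch of such an audit follows; the gpu group name is an assumption, and the driver or udev typically recreates these nodes with default permissions, so persistent changes belong in a udev rule:

# Review who can currently open the GPU device nodes
ls -l /dev/nvidia*

# Example: restrict access to a dedicated group (the group name "gpu" is hypothetical)
sudo groupadd -f gpu
sudo chown root:gpu /dev/nvidia0 /dev/nvidia1 /dev/nvidiactl
sudo chmod 0660 /dev/nvidia0 /dev/nvidia1 /dev/nvidiactl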

Troubleshooting NVLink Issues

 

Symptom | Possible Cause | Fix
NVLink not shown in nvidia-smi | GPUs not bridged properly | Power off the host and reseat the physical NVLink bridge
Only one GPU visible | Passthrough misconfiguration | Check the VM XML/device passthrough settings
Peer-to-peer disabled | Driver mismatch or BIOS settings | Upgrade the driver and check the BIOS for NVLink support
Low bandwidth | NVLink lanes underutilized | Run nvidia-smi nvlink --status to verify link state
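
A few commands that help narrow these symptoms down (the guest name gpu-vps is a placeholder):

# Inside the guest: confirm both GPUs are present on the virtual PCI bus
lspci -nn | grep -i nvidia

# On the host: confirm both GPUs are bound to vfio-pci and attached to the VM
lspci -nnk -d 10de:
virsh dumpxml gpu-vps | grep -A 5 hostdev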

 

NVLink is a game-changer for GPU-intensive workloads, offering immense performance advantages when properly configured—even in virtual environments. With direct GPU passthrough and careful setup, you can harness the power of multi-GPU interconnects on a VPS, turning it into a high-performance computing node for demanding applications.