C++ / LibTorch implementation of NeRF. Given RGB images of a static scene with known camera intrinsics and per-image extrinsics, fits a SIREN-trunk radiance field with Fourier positional encoding and renders RGB and depth from arbitrary novel viewpoints.

Figure: rendered RGB (left) and depth (right).

Training-progress animation: the object rotates while the reconstruction sharpens from the noisy SIREN initialisation to the final render.

- Ray generation. For each pixel the camera-space direction `((x − W/2)/f, −(y − H/2)/f, −1)` is rotated into world space by the extrinsics; the camera origin is the translation column.
- Sampling. Three strategies are supported at train and eval time: `UNIFORM`, `STRATIFIED` (NeRF §5.2: partition the interval into bins and draw one jittered sample per bin), and `PROPOSAL` (hierarchical: a coarse pass defines a per-ray PDF from which extra points are importance-sampled via inverse transform, concentrating samples near surfaces). For fair comparisons, single-pass samplers use 192 depth samples per ray (`n_samples + n_importance`); the proposal path uses 64 coarse + 128 importance samples, then evaluates the full radiance field at all 192 merged depths. The coarse pass uses a lightweight density-only proposal head (`SirenNeRF::proposal_sigma`): Fourier-encoded points with a stop-gradient feed a small trunk + σ head (no view directions, no RGB), so guiding importance sampling is much cheaper than re-running the full model. An interlevel loss (Mip-NeRF 360) trains the coarse proposal histogram to upper-bound the final main-network weight distribution (different bin locations, as in the paper).
- Encoding + MLP. Inputs use NeRF-style Fourier positional encoding (L = 10 for xyz, L = 4 for view directions). See Architecture for layer counts.
- Volume render. Discretised quadrature of σ along the ray gives per-sample α and transmittance; the resulting weights composite RGB and depth, alpha-composited onto a configurable background.
- Loss. Pseudo-Huber photometric loss (δ = 0.1) plus the interlevel loss when training with `PROPOSAL`. Ray-batched training draws 8192 rays per step pooled across all training images (for `PROPOSAL` and `STRATIFIED`).

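
The stratified sampling and volume-render steps above can be sketched in plain C++ without LibTorch. This is a minimal illustration under stated assumptions, not the repository's code: `stratified_depths`, `composite`, `RayResult`, and `over_background` are hypothetical names, and closing the last sample interval at `t_far` is an assumption the text does not confirm.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Stratified sampling: split [t_near, t_far] into n_bins equal bins and
// draw one uniformly jittered depth inside each bin.
std::vector<float> stratified_depths(float t_near, float t_far, std::size_t n_bins,
                                     std::mt19937& rng) {
    assert(n_bins > 0 && t_far > t_near);
    std::uniform_real_distribution<float> jitter(0.0f, 1.0f);
    const float bin = (t_far - t_near) / static_cast<float>(n_bins);
    std::vector<float> t(n_bins);
    for (std::size_t i = 0; i < n_bins; ++i) {
        t[i] = t_near + (static_cast<float>(i) + jitter(rng)) * bin;
    }
    return t;
}

struct RayResult {
    std::vector<float> weights;  // per-sample compositing weights w_i
    float depth = 0.0f;          // expected depth, sum_i w_i * t_i
    float opacity = 0.0f;        // accumulated alpha, sum_i w_i
};

// Discretised quadrature along one ray:
//   alpha_i = 1 - exp(-sigma_i * delta_i)
//   T_i     = prod_{j<i} (1 - alpha_j)
//   w_i     = T_i * alpha_i
// Assumption: the last interval is closed at t_far.
RayResult composite(const std::vector<float>& t, const std::vector<float>& sigma,
                    float t_far) {
    assert(t.size() == sigma.size());
    RayResult out;
    out.weights.resize(t.size());
    float transmittance = 1.0f;
    for (std::size_t i = 0; i < t.size(); ++i) {
        const float next = (i + 1 < t.size()) ? t[i + 1] : t_far;
        const float alpha = 1.0f - std::exp(-sigma[i] * (next - t[i]));
        out.weights[i] = transmittance * alpha;
        out.depth += out.weights[i] * t[i];
        out.opacity += out.weights[i];
        transmittance *= 1.0f - alpha;
    }
    return out;
}

// Alpha-composite an accumulated colour channel onto a background value.
inline float over_background(float accumulated, float opacity, float background) {
    return accumulated + (1.0f - opacity) * background;
}
```

The same weights would multiply per-sample RGB to produce the colour channels; only depth and opacity are accumulated here to keep the sketch short.
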
Default widths and depths are set in `SirenNeRF` (`include/siren_nerf.h`,
`src/nerf.cpp`). Fourier positional encoding (L = 10 for xyz, L = 4 for view
directions) is applied before the SIREN trunks; view directions are encoded
with the same scheme as positions and feed only the RGB head.
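
As a rough illustration of the encoding described above, the sketch below maps a 3-vector to its sine and cosine features at `num_freqs` octaves. `fourier_encode` is a hypothetical name, and two details are assumptions the text does not state: the frequencies are `2^k · π`, and the raw input is not appended to the output.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// NeRF-style Fourier features for one 3-vector.
// Output layout per octave k: sin(f_k * xyz) then cos(f_k * xyz), f_k = 2^k * pi.
// Output length is 3 * 2 * num_freqs (60 for L = 10, 24 for L = 4).
std::vector<float> fourier_encode(const std::array<float, 3>& p, int num_freqs) {
    assert(num_freqs > 0);
    constexpr float kPi = 3.14159265358979323846f;
    std::vector<float> out;
    out.reserve(static_cast<std::size_t>(6 * num_freqs));
    float freq = kPi;  // 2^0 * pi
    for (int k = 0; k < num_freqs; ++k) {
        for (float c : p) out.push_back(std::sin(freq * c));
        for (float c : p) out.push_back(std::cos(freq * c));
        freq *= 2.0f;  // next octave
    }
    return out;
}
```

With these assumptions the position input to the main trunk has 60 features and the view input to the RGB head has 24.
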
| Component | Width | SIREN layers | Other layers | Output |
|---|---|---|---|---|
| Main trunk (`nerf_net_`) | 128 | 6 (1 input + 5 hidden) | — | features |
| Main σ head | 128 → 1 | — | linear + softplus | density |
| Main RGB head | 128 + view enc → 128 | 2 | linear + sigmoid | RGB |
| Proposal trunk (`prop_net_`) | 128 | 3 (1 input + 2 hidden) | — | features |
| Proposal σ head | 128 → 1 | — | linear + softplus | density |

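
For readers unfamiliar with the "SIREN layers" column, a single sine layer can be sketched as below. This is a generic illustration of the layer from Sitzmann et al., not this repository's implementation: `siren_layer` is a hypothetical name, and `omega0 = 30` is the paper's default, assumed here because the text does not give the value used.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One SIREN layer: y = sin(omega0 * (W x + b)).
// weight is indexed [out][in]; omega0 = 30 is an assumed default.
std::vector<float> siren_layer(const std::vector<std::vector<float>>& weight,
                               const std::vector<float>& bias,
                               const std::vector<float>& x,
                               float omega0 = 30.0f) {
    assert(weight.size() == bias.size());
    std::vector<float> y(weight.size());
    for (std::size_t o = 0; o < weight.size(); ++o) {
        assert(weight[o].size() == x.size());
        float pre = bias[o];  // affine pre-activation
        for (std::size_t i = 0; i < x.size(); ++i) pre += weight[o][i] * x[i];
        y[o] = std::sin(omega0 * pre);
    }
    return y;
}
```
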
Main network total: 6 SIREN layers in the position trunk (view-independent σ), plus 2 SIREN layers in the RGB branch after view concat (8 SIREN layers on the RGB path), with separate linear density/RGB heads. The proposal path uses xyz Fourier encoding only (stop-gradient).
Proposal network total: 3 SIREN layers + 1 linear σ head; no RGB, no view input. Trained only via the interlevel loss, not photometric loss.
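
The interlevel loss that trains the proposal network can be sketched for a single ray as follows. It follows the form given in the Mip-NeRF 360 paper (penalise main-network weight that exceeds the proposal mass overlapping the same interval); `interlevel_loss` and its argument names are hypothetical, the strict-overlap test is a simplification, and the stop-gradient on the main weights is omitted because this plain C++ version has no autograd.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// t, w:           main-network bin edges (n + 1) and weights (n)
// t_prop, w_prop: proposal bin edges (m + 1) and weights (m)
// loss = sum_i max(0, w_i - bound_i)^2 / (w_i + eps), where bound_i is the
// total proposal weight of all proposal bins that overlap main bin i.
float interlevel_loss(const std::vector<float>& t, const std::vector<float>& w,
                      const std::vector<float>& t_prop,
                      const std::vector<float>& w_prop, float eps = 1e-7f) {
    assert(t.size() == w.size() + 1);
    assert(t_prop.size() == w_prop.size() + 1);
    float loss = 0.0f;
    for (std::size_t i = 0; i < w.size(); ++i) {
        float bound = 0.0f;
        for (std::size_t j = 0; j < w_prop.size(); ++j) {
            const bool overlaps = t_prop[j] < t[i + 1] && t_prop[j + 1] > t[i];
            if (overlaps) bound += w_prop[j];
        }
        const float excess = std::max(0.0f, w[i] - bound);
        loss += excess * excess / (w[i] + eps);
    }
    return loss;
}
```

The loss is zero whenever the proposal histogram already upper-bounds the main weights, so it only pushes proposal mass toward intervals the main network considers important.
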
Training defaults (`src/main.cpp`): 160×160 images, `TrainSampler::PROPOSAL`,
64 coarse + 128 importance samples, 8192 rays/step, AdamW (lr = 5e-4,
weight decay = 1e-2), `warmup_iters = 0`, interlevel weight 1.0,
pseudo-Huber δ = 0.1, 10000 iterations.
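
For reference, one common form of the pseudo-Huber loss with the default δ = 0.1 is sketched below. The exact scaling used in the repository is an assumption; `pseudo_huber` and `photometric_loss` are hypothetical names, and the mean reduction over channels is also assumed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pseudo-Huber: delta^2 * (sqrt(1 + (r / delta)^2) - 1).
// Behaves like r^2 / 2 for |r| << delta and like delta * |r| for |r| >> delta.
float pseudo_huber(float residual, float delta = 0.1f) {
    const float r = residual / delta;
    return delta * delta * (std::sqrt(1.0f + r * r) - 1.0f);
}

// Mean pseudo-Huber loss over flattened predicted and target colour values.
float photometric_loss(const std::vector<float>& pred,
                       const std::vector<float>& target, float delta = 0.1f) {
    assert(pred.size() == target.size() && !pred.empty());
    float sum = 0.0f;
    for (std::size_t i = 0; i < pred.size(); ++i) {
        sum += pseudo_huber(pred[i] - target[i], delta);
    }
    return sum / static_cast<float>(pred.size());
}
```

With an interlevel weight of 1.0, the total objective under `PROPOSAL` would be this photometric term plus the unscaled interlevel loss.
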
Trained and evaluated on the NeRF synthetic lego and ship scenes
(held-out test split: every 8th view). Training uses 160×160 images,
10000 AdamW iterations (lr = 5e-4, weight decay = 1e-2), ray-batched
PROPOSAL sampling (64 coarse + 128 importance → 192 main-model evals/ray),
pseudo-Huber loss (δ = 0.1), on an NVIDIA RTX A6000. Metrics are averaged
over the 13 test views at the final evaluation step.
| Scene | PSNR ↑ | SSIM ↑ | Output dir |
|---|---|---|---|
| lego | 25.05 | 0.901 | output_lego_160_10k |
| ship | 25.03 | 0.767 | output_ship_160_10k |
At 10k iterations lego reaches 0.901 SSIM (+0.004 vs 6k @ 160) and 25.05 PSNR (+0.26 dB). Ship holds ~25.0 PSNR; SSIM is slightly below earlier 6k ship runs, which is common for this scene at this resolution.
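
To make the PSNR figures concrete, the sketch below computes PSNR for one image with values in [0, 1]. `psnr` is a hypothetical helper, and how the repository reduces per-view values to the table entries is an assumption not confirmed by the text.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// PSNR in dB for images scaled to [0, 1]: -10 * log10(MSE).
// Returns +infinity when the two images are identical.
double psnr(const std::vector<float>& pred, const std::vector<float>& target) {
    assert(pred.size() == target.size() && !pred.empty());
    double sse = 0.0;
    for (std::size_t i = 0; i < pred.size(); ++i) {
        const double d = static_cast<double>(pred[i]) - static_cast<double>(target[i]);
        sse += d * d;
    }
    const double mse = sse / static_cast<double>(pred.size());
    if (mse == 0.0) return std::numeric_limits<double>::infinity();
    return -10.0 * std::log10(mse);
}
```

Under this definition, a +0.26 dB gain on a single image corresponds to roughly 6 % lower MSE, since 10^(−0.026) ≈ 0.94.
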
When rendering with the hierarchical PROPOSAL strategy, the coarse pass
can reuse the full radiance field (Option A) or the cheap proposal head
(Option B). Timing one deterministic 120×120 test view, averaged over 10
runs on the RTX A6000 (lego, trained with proposal):
| Coarse pass | ms / view ↓ | vs Option A |
|---|---|---|
| full model (Option A) | 167 | — |
| proposal head (Option B) | 136 | −19 % |
Using the proposal head for the coarse pass cuts per-view render time by ~19 % (167 ms → 136 ms) because the coarse pass skips view-dependent RGB and runs a shallower density-only trunk (3 SIREN layers instead of 6). Total hierarchical render time still includes a full 192-sample fine pass; the win is making the coarse PDF step cheap enough to use at both train and eval time without a large quality penalty.
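
The coarse PDF step mentioned above, inverse-transform sampling from per-bin weights, can be sketched as follows. `sample_pdf` is a hypothetical name modelled on the helper in the original NeRF code; the `eps` padding and the piecewise-linear CDF are assumptions about details the text does not spell out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Draw depths from the piecewise-constant PDF defined by coarse weights.
// edges:   n + 1 bin edges along the ray
// weights: n non-negative coarse weights, one per bin
// u:       uniform samples in [0, 1), one per importance sample
std::vector<float> sample_pdf(const std::vector<float>& edges,
                              const std::vector<float>& weights,
                              const std::vector<float>& u, float eps = 1e-5f) {
    assert(!weights.empty() && edges.size() == weights.size() + 1);
    float total = 0.0f;
    for (float w : weights) total += w + eps;  // eps keeps empty bins reachable
    std::vector<float> cdf(edges.size(), 0.0f);
    for (std::size_t i = 0; i < weights.size(); ++i) {
        cdf[i + 1] = cdf[i] + (weights[i] + eps) / total;
    }
    cdf.back() = 1.0f;  // guard against accumulated rounding error
    std::vector<float> samples;
    samples.reserve(u.size());
    for (float ui : u) {
        // First CDF entry strictly greater than ui marks the bin's upper edge.
        auto it = std::upper_bound(cdf.begin(), cdf.end(), ui);
        std::size_t hi = static_cast<std::size_t>(it - cdf.begin());
        hi = std::clamp<std::size_t>(hi, 1, cdf.size() - 1);
        const std::size_t lo = hi - 1;
        const float denom = cdf[hi] - cdf[lo];
        const float frac = denom > 0.0f ? (ui - cdf[lo]) / denom : 0.0f;
        samples.push_back(edges[lo] + frac * (edges[hi] - edges[lo]));
    }
    return samples;
}
```

In the 64 + 128 configuration, the 128 depths returned here would be merged with the 64 coarse depths and sorted before the full model is evaluated at all 192.
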
Requires a C++20 compiler, CMake ≥ 3.20, LibTorch (with CUDA if GPU training is wanted), and nlohmann_json.

    mkdir build && cd build
    cmake ..
    make
    ./NeRF.cpp <data_path> <output_path>

After training, assemble README GIFs (orbit + training progress):

    bash make_gifs.sh output_lego_160_10k 30 10000 100 5 output/

`load_dataset` reads `<data_path>/transforms.json`: a shared
`camera_angle_x` (FoV) and a `transform_matrix` (4×4 camera-to-world)
per frame, plus the referenced image files. Scenes from the NeRF
synthetic dataset (e.g. lego, ship) work directly.

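
The two `transforms.json` fields map onto ray generation roughly as sketched below. This is an illustration, not the repository's loader: `focal_from_fov`, `pixel_ray`, `Ray`, `Vec3`, and `Mat4` are hypothetical names, the focal-length formula is the convention commonly used with the NeRF synthetic dataset (assumed here), and directions are left unnormalised because the text does not say whether they are normalised.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<float, 3>;
using Mat4 = std::array<std::array<float, 4>, 4>;  // row-major camera-to-world

struct Ray {
    Vec3 origin;
    Vec3 direction;
};

// Focal length in pixels from the horizontal field of view (radians).
float focal_from_fov(float camera_angle_x, int width) {
    assert(width > 0);
    return 0.5f * static_cast<float>(width) / std::tan(0.5f * camera_angle_x);
}

// Camera-space direction ((x - W/2)/f, -(y - H/2)/f, -1) rotated by the
// upper-left 3x3 of c2w; the ray origin is the translation column of c2w.
Ray pixel_ray(float x, float y, int width, int height, float focal, const Mat4& c2w) {
    const Vec3 d_cam = {(x - 0.5f * static_cast<float>(width)) / focal,
                        -(y - 0.5f * static_cast<float>(height)) / focal, -1.0f};
    Ray ray{};
    for (int r = 0; r < 3; ++r) {
        ray.direction[r] = c2w[r][0] * d_cam[0] + c2w[r][1] * d_cam[1] + c2w[r][2] * d_cam[2];
        ray.origin[r] = c2w[r][3];
    }
    return ray;
}
```

For a 160×160 image with a 90° field of view this gives a focal length of 80 pixels, and the centre pixel looks straight down the camera's −z axis.
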
- NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis (Mildenhall et al., ECCV 2020)
- Implicit Neural Representations with Periodic Activation Functions (Sitzmann et al., NeurIPS 2020)
- Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields (Barron et al., CVPR 2022)
- cNeRF by rafaelanderka


