Benchmark

1. Overview

This document presents a comprehensive performance benchmark for VideoDataset, a high-efficiency video decoding backend. VideoDataset is used by creating a custom dataset class that inherits from BaseVideoDataset, which provides the efficient video decoding. The benchmark analyzes this approach across multiple metrics to quantify its performance characteristics.
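As a rough illustration of this usage pattern, a custom dataset might look like the following minimal sketch. The import path, the constructor arguments (chosen to mirror the --repo-id and --local-dir flags used later in this document), and the delegation to super().__getitem__ are assumptions for illustration, not the library's documented API.

# Minimal sketch of the subclassing pattern described above. The import
# path, constructor arguments, and hooks are assumptions, not the
# library's documented API.
from videodataset import BaseVideoDataset  # assumed import path

class MyVideoDataset(BaseVideoDataset):
    def __getitem__(self, index):
        # BaseVideoDataset is assumed to perform the efficient video
        # decoding; a subclass would typically post-process the sample.
        sample = super().__getitem__(index)
        return sample

dataset = MyVideoDataset(
    repo_id="AgiBotWorldAdmin/videodataset-benchmark",      # mirrors --repo-id
    local_dir="./AgiBotWorldAdmin/videodataset-benchmark",  # mirrors --local-dir
)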

2. Prerequisites

2.1 Benchmark Environment

To ensure reproducible and fair results, all tests were conducted in the following fixed environment:

| Component | Specification |
| --- | --- |
| Hardware | CPU: Intel(R) Xeon(R) Platinum 8468<br>GPU: NVIDIA H100 SXM5 80GB<br>GPU count: 8 |
| Software | OS: Ubuntu 24.04.3 LTS<br>Python: 3.12.3<br>PyTorch: 2.7.0a0+79aa17489c.nv25.4<br>CUDA: 12.9<br>Driver version: 560.35.03 |

Note: The Docker image used for running the benchmark will be released later.

2.2 Video Transcoding Preparation

Since the H100 GPU cannot decode AV1 videos, all test videos were pre-transcoded to H.265 (HEVC) format using the following command:

ffmpeg -i input.mp4 -r 30 -c:v libx265 -crf 24 -g 8 -keyint_min 8 -sc_threshold 0 -vf "setpts=N/(30*TB)" -bf 0 -c:a copy output.mp4
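To transcode many videos, the same command can be wrapped in a short script. The sketch below is a convenience wrapper around the exact ffmpeg invocation above (the raw_videos/ and transcoded/ directory names are placeholders, and -y is added for non-interactive overwriting); it is not part of the benchmark repository.

# Batch-transcode every .mp4 under raw_videos/ with the ffmpeg settings
# above (H.265, 30 fps, GOP 8, no B-frames). Convenience sketch only;
# directory names are placeholders.
import subprocess
from pathlib import Path

def transcode(src: Path, dst: Path) -> None:
    dst.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-y", "-i", str(src),       # -y: overwrite without prompting
        "-r", "30", "-c:v", "libx265", "-crf", "24",
        "-g", "8", "-keyint_min", "8", "-sc_threshold", "0",
        "-vf", "setpts=N/(30*TB)", "-bf", "0",
        "-c:a", "copy", str(dst),
    ], check=True)

for src in Path("raw_videos").rglob("*.mp4"):
    transcode(src, Path("transcoded") / src.relative_to("raw_videos"))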

Note: The test data required to run the benchmark has been uploaded to Hugging Face: AgiBotWorldAdmin/videodataset-benchmark

3. Benchmark

3.1 Metrics

  • Video Decoding Throughput:

This metric measures the decoding capability of VideoDecoder, expressed in frames per second (FPS), representing the maximum theoretical throughput achievable by the hardware when isolated from dataset operations.

  • Single-GPU Random Access Dataset Throughput:

This metric evaluates the random-access throughput of BaseVideoDataset under multi-process loading on a single GPU. It tests how efficiently the dataset infrastructure can serve random samples.

  • DataLoader Throughput:

This measures the efficiency of PyTorch’s DataLoader with BaseVideoDataset across different num_workers configurations on a single GPU. It helps identify the optimal worker count for maximizing data loading performance and reveals bottlenecks in the data loading pipeline.

  • Multi-GPU Data Loading Throughput:

This metric evaluates how data loading performance scales across multiple GPUs. It is essential for understanding multi-GPU training efficiency and identifying potential scaling limitations.

Note: Since the videos are encoded with a GOP size of 8 (and no B-frames), decoding a randomly accessed frame requires decoding every frame from the preceding keyframe onward, so the decoder's expected workload is about 4 decoded frames per output frame. Therefore, when calculating throughput, the count of requested frames is multiplied by 4 to obtain the number of effectively decoded frames.
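In other words, reported throughput is derived as in the following short calculation (the names are illustrative):

# Effective throughput with the GOP-8 random-access overhead: each
# requested frame costs about 4 decoded frames on average.
GOP_OVERHEAD = 4

def effective_fps(requested_frames: int, elapsed_seconds: float) -> float:
    return requested_frames * GOP_OVERHEAD / elapsed_seconds

print(effective_fps(2000, 1.0))  # 2000 frames served in 1 s -> 8000.0 FPS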

3.2 Execution

3.2.1 Video Decoding Throughput

You can measure Video Decoding Throughput by running benchmarks/decoder_benchmark.py:

python benchmarks/decoder_benchmark.py --video-path AgiBotWorldAdmin/videodataset-benchmark/videos/observation.images.top_head/chunk-000/file-000.mp4 --num-processes 4

Parameters

| Parameter | Value | Description |
| --- | --- | --- |
| --video-path | AgiBotWorldAdmin/videodataset-benchmark/videos/observation.images.top_head/chunk-000/file-000.mp4 | Video file path |
| --max-steps | 1000 | Maximum iteration steps |
| --warmup-steps | 10 | Number of warmup steps before timing |
| --num-processes | 4 | Number of processes |
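The flags above suggest a warmup-then-measure loop replicated across processes. The scaffold below illustrates that methodology with a stubbed decode call; it is not the actual decoder_benchmark.py implementation, and the real VideoDecoder API is not shown.

# Illustrative timing scaffold: per-process warmup, timed steps, and
# throughput summed across processes. decode_step is a stub standing in
# for a real VideoDecoder call; the actual script may differ.
import time
import multiprocessing as mp

def decode_step(video_path: str) -> None:
    time.sleep(0.001)  # stub: replace with a real frame decode

def worker(video_path: str, warmup_steps: int, max_steps: int, out) -> None:
    for _ in range(warmup_steps):
        decode_step(video_path)            # warmup: not timed
    start = time.perf_counter()
    for _ in range(max_steps):
        decode_step(video_path)
    out.put(max_steps / (time.perf_counter() - start))

if __name__ == "__main__":
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=("file-000.mp4", 10, 1000, queue))
             for _ in range(4)]
    for p in procs:
        p.start()
    total = sum(queue.get() for _ in procs)  # drain queue before joining
    for p in procs:
        p.join()
    print(f"aggregate throughput: {total:.1f} steps/s")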

3.2.2 Single-GPU Random Access Dataset Throughput

You can measure this metric by running benchmarks/dataset_benchmark.py:

python benchmarks/dataset_benchmark.py --repo-id AgiBotWorldAdmin/videodataset-benchmark --num-processes 8

Parameters

| Parameter | Value | Description |
| --- | --- | --- |
| --repo-id | AgiBotWorldAdmin/videodataset-benchmark | Repo of the dataset |
| --local-dir | ./AgiBotWorldAdmin/videodataset-benchmark | Local dataset path |
| --warmup-steps | 10 | Number of warmup steps before timing |
| --max-steps | 1000 | Maximum iteration steps |
| --num-processes | 4 | Number of processes |
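Conceptually, each worker process times random __getitem__ calls, as in the single-process sketch below; dataset construction is assumed as in the Overview sketch, and the actual dataset_benchmark.py may differ.

# Single-process random-access timing loop; the script above distributes
# this pattern over --num-processes workers. `dataset` can be any
# indexable dataset (e.g., a BaseVideoDataset subclass).
import random
import time

def time_random_access(dataset, warmup_steps: int = 10, max_steps: int = 1000) -> float:
    indices = [random.randrange(len(dataset)) for _ in range(warmup_steps + max_steps)]
    for i in indices[:warmup_steps]:
        dataset[i]                         # warmup: not timed
    start = time.perf_counter()
    for i in indices[warmup_steps:]:
        dataset[i]
    return max_steps / (time.perf_counter() - start)  # samples per second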

3.2.3 DataLoader Throughput

You can measure this metric by running benchmarks/base_video_dataset.py:

python benchmarks/base_video_dataset.py --repo-id AgiBotWorldAdmin/videodataset-benchmark --num-workers 8 16 32

Parameters

| Parameter | Value | Description |
| --- | --- | --- |
| --repo-id | AgiBotWorldAdmin/videodataset-benchmark | Repo of the dataset |
| --local-dir | ./AgiBotWorldAdmin/videodataset-benchmark | Local dataset path |
| --num-workers | 8 | Number of data loading workers |
| --batch-size | 16 | Batch size for data loading |
| --warmup-steps | 10 | Number of warmup steps before timing |
| --max-steps | 1000 | Maximum iteration steps |
| --world-size | 1 | Total number of processes in distributed training |
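Conceptually, the measurement corresponds to a standard PyTorch DataLoader loop like the sketch below; dataset construction is assumed as in the Overview sketch, and the real base_video_dataset.py may differ.

# Conceptual DataLoader throughput loop with warmup. Dataset construction
# is assumed; the actual benchmark script may differ.
import time
from torch.utils.data import DataLoader

def measure_loader_throughput(dataset, num_workers: int, batch_size: int = 16,
                              warmup_steps: int = 10, max_steps: int = 1000) -> float:
    loader = DataLoader(dataset, batch_size=batch_size,
                        shuffle=True, num_workers=num_workers)
    it = iter(loader)
    for _ in range(warmup_steps):
        next(it)                           # warmup: not timed
    start = time.perf_counter()
    for _ in range(max_steps):
        next(it)
    elapsed = time.perf_counter() - start
    return max_steps * batch_size / elapsed  # samples per second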

3.2.4 Multi-GPU Data Loading Throughput

You can measure this metric by running benchmarks/base_video_dataset.py:

python benchmarks/base_video_dataset.py --repo-id AgiBotWorldAdmin/videodataset-benchmark --num-workers 8 --world-size 2

Parameters

| Parameter | Value | Description |
| --- | --- | --- |
| --repo-id | AgiBotWorldAdmin/videodataset-benchmark | Repo of the dataset |
| --local-dir | ./AgiBotWorldAdmin/videodataset-benchmark | Local dataset path |
| --num-workers | 8 | Number of data loading workers |
| --batch-size | 16 | Batch size for data loading |
| --warmup-steps | 10 | Number of warmup steps before timing |
| --max-steps | 1000 | Maximum iteration steps |
| --world-size | 1 | Total number of processes in distributed training |
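Conceptually, each of the --world-size ranks builds its loader over a disjoint shard of the dataset, for example with PyTorch's DistributedSampler, and total throughput is summed across ranks. The sketch below is illustrative; the actual script may shard differently.

# Illustrative per-rank loader using PyTorch's DistributedSampler so that
# each of the --world-size processes reads a disjoint shard.
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def make_rank_loader(dataset, rank: int, world_size: int,
                     num_workers: int = 8, batch_size: int = 16) -> DataLoader:
    sampler = DistributedSampler(dataset, num_replicas=world_size,
                                 rank=rank, shuffle=True)
    return DataLoader(dataset, batch_size=batch_size,
                      sampler=sampler, num_workers=num_workers)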

3.3 Results

Note: All of the following results were obtained with NVIDIA MPS (Multi-Process Service) enabled. Ensure MPS is enabled before executing the benchmark, typically by starting the daemon with nvidia-cuda-mps-control -d.

3.3.1 Video Decoding Throughput

We ran the benchmark with the following parameters, varying --num-processes (8, 16, 32) across runs:

python benchmarks/decoder_benchmark.py \
    --video-path AgiBotWorldAdmin/videodataset-benchmark/videos/observation.images.top_head/chunk-000/file-000.mp4 \
    --num-processes 8 \
    --warmup-steps 10 \
    --max-steps 1000

The following table shows the results:

| num-processes | Throughput (FPS) | GPU Video Decoder Utilization |
| --- | --- | --- |
| 8 | 8249.6676 | ≥30% |
| 16 | 15285.96 | ≥60% |
| 32 | 22070.7748 | ≥90% |

3.3.2 Single-GPU Random Access Dataset Throughput

We ran the benchmark with the following parameters, varying --num-processes (8, 16, 32) across runs:

python benchmarks/dataset_benchmark.py \
    --repo-id AgiBotWorldAdmin/videodataset-benchmark \
    --num-processes 8 \
    --warmup-steps 10 \
    --max-steps 1000

The following table shows the results:

| num-processes | Throughput (FPS) | GPU Video Decoder Utilization |
| --- | --- | --- |
| 8 | 8286.304 | ≥30% |
| 16 | 14999.516 | ≥60% |
| 32 | 22010.9956 | ≥85% |

3.3.3 DataLoader Throughput

We ran the benchmark with the following parameters, varying --num-workers (8, 16, 32) across runs:

python benchmarks/base_video_dataset.py \
    --repo-id AgiBotWorldAdmin/videodataset-benchmark \
    --num-workers 8 \
    --batch-size 16 \
    --warmup-steps 10 \
    --max-steps 1000 \
    --world-size 1

The following table shows the results:

| num-workers | Throughput (FPS) | GPU Video Decoder Utilization |
| --- | --- | --- |
| 8 | 8011.246 | ≥30% |
| 16 | 14798.5004 | ≥60% |
| 32 | 18447.408 | ≥80% |

3.3.4 Multi-GPU Data Loading Throughput

We ran the benchmark with the following parameters, varying --world-size (1, 2, 4, 8) across runs:

python benchmarks/base_video_dataset.py \
    --repo-id AgiBotWorldAdmin/videodataset-benchmark \
    --num-workers 8 \
    --batch-size 16 \
    --warmup-steps 10 \
    --max-steps 1000 \
    --world-size 1

The following table shows the results:

| world-size | Total throughput (FPS) | Per-GPU throughput (FPS) |
| --- | --- | --- |
| 1 | 8004.196 | 8004.196 |
| 2 | 14232.9596 | 7116.4796 |
| 4 | 25621.792 | 6405.448 |
| 8 | 42172.896 | 5271.612 |
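Per-GPU throughput degrades gradually as world-size grows: at 8 GPUs, 5271.612 / 8004.196 ≈ 0.66, i.e. roughly 66% per-GPU scaling efficiency relative to the single-GPU baseline, while total throughput still reaches about 5.3× the single-GPU figure (42172.896 / 8004.196 ≈ 5.27).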