VPD-100K: Towards Generalizable and Fine-grained Visual Privacy Protection

1National University of Singapore, 2The Australian National University, 3The University of New South Wales, 4New York University (*Corresponding authors)
(a) VPD-100K Fine-grained Taxonomy
(b) Visual Performance of our Proposed Detector

Abstract

Privacy protection has become a critical requirement in the era of ubiquitous visual data sharing, placing greater demands on efficient and robust privacy detection algorithms. However, the development of robust detection models is severely hindered by the lack of comprehensive datasets. Existing privacy-oriented datasets often suffer from limited scale, coarse-grained annotations, and narrow domain coverage, and so fail to capture the intricate details of sensitive information in real-world environments.

To bridge this gap, we present a large-scale, fine-grained Visual Privacy Dataset (VPD-100K), designed to facilitate generalized privacy detection. We establish a holistic taxonomy comprising four primary domains: Human Presence, On-Screen Personally Identifiable Information (PII), Physical Identifiers, and Location Indicators. Furthermore, we design a lightweight frequency-enhanced module, consisting of frequency-domain attention fusion and an adaptive spectral gating mechanism, that moves beyond spatial pixel intensity to better capture the subtle details of sensitive information. Extensive experiments on both image and streaming video benchmarks consistently demonstrate the effectiveness of the VPD-100K dataset and the proposed frequency mechanism.

🎯 The VPD-100K Dataset

Existing datasets overlook critical leakage sources like on-screen PII or suffer from coarse categorization (e.g., just "person" or "text"). VPD-100K explicitly addresses these data gaps by covering the full spectrum of privacy risks in unconstrained, complex streaming environments.

🚀 Massive Scale & Quality

100,000 images with over 190,000 object instances. Over half of the images exceed 1080p resolution to support tiny text recognition.

🔍 Fine-Grained Taxonomy

Annotated with 33 fine-grained classes across 4 primary domains: Human Presence (faces categorized by age and environment), On-Screen PII (passwords, chat logs, accounts), Physical Identifiers (passports, bank cards, tickets), and Location Indicators (street, store, and community signs).
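As a rough illustration of how such a two-level taxonomy might be represented in code, the sketch below maps each primary domain to a few fine-grained classes. The four domain names come from the description above; the class identifiers and the `domain_of` helper are hypothetical and do not reproduce the official 33 labels.

```python
# Illustrative subset only: the four domain names come from the text above,
# but the class identifiers are hypothetical, not the official 33 labels.
TAXONOMY = {
    "Human Presence": ["face_adult_indoor", "face_child_outdoor"],
    "On-Screen PII": ["password", "chat_log", "account"],
    "Physical Identifiers": ["passport", "bank_card", "ticket"],
    "Location Indicators": ["street_sign", "store_sign", "community_sign"],
}

# Reverse lookup from a fine-grained class to its primary domain.
CLASS_TO_DOMAIN = {
    cls: domain for domain, classes in TAXONOMY.items() for cls in classes
}


def domain_of(class_name: str) -> str:
    """Return the primary domain for a fine-grained class label."""
    return CLASS_TO_DOMAIN[class_name]
```

A reverse lookup of this kind makes it straightforward to report metrics per domain as well as per class.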

🛡️ Ethical Scenario Reconstruction

Scenarios involving digital privacy risks (e.g., banking interfaces) are reconstructed in simulated high-fidelity environments, so no real user data is compromised.


Figure 3: Class frequency distribution sorted by frequency. A square root scale is applied to ensure visual readability, accounting for the inherent long-tail characteristic of such datasets.
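To see why a square root scale helps with a long-tailed distribution, consider the small sketch below. The per-class counts are hypothetical values chosen only to mimic a long tail; they are not statistics from VPD-100K.

```python
import math

# Hypothetical per-class instance counts, chosen only to mimic a long tail.
counts = {"face": 40000, "street_sign": 9000, "passport": 400, "password": 25}

# Sort classes by descending frequency, as in the figure.
ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Square-root scaling compresses the head of the distribution more than the tail.
scaled = [(name, math.sqrt(n)) for name, n in ordered]

# Ratio between the most and least frequent class, before and after scaling.
raw_ratio = ordered[0][1] / ordered[-1][1]
scaled_ratio = scaled[0][1] / scaled[-1][1]
```

With these example counts the largest bar is 1600 times the smallest on a linear axis but only 40 times on a square root axis, so rare classes remain visible.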

⚙️ Frequency-Enhanced Mechanism

Traditional spatial detectors often struggle with "camouflaged" or tiny sensitive content (e.g., verification codes that occupy less than 10% of the image). To address this, we extend the YOLO architecture into a spatial-frequency dual-stream model. Our module moves beyond spatial pixel intensity by operating in the frequency domain, remapping features to better capture subtle high-frequency details.
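The idea of emphasizing high-frequency detail can be illustrated with a minimal NumPy sketch. This is not the authors' module: a fixed radial mask stands in for the learned components, and the function name `high_frequency_boost` and its `cutoff` and `gain` parameters are assumptions made for illustration.

```python
import numpy as np


def high_frequency_boost(feature: np.ndarray, cutoff: float = 0.25,
                         gain: float = 2.0) -> np.ndarray:
    """Amplify spatial frequencies above `cutoff` (cycles/pixel) by `gain`.

    A fixed radial mask stands in for learned attention and gating;
    `cutoff` and `gain` are illustrative, not values from the paper.
    """
    h, w = feature.shape
    spectrum = np.fft.fft2(feature)
    # Per-axis frequencies in cycles per pixel, laid out to match fft2 output.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    mask = np.where(radius > cutoff, gain, 1.0)
    # The mask depends only on |frequency|, so a real input yields a real
    # output up to floating-point error.
    return np.fft.ifft2(spectrum * mask).real
```

Under this sketch a constant feature map passes through unchanged, while a pixel-level checkerboard (the highest representable frequency) is scaled by `gain`. Fine structures such as small text edges sit closer to the second case.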


Figure 4: The YOLOv10 framework incorporating the frequency domain module within the Neck architecture. It leverages Frequency-Domain Attention Fusion (FDAF) and an Adaptive Spectral Gating Mechanism.

  • Frequency-Domain Attention Fusion (FDAF): Applies the Discrete Fourier Transform (DFT) to map features into the frequency domain, explicitly amplifying high-frequency boundary signals critical for text and structural shapes.
  • Adaptive Spectral Gating Mechanism: Learns dynamic band-pass filters to suppress background noise while maintaining the activation of privacy-sensitive textures.
  • Frequency-Consistency Loss: A weighted Euclidean distance constraint that enforces feature alignment between predicted and ground-truth regions directly in the spectral domain.
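The frequency-consistency idea in the last item can be sketched as a weighted Euclidean distance between the 2-D spectra of two feature maps. This is a minimal sketch under assumptions: the function name, the per-frequency `weights` argument, and the unitary normalization are illustrative choices, since the exact weighting scheme is not specified here.

```python
import numpy as np


def frequency_consistency_loss(pred: np.ndarray, target: np.ndarray,
                               weights=None) -> float:
    """Weighted Euclidean distance between the 2-D spectra of two maps.

    `weights` is a per-frequency weighting with the same shape as the inputs;
    uniform weights are used when it is omitted. The weighting actually used
    by the authors is not given here, so this choice is an assumption.
    """
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    # norm="ortho" makes the transform unitary, so with uniform weights the
    # spectral distance equals the spatial one (Parseval's theorem).
    diff = np.fft.fft2(pred, norm="ortho") - np.fft.fft2(target, norm="ortho")
    if weights is None:
        weights = np.ones(pred.shape)
    return float(np.sqrt(np.sum(weights * np.abs(diff) ** 2)))
```

With uniform weights this reduces to the ordinary L2 distance between the two maps. Passing larger weights for high-frequency bins would penalize mismatches in edges and fine texture more heavily than mismatches in smooth regions.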

📊 Benchmark Experiments

We evaluate our framework on both image and live streaming video benchmarks. Standard baseline models are fine-tuned on our dataset to ensure generalized visual privacy protection capabilities. Adding our proposed Frequency-Enhanced Mechanism (FEM) to YOLOv10 consistently improves boundary localization and small-object detection across both modalities: with YOLOv10L, AP rises from 53.8 to 58.6 on images and from 52.9 to 57.7 on video.

Table 1: Quantitative results of baseline approaches on the VPD-100K image test dataset.

| Baselines | APV | AP50V | AP75V | APSV | APMV | APLV | GFLOPs | Latency (ms) | F1-Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grounding-DINO | 48.1 | 65.8 | 62.6 | 30.4 | 51.3 | 62.3 | 464.0 | 119.5 | 0.68 |
| FBRT-YOLO | 20.2 | 45.8 | 42.2 | 28.1 | 46.3 | 56.2 | 22.9 | 3.72 | 0.43 |
| DEIM-D-FINE-S | 49.0 | 65.9 | 53.1 | 30.4 | 52.6 | 65.7 | 26.0 | 3.46 | 0.69 |
| Gold-YOLO-S | 46.4 | 63.4 | 52.2 | 25.3 | 51.3 | 63.6 | 46.0 | 3.82 | 0.66 |
| Gold-YOLO-L | 52.7 | 70.1 | 58.0 | 32.1 | 57.0 | 70.1 | 153.8 | 10.91 | 0.74 |
| YOLOv7-tiny | 38.3 | 46.6 | 45.3 | 19.1 | 44.2 | 54.1 | 12.6 | 5.43 | 0.55 |
| YOLOv7 | 51.1 | 50.1 | 59.1 | 32.6 | 58.1 | 68.0 | 99.7 | 7.12 | 0.71 |
| YOLOv8s | 44.3 | 60.5 | 51.3 | 24.1 | 50.8 | 59.8 | 26.0 | 7.13 | 0.64 |
| YOLOv8L | 52.6 | 68.3 | 59.1 | 32.6 | 58.5 | 67.3 | 152.0 | 14.76 | 0.72 |
| YOLOv9s | 46.0 | 62.3 | 50.8 | 25.6 | 53.0 | 62.5 | 24.0 | 2.73 | 0.67 |
| YOLOv9L | 53.4 | 68.6 | 57.9 | 33.9 | 59.1 | 70.3 | 124.0 | 7.73 | 0.73 |
| YOLOv10s | 46.3 | 62.7 | 51.3 | 26.1 | 53.2 | 62.7 | 23.0 | 2.53 | 0.65 |
| YOLOv10L | 53.8 | 69.6 | 58.4 | 33.6 | 59.8 | 70.8 | 121.0 | 7.42 | 0.73 |
| YOLOv10s+our FEM | 52.1 | 67.1 | 54.6 | 30.1 | 55.6 | 64.3 | 26.0 | 2.71 | 0.71 |
| YOLOv10L+our FEM | 58.6 | 73.4 | 61.3 | 36.5 | 62.3 | 70.6 | 132.0 | 7.51 | 0.81 |
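The accuracy and cost trade-off of adding FEM can be read directly off the YOLOv10 rows of Table 1. The short sketch below computes the differences; the numbers are copied from the table (overall AP, the fourth AP column, GFLOPs, latency), while the dictionary layout and the `fem_delta` helper are illustrative.

```python
# (APV, APSV, GFLOPs, latency in ms) copied from the YOLOv10 rows of Table 1.
table1 = {
    "YOLOv10s": (46.3, 26.1, 23.0, 2.53),
    "YOLOv10s+FEM": (52.1, 30.1, 26.0, 2.71),
    "YOLOv10L": (53.8, 33.6, 121.0, 7.42),
    "YOLOv10L+FEM": (58.6, 36.5, 132.0, 7.51),
}


def fem_delta(base: str) -> tuple:
    """Absolute change in each metric when FEM is added to `base`."""
    without, with_fem = table1[base], table1[base + "+FEM"]
    return tuple(round(f - b, 2) for f, b in zip(with_fem, without))


small = fem_delta("YOLOv10s")  # (5.8, 4.0, 3.0, 0.18)
large = fem_delta("YOLOv10L")  # (4.8, 2.9, 11.0, 0.09)
```

On the image benchmark, FEM therefore adds 5.8 AP to YOLOv10s for 3 extra GFLOPs and 0.18 ms of latency, and 4.8 AP to YOLOv10L for 11 extra GFLOPs and 0.09 ms.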


Table 2: Quantitative results of real-time baseline approaches on the VPD-100K live streaming video dataset.

| Baselines | APV | AP50V | AP75V | APSV | APMV | APLV |
| --- | --- | --- | --- | --- | --- | --- |
| FBRT-YOLO | 20.2 | 45.8 | 42.2 | 28.1 | 46.3 | 56.2 |
| DEIM-D-FINE-S | 49.1 | 66.1 | 53.5 | 30.5 | 52.8 | 65.9 |
| Gold-YOLO-S | 46.4 | 63.4 | 52.2 | 25.3 | 51.3 | 63.6 |
| Gold-YOLO-L | 52.5 | 69.7 | 57.7 | 31.8 | 56.9 | 69.8 |
| YOLOv7-tiny | 38.0 | 45.7 | 45.3 | 18.7 | 43.8 | 53.9 |
| YOLOv7 | 50.9 | 49.7 | 58.8 | 31.4 | 57.6 | 67.8 |
| YOLOv8s | 44.3 | 60.5 | 51.3 | 24.1 | 50.8 | 59.8 |
| YOLOv8L | 52.5 | 67.5 | 58.7 | 32.0 | 58.5 | 67.3 |
| YOLOv9s | 46.0 | 62.3 | 50.8 | 25.6 | 53.0 | 62.5 |
| YOLOv9L | 53.4 | 68.6 | 57.9 | 32.8 | 59.1 | 70.3 |
| YOLOv10s | 45.6 | 61.9 | 49.9 | 25.4 | 52.4 | 61.9 |
| YOLOv10L | 52.9 | 68.2 | 57.6 | 32.8 | 59.1 | 69.7 |
| YOLOv10s+our FEM | 51.8 | 66.4 | 53.9 | 28.9 | 55.2 | 63.8 |
| YOLOv10L+our FEM | 57.7 | 72.8 | 60.4 | 36.1 | 61.9 | 70.0 |


BibTeX

@inproceedings{vpd100k_2026,
  title     = {VPD-100K: Towards Generalizable and Fine-grained Visual Privacy Protection},
  author    = {[Author 1] and [Author 2] and [Author 3] and [Author 4] and others},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2026}
}