VPD-100K: Towards Generalizable and Fine-grained Visual Privacy Protection

1[Institution Name 1] 2[Institution Name 2] 3[Institution Name 3] 4[Institution Name 4]

Abstract

Privacy protection has become a critical requirement in the era of ubiquitous visual data sharing, placing growing demands on efficient and robust privacy detection algorithms. However, the development of robust detection models is severely hindered by the lack of comprehensive datasets. Existing privacy-oriented datasets often suffer from limited scale, coarse-grained annotations, and narrow domain coverage, and thus fail to capture the intricate details of sensitive information in real-world environments.

To bridge this gap, we present a large-scale, fine-grained Visual Privacy Dataset (VPD-100K), designed to facilitate generalizable privacy detection. We establish a holistic taxonomy comprising four primary domains: Human Presence, On-Screen Personally Identifiable Information (PII), Physical Identifiers, and Location Indicators. Furthermore, we design an effective, lightweight frequency-enhanced module, consisting of frequency-domain attention fusion and an adaptive spectral gating mechanism, that moves beyond raw spatial pixel intensities to better capture the subtle details of sensitive information. Extensive experiments on both diverse image and streaming-video benchmarks consistently demonstrate the effectiveness of the VPD-100K dataset and the proposed frequency mechanism.

🎯 The VPD-100K Dataset

Existing datasets overlook critical leakage sources like on-screen PII or suffer from coarse categorization (e.g., just "person" or "text"). VPD-100K explicitly addresses these data gaps by covering the full spectrum of privacy risks in unconstrained, complex streaming environments.

🚀 Massive Scale & Quality

100,000 images with over 190,000 object instances. Over half of the dataset exceeds 1080p resolution to support tiny text recognition.

🔍 Fine-Grained Taxonomy

Annotated with 33 fine-grained classes across 4 primary domains:

  • Human Presence: Faces categorized by age and environment.
  • On-Screen PII: Passwords, chat logs, accounts, etc.
  • Physical Identifiers: Passports, bank cards, tickets.
  • Location Indicators: Street, store, and community signs.

🛡️ Ethical Scenario Reconstruction

Digital privacy risks (e.g., banking interfaces) are generated in high-fidelity simulated environments, so no real user data is ever compromised.

Taxonomy Sunburst Chart of VPD-100K

Figure 1: The overview of our taxonomy. Our dataset spans 4 broad domains and breaks down into 33 precise, fine-grained categories to cover diverse privacy leakage scenarios.

⚙️ Frequency-Enhanced Mechanism

Traditional spatial-domain detectors often struggle with camouflaged or tiny sensitive content (e.g., verification codes that occupy less than 10% of the image). To address this, we extend the YOLO architecture into a spatial-frequency dual-stream model. Our module moves beyond raw spatial pixel intensities by operating in the frequency domain, remapping features to better capture subtle high-frequency details.
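Why the frequency domain helps can be seen with a short PyTorch snippet (illustrative only, not part of the paper's pipeline): a single 2×2 average pool, the kind of operation spatial backbones apply repeatedly, erases a maximally high-frequency checkerboard entirely, while in the DFT spectrum the same pattern remains a single sharp, easily amplified peak.

```python
# Illustrative sketch (not from the paper): spatial pooling erases
# high-frequency detail that the frequency domain keeps localized.
import torch
import torch.nn.functional as F

idx = torch.arange(32)
# 32x32 checkerboard: the highest-frequency pattern this grid can hold.
board = ((idx[:, None] + idx[None, :]) % 2).float()

# One 2x2 average pool flattens it to a constant 0.5 image.
pooled = F.avg_pool2d(board[None, None], kernel_size=2)
print(pooled.std().item())  # 0.0 -- the pattern is gone

# In the DFT magnitude spectrum, the same pattern is one sharp peak
# at the Nyquist bin (16, 16), as strong as the DC term.
spec = torch.fft.fft2(board).abs()
print(spec[16, 16].item())
```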

Architecture of the Frequency-Enhanced Mechanism

Figure 2: The YOLOv10 framework incorporating the frequency domain module within the Neck architecture. It leverages Frequency-Domain Attention Fusion (FDAF) and an Adaptive Spectral Gating Mechanism.

  • Frequency-Domain Attention Fusion (FDAF): Applies the Discrete Fourier Transform (DFT) to feature maps, explicitly amplifying the high-frequency boundary signals critical for text and structural shapes.
  • Adaptive Spectral Gating Mechanism: Learns dynamic band-pass filters to suppress background noise while maintaining the activation of privacy-sensitive textures.
  • Frequency-Consistency Loss: A weighted Euclidean distance constraint that enforces feature alignment between predicted and ground-truth regions directly in the spectral domain.
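As a rough sketch of how such components could be wired up (the module name, gate design, and uniform loss weighting below are our illustrative assumptions, not the authors' released implementation):

```python
# Hypothetical sketch of a spectral gating block and a frequency-
# consistency loss; names and design details are illustrative, not
# the paper's actual implementation.
import torch
import torch.nn as nn


class AdaptiveSpectralGate(nn.Module):
    """Gates the 2-D DFT spectrum of a feature map with an
    input-dependent, per-frequency mask, then maps back to space."""

    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 conv on the magnitude spectrum predicts the gate.
        self.to_gate = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        spec = torch.fft.rfft2(x, norm="ortho")         # complex (B, C, H, W//2+1)
        gate = torch.sigmoid(self.to_gate(spec.abs()))  # learned band-pass mask
        out = torch.fft.irfft2(spec * gate, s=x.shape[-2:], norm="ortho")
        return x + out  # residual keeps the original spatial stream intact


def frequency_consistency_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Euclidean distance between the magnitude spectra of predicted
    and ground-truth regions (uniform weights here, for brevity)."""
    diff = torch.fft.rfft2(pred).abs() - torch.fft.rfft2(gt).abs()
    return diff.pow(2).mean()
```

Given a (B, C, H, W) feature map, the gate leaves shapes unchanged, so a block of this form can in principle be dropped into a neck stage; the loss is zero exactly when the two regions' spectra match.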

📊 Benchmark Experiments

We comprehensively evaluated our framework against 14 competitive baselines on the VPD-100K image test set, with all models fine-tuned on VPD-100K for generalizable privacy detection. As shown in the table below, our Frequency-Enhanced Mechanism (FEM) integrated with YOLOv10L achieves state-of-the-art performance across the board.

Specifically, YOLOv10L + our FEM achieves the highest Average Precision (58.6 AP), a relative gain of 8.9% over the strongest standard baseline, YOLOv10L (53.8 AP). Crucially, it also delivers robust detection of small objects (36.5 APS) and the best F1-Score of 0.81, all at a real-time latency of just 7.51 ms.

| Baselines | AP<sup>V</sup> | AP50<sup>V</sup> | AP75<sup>V</sup> | APS<sup>V</sup> | APM<sup>V</sup> | APL<sup>V</sup> | GFLOPs | Latency (ms) | F1-Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grounding-DINO | 48.1 | 65.8 | **62.6** | 30.4 | 51.3 | 62.3 | 464.0 | 119.5 | 0.68 |
| FBRT-YOLO | 20.2 | 45.8 | 42.2 | 28.1 | 46.3 | 56.2 | **22.9** | 3.72 | 0.43 |
| Gold-YOLO-S | 46.4 | 63.4 | 52.2 | 25.3 | 51.3 | 63.6 | 46.0 | 3.82 | 0.66 |
| Gold-YOLO-L | 52.7 | 70.1 | 58.0 | 32.1 | 57.0 | 70.1 | 153.8 | 10.91 | 0.74 |
| YOLOv8s | 44.3 | 60.5 | 51.3 | 24.1 | 50.8 | 59.8 | 26.0 | 7.13 | 0.64 |
| YOLOv8L | 52.6 | 68.3 | 59.1 | 32.6 | 58.5 | 67.3 | 152.0 | 14.76 | 0.72 |
| YOLOv10s | 46.3 | 62.7 | 51.3 | 26.1 | 53.2 | 62.7 | 23.0 | **2.53** | 0.65 |
| YOLOv10L | 53.8 | 69.6 | 58.4 | 33.6 | 59.8 | **70.8** | 121.0 | 7.42 | 0.73 |
| YOLOv10s+our FEM | 52.1 | 67.1 | 54.6 | 30.1 | 55.6 | 64.3 | 26.0 | 2.71 | 0.71 |
| YOLOv10L+our FEM | **58.6** | **73.4** | 61.3 | **36.5** | **62.3** | 70.6 | 132.0 | 7.51 | **0.81** |

Table 1: Quantitative results of different baseline approaches on the VPD-100K image test dataset. The best scores are highlighted in bold. For brevity, some intermediate YOLO variants are omitted; refer to the paper for the full 14-baseline comparison.

Qualitative visual performance comparison

Figure 3: Visual performance of the proposed Frequency-Enhanced Mechanism. Our framework tightly captures tiny PII structures previously smoothed out by standard spatial convolutions.

BibTeX

@inproceedings{vpd100k_2026,
  title     = {VPD-100K: Towards Generalizable and Fine-grained Visual Privacy Protection},
  author    = {[Author 1] and [Author 2] and [Author 3] and [Author 4] and others},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2026}
}