SurgSync: Time-Synchronized Multi-Modal Data Collection Framework and Dataset for Surgical Robotics

1 Worcester Polytechnic Institute    2 Johns Hopkins University    3 University of British Columbia    4 Politecnico di Milano
SurgSync overview diagram: framework, dataset, and post-processing toolbox

SurgSync is a time-synchronized multi-modal data collection framework for surgical robotics, featuring dual-mode recorders, a modern stereo endoscope, capacitive contact sensing, and a post-collection processing toolbox — validated on 214 user-study instances across four surgical tasks.

Authors

  • Haoying Zhou (1,2 WPI / JHU)
  • Chang Liu (2 JHU)
  • Yimeng Wu (2 JHU)
  • Junlin Wu (2 JHU)
  • Zijian Wu (3 UBC)
  • Yu Chung Lee (3 UBC)
  • Sara Martuscelli (4 Polimi)

1 Worcester Polytechnic Institute    2 Johns Hopkins University    3 University of British Columbia    4 Politecnico di Milano

Correspondence: hzhou6@wpi.edu   ·   pkaz@jhu.edu

Abstract

Most existing robotic surgery systems adopt a human-in-the-loop paradigm, often with the surgeon directly teleoperating the robotic system. Adding intelligence to these robots would enable higher-level control, such as supervised autonomy or even full autonomy. However, artificial intelligence (AI) requires large amounts of training data, which is currently lacking.

This work proposes SurgSync, a multi-modal data collection framework with offline and online synchronization to support both training and real-time inference, respectively. The framework is implemented on a da Vinci Research Kit (dVRK) and introduces (1) dual-mode (online/offline-matching) synchronized recorders, (2) a modern stereo endoscope to achieve image quality on par with clinical systems, and (3) additional sensors such as a side-view camera and a novel capacitive contact sensor to provide ground truth contact data.

The framework also incorporates a post-processing toolbox for tasks such as depth estimation, optical flow, and a practical kinematic reprojection method using Gaussian heatmaps. User studies with participants of varying skill levels are performed with ex-vivo tissue to provide clinically realistic data, and a network for surgical skill assessment is employed to demonstrate the utility of the collected data. Through the user-study experiments, we obtained a dataset of 214 validated instances across multiple canonical training tasks. All software and data will be made available to the research community.

Video

Video Coming Soon

A demo and explainer video will be posted here upon paper publication.

Dataset

User Study & Data Collection

We conduct a user study with 13 human subjects (3 female, 10 male) spanning three skill levels — Novice (N), Experienced (E), and Professional (P). Tasks are performed on phantoms and ex-vivo tissues for clinical realism.

The resulting dataset comprises 214 validated instances across four canonical surgical training tasks. Each instance provides time-aligned:

  • Stereo endoscope images (1080p, 60 Hz)
  • Side-view camera images (1080p, 30 Hz)
  • ECM & PSM kinematic data (6D Cartesian + gripper)
  • Tool-tissue contact ground truth (binary)
  • Timestamps across all modalities

All modalities are frame-aligned by our synchronized recorders, enabling direct multi-modal model training without additional interpolation (recording frequencies are listed in the Results section).
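In spirit, offline matching reduces to a nearest-timestamp lookup per topic with a tolerance. A minimal sketch of that idea (the function name and the 50 ms tolerance are illustrative, not the paper's implementation):

```python
import numpy as np

def match_offline(ref_stamps, topic_stamps, tol=0.05):
    """For each reference timestamp, find the index of the nearest message
    on another topic; flag pairs farther apart than `tol` seconds.
    Hypothetical helper illustrating nearest-timestamp matching."""
    ref = np.asarray(ref_stamps, dtype=float)
    other = np.asarray(topic_stamps, dtype=float)
    # Index of the first element in `other` >= each reference stamp
    right = np.searchsorted(other, ref)
    right = np.clip(right, 1, len(other) - 1)
    left = right - 1
    # Pick whichever neighbor is closer in time
    nearest = np.where(np.abs(other[right] - ref) < np.abs(other[left] - ref),
                       right, left)
    keep = np.abs(other[nearest] - ref) <= tol
    return nearest, keep

# 10 Hz reference stream vs. a 60 Hz camera stream with a small clock offset
ref = np.arange(0.0, 1.0, 0.1)
cam = np.arange(0.0, 1.0, 1 / 60) + 0.003
idx, ok = match_offline(ref, cam)
```

With sorted timestamp arrays this is O(n log n), which is why the matching can run as a fast post-collection pass rather than during recording.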

Dataset Distribution

Training Task          | Novice (N)       | Experienced (E)  | Professional (P) | Total
                       | Online   Offline | Online   Offline | Online   Offline |
Suturing & Knot Tying  |   13        2    |   36       12    |    2       39    |  104
Peg Transfer           |    7        —    |   11        —    |    —        —    |   18
Tissue Manipulation    |    9        —    |   12        —    |    —        —    |   21
Dissection             |    6        —    |   15        9    |    1       40    |   71
Total                  |   35        2    |   74       21    |    3       79    |  214

— = not collected

Post-Collection Processing Toolbox

Beyond data acquisition, we provide a configurable and extensible post-collection processing toolbox that standardizes calibration and data processing — covering stereo rectification, kinematic reprojection, depth estimation, optical flow, and data annotation.

Kinematic Reprojection (4× speed)

Projects 3D PSM tool-yaw link positions into the endoscope image plane using hand-eye calibration and Gaussian heatmap rendering, providing per-frame kinematic ground truth for both PSM1 and PSM2.

Original Endoscope Video

Kinematic Reprojection

PSM1 Kinematic Reprojection
PSM2 Kinematic Reprojection
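The steps above (base-to-camera transform from hand-eye calibration, pinhole projection, Gaussian heatmap rendering) can be sketched in a few lines. The intrinsics `K`, the transform `T_cam_base`, and the 3D point below are placeholder values for illustration, not the calibrated ones from the toolbox:

```python
import numpy as np

# Placeholder intrinsics and hand-eye transform (illustrative only)
K = np.array([[800.0,   0.0, 640.0],
              [  0.0, 800.0, 360.0],
              [  0.0,   0.0,   1.0]])
T_cam_base = np.eye(4)  # base -> camera, normally from hand-eye calibration

def reproject(p_base, K, T_cam_base):
    """Project a 3D tool point (robot base frame) into the image plane."""
    p = T_cam_base @ np.append(p_base, 1.0)   # transform into camera frame
    uvw = K @ p[:3]                           # pinhole projection
    return uvw[:2] / uvw[2]                   # pixel coordinates (u, v)

def gaussian_heatmap(uv, shape=(720, 1280), sigma=8.0):
    """Render a 2D Gaussian centered at the reprojected pixel."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    d2 = (xs - uv[0]) ** 2 + (ys - uv[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

uv = reproject(np.array([0.02, -0.01, 0.10]), K, T_cam_base)
hm = gaussian_heatmap(uv)  # per-frame heatmap label for one PSM
```

Rendering the label as a heatmap rather than a bare pixel coordinate makes it directly usable as a dense training target for keypoint-style networks.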

Depth Estimation (4× speed)

Dense stereo depth estimation using FoundationStereo on the rectified stereo endoscope stream, producing per-pixel disparity maps synchronized with all other modalities.

Original Endoscope Video

Disparity Map

Disparity Map Video
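Under a rectified stereo geometry, disparity converts to metric depth as Z = f·B/d. A minimal sketch of that conversion, with an illustrative focal length and baseline (FoundationStereo supplies the disparity itself):

```python
import numpy as np

def disparity_to_depth(disp, focal_px, baseline_m, eps=1e-6):
    """Convert a disparity map (pixels) from a rectified stereo pair into
    metric depth via Z = f * B / d. Near-zero disparities are invalid (NaN).
    Parameter values in the example are illustrative, not calibrated."""
    disp = np.asarray(disp, dtype=float)
    depth = np.full_like(disp, np.nan)
    valid = disp > eps
    depth[valid] = focal_px * baseline_m / disp[valid]
    return depth

# e.g. a 4 mm stereo baseline and ~800 px focal length
d = disparity_to_depth(np.array([[32.0, 0.0]]), focal_px=800.0, baseline_m=0.004)
```

Masking zero-disparity pixels as NaN keeps occluded or untextured regions from producing spurious near-infinite depths downstream.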

Optical Flow (4× speed)

Dense optical flow estimation using RAFT on sequential endoscope frames, capturing instrument and tissue motion patterns for training motion-aware models.

Original Endoscope Video

Optical Flow

Optical Flow Video
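RAFT outputs an H×W×2 field of per-pixel displacements. One simple post-hoc use of such a field is flagging near-idle frames by mean flow magnitude; `motion_score` below is a hypothetical helper for that purpose, not part of RAFT or the toolbox:

```python
import numpy as np

def motion_score(flow):
    """Mean flow magnitude (pixels/frame) over an HxWx2 dense flow field,
    e.g. to flag idle frames before training motion-aware models."""
    return float(np.linalg.norm(flow, axis=-1).mean())

# A uniform (3, 4) px/frame field has magnitude 5 everywhere
flow = np.zeros((4, 4, 2))
flow[..., 0] = 3.0
flow[..., 1] = 4.0
s = motion_score(flow)  # 5.0
```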

Annotation GUI (4× speed)

A custom PyQt-based annotation tool for labeling tool-tissue contact and task descriptions during frame-by-frame playback of synchronized recordings.

Annotation Interface

Data annotation GUI built with PyQt for contact detection and task description labeling

Demo

Quick Demo on how to use the Annotation GUI

Key Contributions

Dual-Mode Synchronized Recorder

Two synchronized recorders — online-matching for real-time use (6.36 ± 4.72 ms mean latency) and offline-matching for high-fidelity dataset construction (1.35 ± 0.81 ms mean latency). Time synchronization is treated as a first-class design constraint across all modalities.
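The online-matching idea (buffer the newest message per topic, emit a synchronized tuple once all stamps agree within a tolerance) can be sketched as below. The class name and API are hypothetical; the dVRK recorder differs in detail:

```python
from collections import deque

class OnlineMatcher:
    """Minimal sketch of an online-matching recorder: keep the latest
    messages per topic and emit a synchronized tuple when every topic
    has a message within `slop` seconds of the others."""

    def __init__(self, topics, slop=0.01, depth=30):
        self.slop = slop
        self.queues = {t: deque(maxlen=depth) for t in topics}

    def push(self, topic, stamp, msg):
        """Add a message; return a matched {topic: (stamp, msg)} dict or None."""
        self.queues[topic].append((stamp, msg))
        return self._try_match()

    def _try_match(self):
        if any(not q for q in self.queues.values()):
            return None  # some topic has not published yet
        # Candidate tuple: newest message on each topic
        cand = {t: q[-1] for t, q in self.queues.items()}
        stamps = [s for s, _ in cand.values()]
        if max(stamps) - min(stamps) <= self.slop:
            return cand
        return None

m = OnlineMatcher(["left_image", "kinematics"], slop=0.01)
m.push("left_image", 0.000, "img0")
out = m.push("kinematics", 0.004, "kin0")  # stamps within slop -> matched
```

This is the same pattern as ROS `message_filters.ApproximateTimeSynchronizer`: matching quality trades off against latency through the `slop` parameter.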

Upgraded Imaging Stack

Integration of a modern chip-on-tip stereo endoscope with the dVRK-Si via a custom 3D-printed holder. Achieves over 30× higher Laplacian variance (529.48 ± 23.77 vs 16.93 ± 2.47) compared to the default dVRK-Si scope — delivering significantly sharper frames for downstream perception tasks.
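The variance-of-Laplacian sharpness metric behind this comparison is easy to reproduce. The NumPy version below mirrors the common `cv2.Laplacian(img, cv2.CV_64F).var()` recipe using the standard 4-neighbor Laplacian kernel; the test images are synthetic stand-ins:

```python
import numpy as np

def laplacian_variance(gray):
    """Sharpness metric: variance of the 4-neighbor Laplacian response
    over the interior of a grayscale image."""
    g = np.asarray(gray, dtype=float)
    lap = (-4.0 * g[1:-1, 1:-1]
           + g[:-2, 1:-1] + g[2:, 1:-1]
           + g[1:-1, :-2] + g[1:-1, 2:])
    return float(lap.var())

flat = np.full((64, 64), 128.0)                   # uniform frame: zero response
sharp = np.indices((64, 64)).sum(0) % 2 * 255.0   # checkerboard: strong edges
```

Higher variance means stronger high-frequency content, i.e. sharper frames; blur suppresses the Laplacian response toward zero.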

Tool-Tissue Contact Ground-Truth Sensing

A capacitive contact sensor (Arduino UNO Rev3) seamlessly integrated via the dVRK digital input. Supports non-polar, monopolar, and bipolar surgical instruments. Provides reliable binary contact ground truth on human and ex-vivo animal tissues.
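A binary contact line read through a digital input typically needs debouncing before it is usable as ground truth. A sliding-window majority vote is one simple option; the window length here is an illustrative choice, not the paper's:

```python
import numpy as np

def debounce(raw, window=5):
    """Majority-vote filter over a sliding window to clean a binary
    contact signal (e.g. from a capacitive sensor on a digital input).
    Removes single-sample spikes and fills single-sample dropouts."""
    raw = np.asarray(raw, dtype=int)
    pad = window // 2
    padded = np.pad(raw, pad, mode="edge")
    # Sliding-window sum via cumulative sums, then majority threshold
    csum = np.cumsum(np.insert(padded, 0, 0))
    votes = csum[window:] - csum[:-window]
    return (votes * 2 > window).astype(int)

raw = np.array([0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1])
clean = debounce(raw)  # isolated spike removed, single dropout filled
```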

Post-Collection Processing Toolbox

A configurable and extensible toolbox covering stereo rectification, kinematic reprojection via Gaussian heatmap (PSM tool-yaw → image), depth estimation (FoundationStereo), optical flow (RAFT), and a custom PyQt data annotation GUI for contact and event/phase labeling.

System Setup

Overall experimental setup: dVRK-Si with modern endoscope, side camera, and contact sensor

Hardware Integration

The primary system is built on the dVRK-Si platform (Patient Side Manipulator, Endoscope Camera Manipulator, Master Tool Manipulators). A secondary cross-platform setup on dVRK Classic validates generalizability.

  • Chip-on-tip stereo endoscope — mounted on the ECM via a custom 3D-printed compact holder with coaxial alignment and quick attachment
  • Side-view camera — Intel® RealSense™ RGBD for additional viewpoint
  • Capacitive contact sensor — Arduino UNO Rev3 connected to the dVRK digital input for tool-tissue contact detection
  • Master + Client PC architecture — Master PC handles image pipeline; Client PC runs dVRK software

Cameras capture at 1080p. The stereo endoscope runs at 60 Hz; the side camera at 30 Hz. The dVRK kinematic/control loop runs at 1 kHz.

Imaging Quality Comparison

dVRK-Si default endoscope sample frame

dVRK-Si Endoscope
Laplacian Variance: 16.93 ± 2.47

Custom chip-on-tip endoscope sample frame

Our Modern Chip-on-Tip Endoscope
Laplacian Variance: 529.48 ± 23.77

Results

Synchronization Performance

Attribute                    | Online-Matching | Offline-Matching
Time latency mean ± std (ms) | 6.36 ± 4.72     | 1.35 ± 0.81
Time latency median (ms)     | 5.58            | 1.33
Recording frequency (Hz)     | 4.04 ± 1.69     | 10
Ready to use immediately     | ✅ Yes          | ⚙️ Post-collection time-matching required

Raw timestamp-delay distribution across all ROS topics for both recorders.

Dataset Validation via Skill Assessment

To demonstrate the utility of the SurgSync dataset, we validate it through a surgical skill assessment task. We train a unified multi-path assessment model that jointly ingests kinematic (14D Cartesian + gripper state), visual (2048D ResNet-101 features), and gesture (14D one-hot encoded) modalities. All modalities are per-frame aligned by our synchronized recorder, eliminating the need for additional temporal interpolation. More detailed results are presented in the manuscript.

BibTeX

@inproceedings{zhou2026surgsync,
  author    = {Zhou, Haoying and Liu, Chang and Wu, Yimeng and Wu, Junlin
               and Wu, Zijian and Lee, Yu Chung and Martuscelli, Sara
               and Salcudean, Septimiu E. and Fischer, Gregory S.
               and Kazanzides, Peter},
  title     = {{SurgSync}: Time-Synchronized Multi-modal Data Collection
               Framework and Dataset for Surgical Robotics},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026},
}