Surgical AI Has a Data Problem

Date: May 30, 2026

Author: Dr. Hema Anukula

Four papers published in the first half of 2026 tell a story the surgical AI field has been avoiding for years. Read together, they clarify both how far the technology has come and exactly what is keeping it from entering the operating room.

The Architecture Is Converging

Three of the four papers demonstrate genuine technical maturity.

Liu et al. (PLoS ONE, 2026) introduced a retrieval-augmented multimodal model for surgical image captioning that generates rich, accurate descriptions of operative scenes. The advance is specific: it directly addresses the hallucination problem that makes generic foundation models unreliable in clinical settings. Models that can't reliably identify what they're seeing can't support clinical decisions. This work takes a meaningful step toward accurate surgical scene comprehension.

Tang et al. (IEEE TMI, 2026) adapted the Segment Anything Model for surgical video without requiring manual prompts. Their distillation approach identifies vessels, instruments, and tissues across multiple surgical datasets automatically. The significance here is generalizability achieved through a new approach to model training rather than more labeled data.

Liang et al. (MedComm, 2026) combined segmentation and phase recognition in a single integrated system, LungSurg, for thoracoscopic lobectomy. The dataset (222 videos across eight centers, 32,000 annotations, over one million frames) is among the most substantial published in the field. Critically, the clinical test was direct: residents trained with LungSurg significantly outperformed those trained by conventional methods. This is one of the clearest demonstrations yet that surgical AI can change outcomes in training, not just benchmarks.

These three papers represent a meaningful maturation of the core capabilities: perception, segmentation, phase recognition. The stack is being assembled from the bottom up.

The Reality Check

Then there is the fourth paper.

Carstens et al. (npj Digital Medicine, 2026) reviewed 188 published studies on surgical scene understanding. The findings should be required reading for anyone building in this space.

  • 70.7% of studies relied on small, single-center datasets

  • Only 10.1% used external validation

  • Just 5.9% addressed clinical translation

  • 59% focused on a single procedure

That last point deserves a moment. More than half of the reviewed studies were conducted on laparoscopic cholecystectomy: one procedure, repeated across labs globally, producing results that may not generalize to other operations, other institutions, or the broader diversity of surgical practice.

The models are sophisticated. The data they were trained on is not.
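For a sense of scale, the percentages above can be converted into approximate study counts. The short Python sketch below assumes that all four figures share the same denominator of 188 reviewed studies; the article does not confirm this, so the counts are rough estimates rather than numbers reported by Carstens et al.

```python
# Rough conversion of the reported percentages into study counts.
# Assumption: every percentage is a share of the same 188 reviewed studies.
TOTAL_STUDIES = 188

reported_shares = {
    "small, single-center datasets": 70.7,
    "external validation": 10.1,
    "clinical translation addressed": 5.9,
    "single procedure": 59.0,
}


def approx_count(percent: float, total: int = TOTAL_STUDIES) -> int:
    """Nearest whole number of studies implied by a reported percentage."""
    return round(percent / 100 * total)


approx_counts = {label: approx_count(p) for label, p in reported_shares.items()}

for label, n in approx_counts.items():
    print(f"{label}: ~{n} of {TOTAL_STUDIES}")
```

If the denominator assumption holds, this works out to roughly 133, 19, 11, and 111 studies respectively: about 19 externally validated studies against some 133 built on small, single-center data.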

The Actual Bottleneck

The shortfall Carstens et al. describe is not primarily a volume problem, though volume matters. It is a depth problem.

Most surgical video annotation has focused on instrument-action-target triplets: what tool is being used, what action is being performed, what anatomical structure it is acting on. This is a useful starting point. But it captures only the surface of what surgical expertise actually consists of.

What it misses is the reasoning layer: why a surgeon chose this approach at this moment, for this patient, given this anatomy and this point in the procedure. This is the contextual and decision-level information that makes the difference between a model that can label a surgical video and a model that can actually support surgical judgment.

This is not a criticism of the researchers. That reasoning layer is genuinely difficult to capture. It lives in the implicit knowledge of experienced surgeons: pattern recognition and decision-making that has never been systematically articulated, let alone annotated.

The field has built powerful perception tools. The missing layer is context.
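To make the two layers concrete, here is a minimal Python sketch of what an annotation record might look like with and without the reasoning layer. Every class and field name here (`Triplet`, `DecisionContext`, `AnnotatedSegment`, `reasoning_coverage`) is hypothetical and chosen for illustration; it is not the schema of any published dataset or of any product, and the example labels are placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Triplet:
    """Surface layer: which tool, which action, which anatomical target."""
    instrument: str
    action: str
    target: str


@dataclass
class DecisionContext:
    """Hypothetical reasoning layer; field names are illustrative only."""
    phase: str       # point in the procedure
    rationale: str   # why this approach, at this moment, for this patient
    anatomy_notes: str = ""
    alternatives_considered: list[str] = field(default_factory=list)


@dataclass
class AnnotatedSegment:
    start_s: float
    end_s: float
    triplet: Triplet
    context: Optional[DecisionContext] = None  # absent in triplet-only data

    def has_reasoning_layer(self) -> bool:
        return self.context is not None and bool(self.context.rationale.strip())


def reasoning_coverage(segments: list[AnnotatedSegment]) -> float:
    """Fraction of segments that carry a non-empty rationale."""
    if not segments:
        return 0.0
    return sum(s.has_reasoning_layer() for s in segments) / len(segments)


# Placeholder data: one triplet-only segment, one with decision context.
segments = [
    AnnotatedSegment(0.0, 4.0, Triplet("grasper", "retract", "gallbladder")),
    AnnotatedSegment(
        4.0, 9.5,
        Triplet("hook", "dissect", "cystic_duct"),
        DecisionContext(phase="dissection", rationale="placeholder rationale"),
    ),
]
print(reasoning_coverage(segments))
```

A triplet-only dataset would score zero on a measure like this no matter how many hours of video it contains, which is one way to see why volume alone does not close the gap.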

Why This Is Nuevata's Problem to Solve

At Nuevata, we designed our platform around a specific thesis: that surgical intelligence has to begin with clinical understanding, not algorithmic capability.

Iris, our head-mounted 4K capture system, was built to capture open surgery. These procedures generate almost no structured video data today: laparoscopic and robotic systems create their own footage, while open cases do not. We are building toward the datasets that don't yet exist.

Mira, our AI-powered analytics and training platform, was designed from the ground up to support annotation that goes beyond instrument and phase tagging toward the contextual and decision-level layers the field is only now identifying as the core constraint.

This is a long-term project. The papers above confirm both the ambition and the gap. We're not the only ones who see it.


Papers referenced:

Liu et al., "Surgical Image Captioning," PLoS ONE (2026)
Tang et al., "Auto-Prompt Segmentation for Surgical Video," IEEE TMI (2026)
Liang et al., "LungSurg: Integrated Segmentation and Phase Recognition," MedComm (2026)
Carstens et al., "Surgical Scene Understanding: A Systematic Review," npj Digital Medicine (2026)