
AI/ML SaMD: FDA Artificial Intelligence Guidance in Practice

By Tammo Mueller
The FDA has authorized over 1,300 AI-enabled medical devices. Radiology accounts for 76% of them. Cardiology is second at 10%, followed by neurology, hematology, and gastroenterology. The pace is accelerating: 295 AI/ML devices cleared in 2025 alone, a record year. The regulatory framework has grown to match. Six years ago, the FDA had a discussion paper and a five-pillar action plan. Now there are six overlapping guidance documents, a finalized PCCP rule, and an IMDRF consensus standard. Teams building AI SaMD need to know which documents matter and which they can skim.

For ContourCompanion, an IMDRF Category IV AI SaMD that generates radiation therapy contours from medical imaging data, the AI/ML-specific requirements added a distinct layer on top of the standard design controls, risk management, and software lifecycle work. The algorithm originated from decades of research at Penn Medicine's Medical Image Processing Group. Productizing it meant building the validation evidence package alongside the software. The engineering was one workstream. Convincing FDA reviewers that the algorithm performed consistently across the intended patient population was another, and it consumed more calendar time than the code.

The FDA's AI/ML Regulatory Stack

There is no single "FDA artificial intelligence guidance." The regulatory framework is a stack of documents published over six years, each addressing a different dimension of AI/ML device development:

| Document                        | Year | Status             | Focus                                       |
|---------------------------------|------|--------------------|---------------------------------------------|
| AI/ML SaMD Action Plan          | 2021 | Final              | Five-pillar strategic roadmap               |
| GMLP Guiding Principles         | 2021 | Final (IMDRF 2025) | 10 development principles                   |
| Clinical Performance Assessment | 2022 | Final              | Performance metrics for radiology AI        |
| Transparency Guiding Principles | 2024 | Final              | User-facing transparency for ML devices     |
| PCCP for AI-DSFs                | 2024 | Final              | Predetermined change control for AI updates |
| AI-DSF Lifecycle Management     | 2025 | Draft              | Comprehensive TPLC submission recommendations |

The January 2025 draft guidance on AI-Enabled Device Software Functions is the most comprehensive document to date: 13 chapters and 5 appendices covering the full lifecycle from data management through post-market monitoring. It consolidates and extends recommendations that were previously scattered across multiple documents. Teams building AI SaMD should treat it as the primary reference, even in draft form.

One terminology trap: the AI community uses "validation" for a tuning step, where the validation set is data held out during training to tune hyperparameters. The FDA uses "validation" per 21 CFR 820.3(z): confirmation that requirements for a specific intended use are fulfilled. The January 2025 draft guidance calls this out explicitly. When FDA reviewers ask about your "validation approach," they mean clinical performance on independent test data, not your train/val/test split. Confusing these terms in a premarket submission is a deficiency waiting to happen.

Good Machine Learning Practice: The 10 GMLP Principles

Good Machine Learning Practice (GMLP) is a set of 10 guiding principles published jointly by the FDA, Health Canada, and the UK MHRA in October 2021. The IMDRF elevated these to a final consensus document (N88) in January 2025, giving them international regulatory weight.

The 10 principles:

  1. Multi-disciplinary expertise across the total product lifecycle
  2. Good software engineering and security practices (IEC 62304, security by design)
  3. Representative clinical study participants and datasets
  4. Training datasets independent of test datasets
  5. Reference datasets based on best available methods
  6. Model design tailored to available data
  7. Human factors focus (automation bias, user interaction)
  8. Testing under clinically relevant conditions
  9. Clear, relevant information to users (transparency)
  10. Monitor deployed models (post-market surveillance, performance degradation)

These principles are not checklist items. They are the conceptual framework FDA reviewers use to evaluate whether your development process was sound. Four of them generated the most scrutiny during ContourCompanion's review:

Principle 3 (representative data) drew the most questions. The training data came from a limited number of clinical sites, and the FDA wanted evidence that performance generalized to the broader intended use population. The multi-center clinical evaluation was designed specifically to address this: test data from sites not represented in the training set, with demographic diversity across patients. Where a particular subgroup was underrepresented, we documented the limitation rather than overstating coverage. That documentation turned out to be more important than the subgroup data itself: it showed the FDA we knew exactly what we had proven and what we hadn't.

Principle 7 (human factors) was the second area of focus. The clinical workflow required radiation oncologists to review and approve every AI-generated contour before export to the treatment planning system. The FDA wanted to understand how automation bias was mitigated. A clinician who trusts the AI output stops independently evaluating it, and a subtle contouring error propagates to the treatment plan. The clinical workflow design, the human factors evaluation, and the IFU all had to address this systematically.

Principle 2 (software engineering) mapped directly to the IEC 62304 lifecycle we were already running. Principle 10 (monitoring deployed models) mapped to the post-market surveillance plan and performance monitoring infrastructure.

For teams early in development: map your development practices to all 10 principles before writing the first line of your premarket submission. Gaps in your GMLP coverage become additional information requests during review.

[Figure: Locked vs. adaptive algorithm regulatory decision tree, showing the PCCP path for adaptive algorithms and the standard submission path for locked algorithms]

Locked Algorithms vs. Adaptive: The Regulatory Fork

Every AI/ML SaMD faces a binary regulatory decision: locked or adaptive.

A locked algorithm produces the same output for the same input, always. It does not learn from real-world data after deployment. The vast majority of FDA-cleared AI/ML devices use locked algorithms. Traditional verification and validation applies: test the algorithm, demonstrate performance, submit the evidence, and the cleared version is the cleared version.

An adaptive (continuously learning) algorithm changes its behavior over time based on new data. The FDA has not yet authorized a truly continuously learning device. The December 2024 PCCP final guidance allows manufacturers to pre-specify how their AI models will change post-clearance, but as legal analysis has noted, the PCCP framework does not inherently support real-time continuous learning. Modifications under a PCCP must be specific, verifiable, and bounded.

ContourCompanion used a locked algorithm for the initial clearance. The Penn Medicine research behind it was mature, and the clinical evidence supported the locked model's performance. In practice, "locked" meant a validated container image with a pinned version tag, deployed through the same IaC pipeline as the rest of the infrastructure. Same image, same weights, same output for every patient. The container tag was the version control artifact that FDA could point to and say: this is the cleared algorithm.
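Beyond pinning the container tag, the weights themselves can carry a cryptographic identity. A minimal sketch of that idea, using only the standard library; the function names and the idea of refusing to serve on a digest mismatch are illustrative, not a description of ContourCompanion's actual deployment code:

```python
import hashlib


def model_digest(weights_path: str) -> str:
    """SHA-256 of the model weights file: an immutable identifier
    for a specific locked algorithm version."""
    h = hashlib.sha256()
    with open(weights_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_cleared_version(weights_path: str, cleared_digest: str) -> None:
    """Refuse to start serving if the deployed weights differ from the
    digest recorded for the cleared version."""
    actual = model_digest(weights_path)
    if actual != cleared_digest:
        raise RuntimeError(
            f"Deployed weights {actual[:12]} do not match "
            f"cleared version {cleared_digest[:12]}"
        )
```

Recording the digest alongside the container tag gives two independent version-control artifacts to point at during review.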

An Algorithm Change Protocol was established for future model updates, allowing retraining on expanded datasets without filing a new 510(k) for each update, provided the changes stayed within pre-specified boundaries. The FDA's December 2024 PCCP guidance formalizes this pattern. The next post in this series covers PCCP design in detail.

The practical implication: if you are building an AI SaMD, plan for a locked algorithm at initial submission. Even if your long-term vision involves adaptive learning, the regulatory path is clearer and faster with a locked model first. Layer in a PCCP to define how the model evolves post-clearance.

Algorithm Validation: What FDA Reviewers Expect

[Figure: AI/ML SaMD validation evidence flow, from data management through ground truth, model development, performance assessment, and FDA submission]

Algorithm validation for AI/ML SaMD follows a structure the FDA has refined across hundreds of clearances. The January 2025 draft guidance recommends sponsors include:

Data management documentation:

  • Data collection sources, protocols, and inclusion/exclusion criteria
  • Data cleaning and processing pipeline documentation
  • Reference standard (ground truth) establishment methodology
  • Annotator qualifications, protocols, and inter-annotator agreement metrics
  • Evidence of training/test dataset independence (segregation proof)
  • Representativeness across demographic groups and clinical subgroups

Performance assessment:

  • Metrics appropriate to the intended use (sensitivity, specificity, AUC, Dice coefficients, Hausdorff distances, depending on the device type)
  • Performance broken down by clinically important subgroups (patient demographics, geographic sites, imaging equipment)
  • All known limitations documented
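For segmentation devices, the two workhorse metrics are the Dice Similarity Coefficient and the Hausdorff distance. A minimal pure-Python sketch, representing masks and contours as sets of coordinates for clarity (production pipelines operate on voxel arrays, typically with NumPy):

```python
import math


def dice(a: set, b: set) -> float:
    """Dice Similarity Coefficient between two binary masks,
    represented here as sets of voxel coordinates."""
    if not a and not b:
        return 1.0  # two empty masks agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))


def hausdorff(a: set, b: set) -> float:
    """Symmetric Hausdorff distance between two contours, represented
    as sets of (x, y) points: the worst-case boundary disagreement."""
    def directed(p, q):
        # largest distance from any point in p to its nearest point in q
        return max(min(math.dist(u, v) for v in q) for u in p)
    return max(directed(a, b), directed(b, a))
```

Dice summarizes volumetric overlap; Hausdorff catches localized boundary errors that a high Dice score can hide, which is why autocontouring submissions typically report both.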

For radiology AI specifically, the FDA's Clinical Performance Assessment guidance (September 2022) is the key reference. It defines the study designs FDA accepts:

Standalone performance testing is the baseline. Run the algorithm on an independent test dataset and report accuracy metrics against expert-established ground truth. The test data must be multi-site, demographically varied, and completely sequestered from training data.

Multi-Reader Multi-Case (MRMC) studies are expected when the AI assists clinical decision-making. Multiple clinicians read the same cases with and without AI assistance. MRMC accounts for reader variability and produces performance estimates that generalize beyond the specific readers in the study. For an autocontouring SaMD, this translated to radiation oncologists evaluating AI-generated contours against manually drawn contours across a representative case set.

Comparative reader studies use a crossover design: clinicians perform the task unaided, then with AI assistance (or vice versa), with a washout period. The primary endpoint is the difference in performance metrics between the aided and unaided conditions.

The study design choice depends on the intended use. A device that replaces a clinical function needs standalone evidence that it performs at or above the standard of care. A device that assists a clinician needs MRMC or comparative evidence that the human-AI team outperforms the human alone.

For ContourCompanion, the validation combined standalone performance testing (algorithm accuracy against expert-drawn ground truth contours) with clinical evaluation at multiple treatment sites. The standalone testing established that the algorithm met accuracy thresholds across the target anatomical structures. The multi-site evaluation established that performance held across different clinical environments, imaging equipment, and patient populations. Both were necessary. Standalone testing gave us the per-structure accuracy numbers FDA needs to evaluate an autocontouring device. The multi-site evaluation answered the generalizability question that standalone testing on data from one institution cannot.

The evaluation also measured time savings, which mattered for the clinical adoption argument but was secondary to accuracy for the FDA review. The published literature reports time savings of roughly 69% for AI autocontouring. Documenting this in the submission supported the substantial equivalence argument but did not substitute for the geometric accuracy evidence.

Ground Truth and Inter-Annotator Agreement

The reference standard, what FDA calls the "best-available representative or ground truth," is the foundation of every performance claim. For ContourCompanion, ground truth was expert-drawn contours created by experienced radiation oncologists. The quality of these annotations directly bounded the credibility of every performance metric derived from them.

FDA CDRH scientists have published guidance on reference standard establishment, including expert panel consensus, biopsy-confirmed ground truth, and follow-up confirmed ground truth. For imaging AI, expert panel consensus is the most common approach: multiple qualified annotators independently label the same data, and agreement metrics quantify consistency.

Inter-annotator agreement matters because it sets a ceiling on measurable algorithm performance. If two expert radiation oncologists disagree on the boundary of a structure by 2mm on average, an algorithm that matches expert performance within that 2mm range is performing at expert level, even if its Dice coefficient against any single expert is not perfect. FDA reviewers understand this. They look for the inter-annotator agreement data alongside the algorithm performance data. If you report algorithm accuracy but not annotator agreement, reviewers will ask for it.
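The agreement ceiling is straightforward to quantify. A sketch, again treating each expert's contour as a set of voxel coordinates; the function names are illustrative:

```python
from itertools import combinations


def dice(a: set, b: set) -> float:
    """Dice Similarity Coefficient between two coordinate sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def annotator_ceiling(annotations: list) -> float:
    """Mean pairwise Dice across expert contours of the same case.
    An algorithm whose Dice against the experts matches this value is
    performing at expert level, even though it is below 1.0."""
    pairs = list(combinations(annotations, 2))
    return sum(dice(a, b) for a, b in pairs) / len(pairs)
```

Reporting this number next to the algorithm's Dice scores preempts the reviewer question about how close to perfect the algorithm could possibly get.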

For autocontouring specifically, published benchmarks provide context: mean Dice Similarity Coefficients of 0.81-0.86 across structures, with clinically acceptable Hausdorff distances that vary by anatomical region. These benchmarks informed the clinical evaluation design but did not replace the need for site-specific performance evidence against the intended patient population.

Bias Assessment and Dataset Representativeness

GMLP Principle 3 requires datasets representative of the intended patient population. The January 2025 draft guidance goes further, recommending performance reporting across demographic subgroups including race, ethnicity, sex, and age. This is not aspirational. FDA reviewers check.

Bias in AI SaMD enters through multiple pathways:

| Source                    | Description                          | FDA Concern                                       |
|---------------------------|--------------------------------------|---------------------------------------------------|
| Training data composition | Underrepresentation of demographics  | Reduced performance for underrepresented groups   |
| Label quality             | Inconsistent ground truth labeling   | Systematic errors aligned with labeler biases     |
| Selection bias            | Non-random data selection            | Performance that doesn't generalize               |
| Measurement bias          | Systematic differences across sites  | Confounded performance metrics                    |
| Automation bias           | Over-reliance on AI output by users  | Differential patient impact where AI performs poorly |

Published research demonstrates the risk: one study found an ECG deep learning model had an AUC of 0.81 in patients aged 18-60 versus 0.73 in patients 81+, with numerically worse performance in Black patients compared to White patients. Performance disparities like these are exactly what FDA subgroup analysis is designed to detect.
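Subgroup AUC reporting of this kind is mechanical to produce once the test set carries demographic fields. A stdlib-only sketch, computing AUC via the Mann-Whitney interpretation (probability a random positive scores above a random negative, ties counted half); the case-record field names are illustrative:

```python
def auc(scores_pos: list, scores_neg: list) -> float:
    """AUC as the probability that a randomly chosen positive case
    scores higher than a randomly chosen negative case."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def auc_by_subgroup(cases: list, group_key: str) -> dict:
    """cases: dicts with 'score', 'label' (1/0), and demographic fields.
    Returns AUC per subgroup -- the breakdown FDA reviewers ask for."""
    groups = {}
    for c in cases:
        groups.setdefault(c[group_key], []).append(c)
    out = {}
    for g, cs in groups.items():
        pos = [c["score"] for c in cs if c["label"] == 1]
        neg = [c["score"] for c in cs if c["label"] == 0]
        if pos and neg:  # AUC undefined without both classes
            out[g] = auc(pos, neg)
    return out
```

The O(n²) pairwise loop is fine for illustration; real analyses use a rank-based formula or scikit-learn's `roc_auc_score`.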

For ContourCompanion, the multi-center clinical evaluation included data from multiple treatment sites with demographic diversity across the patient datasets. Performance was reported by anatomical region and, where the dataset supported it, by patient subgroup. Where the dataset was insufficient for a particular subgroup analysis, that limitation was documented explicitly. FDA reviewers respond better to honestly documented limitations than to claims of universal performance without supporting subgroup data.

Build your test dataset with subgroup analysis in mind from day one. Retrofitting demographic diversity after the algorithm is trained costs months you don't have.

The FDA AI/ML Premarket Submission Package

Pulling the above together, here is what an AI/ML SaMD premarket submission typically includes beyond the standard design controls and software documentation:

  1. Model description: Algorithm type, architecture summary, input/output specification, intended use statement specific to the AI function
  2. Data management plan: Training data sources, collection protocols, cleaning pipeline, annotation methodology, train/test segregation evidence
  3. Reference standard documentation: Ground truth establishment, annotator qualifications, inter-annotator agreement metrics
  4. Performance results: Primary and secondary endpoints on independent test data, subgroup analysis, all known limitations
  5. Bias assessment: Analysis of performance across demographic and clinical subgroups, documented limitations where data was insufficient
  6. Human-AI interaction: Clinical workflow integration, human factors evaluation, automation bias mitigation
  7. Monitoring plan: Post-market performance monitoring methodology, degradation detection approach, trigger criteria for reassessment
  8. PCCP (if applicable): Pre-specified modification boundaries, validation protocol for updates, impact assessment methodology

Items 1-6 are expected in the initial submission. Items 7-8 demonstrate lifecycle thinking that FDA increasingly expects. The risk management file ties everything together: AI-specific hazards (algorithm inaccuracy, distribution shift, automation bias) map to risk controls, which map to validation activities that produce the evidence above.
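Item 7, the monitoring plan, usually reduces to a concrete trigger rule. A sketch of one simple form, a rolling-window degradation detector; the window size and threshold here are illustrative placeholders, not FDA-prescribed values:

```python
from collections import deque


class PerformanceMonitor:
    """Flags reassessment when the mean per-case score (e.g. Dice against
    clinician-approved contours) over the last `window` cases drops below
    `threshold`. Parameters are illustrative, not regulatory values."""

    def __init__(self, window: int = 50, threshold: float = 0.80):
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def record(self, score: float) -> bool:
        """Record one case; return True when the trigger fires."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough cases to evaluate the window yet
        return sum(self.scores) / len(self.scores) < self.threshold
```

Whatever the exact rule, the submission should state the metric, the window, the threshold, and what happens when the trigger fires.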

What FDA Reviewers Actually Ask About

FDA reviews of AI/ML SaMD submissions produce Additional Information (AI) requests when the submission has gaps. These are the questions that come back most often:

Data representativeness. "Provide demographic breakdown of the test dataset and performance metrics by subgroup." This is the single most frequent AI-specific question. If your test dataset comes from two academic medical centers in the same region, FDA will ask how performance generalizes to the broader intended use population.

Ground truth methodology. "Describe the reference standard establishment process, including annotator qualifications and agreement metrics." If the ground truth is a single expert's annotation without agreement data, FDA flags it. A reference standard derived from a panel consensus with documented inter-annotator agreement is far stronger.

Training/test independence. "Provide evidence that test data was fully independent from training data." Any data leakage, even indirect (same patient appearing in both sets, or temporal overlap without documentation), triggers scrutiny. Temporal splits (training on older data, testing on newer data) are stronger than random splits, because they simulate real-world deployment where the model encounters data collected after training.
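Both checks named above, patient-level overlap and temporal ordering, are easy to automate and worth recording as submission evidence. A sketch under the assumption that each case record carries a patient identifier and a sortable acquisition date (field names are illustrative):

```python
def check_independence(train_cases: list, test_cases: list) -> dict:
    """Patient-level and temporal independence checks between training
    and test sets. Each case is a dict with 'patient_id' and 'acquired'
    (an ISO-format date string, so string comparison sorts correctly)."""
    train_ids = {c["patient_id"] for c in train_cases}
    test_ids = {c["patient_id"] for c in test_cases}
    overlap = train_ids & test_ids  # same patient in both sets = leakage
    latest_train = max(c["acquired"] for c in train_cases)
    earliest_test = min(c["acquired"] for c in test_cases)
    return {
        "patient_overlap": sorted(overlap),
        "temporal_split": earliest_test > latest_train,
    }
```

An empty `patient_overlap` plus `temporal_split: True` documents exactly the stronger split described above.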

Clinical workflow integration. "Describe how the device output is presented to the user, and how the labeling mitigates automation bias." For a device that assists rather than replaces, the FDA cares about whether the user can override the AI, whether the interface communicates uncertainty, and whether the clinical workflow preserves independent clinical judgment.

Known limitations. "Identify patient populations, imaging conditions, or clinical scenarios where device performance has not been evaluated or is expected to degrade." Honesty about limitations is a strength in FDA submissions, not a weakness. Undocumented limitations that surface post-market become the basis for recalls.

Frequently Asked Questions

What is GMLP and why does it matter for AI SaMD?

Good Machine Learning Practice (GMLP) is a set of 10 guiding principles published jointly by the FDA, Health Canada, and MHRA. The principles cover development team composition, data management, model design, testing, transparency, and post-market monitoring. GMLP is not a binding regulation, but it is the conceptual framework FDA reviewers use to evaluate AI/ML device submissions. Teams that cannot demonstrate alignment with GMLP principles should expect additional information requests during review.

What is the difference between a locked and adaptive AI algorithm for FDA purposes?

A locked algorithm produces identical output for identical input and does not change after deployment. An adaptive algorithm can change based on new data. The vast majority of FDA-cleared AI devices use locked algorithms. Adaptive algorithms require a Predetermined Change Control Plan (PCCP) that pre-specifies what changes are allowed, how they will be validated, and how their impact will be assessed. The FDA has not yet authorized a truly continuously learning device that updates in real time.

What performance metrics does FDA expect for AI/ML SaMD?

The metrics depend on the intended use. For diagnostic AI, FDA expects sensitivity, specificity, ROC/AUC, and false-positive/negative rates. For segmentation and contouring AI, Dice Similarity Coefficients and Hausdorff distances are standard. All metrics must be reported on independent test data with subgroup analysis across demographics and clinical variables. The FDA's Clinical Performance Assessment guidance provides specific recommendations for radiology AI performance evaluation.
