Multimodal machine learning for risk-stratified bundled payments in spinal surgery

Spine surgery is a major contributor to healthcare costs in the United States¹, with its high volume and complexity driving an annual economic burden exceeding $24.3 billion²,³. Despite the introduction of value-based care (VBC) frameworks such as Bundled Payments for Care Improvement (BPCI) and BPCI Advanced (BPCI-A), bundled payment models (BPMs) have seen limited success in spine surgery⁹,¹⁰. This is primarily due to the inherent heterogeneity of spinal procedures: current BPMs rely heavily on Medicare Severity Diagnosis Related Groups (MS-DRGs), which exhibit significant intra-coding variability⁷,¹¹. Additionally, existing models offer minimal risk adjustment for diverse patient presentations, demographics, and comorbidities⁹,¹³. Overcoming these barriers will require more nuanced and broadly applicable predictive models that account for procedural complexity and patient-specific factors, enabling cost-efficient care across varied healthcare settings.

Our study addresses this gap by presenting a multimodal machine learning (ML) framework to risk-adjust spine surgery episodes, with excellent predictive performance for outlier total costs. The resulting four-tiered patient-specific payment model (PSPM) captures the influence of patient comorbidities, demographics, and surgical factors on costs, with payments ranging from $17,682.36 in the low-risk group to $123,909.27 in the severe-risk group. Of the 1898 patients in our cohort, 209 (11.0%) met or exceeded the outlier cost threshold (≥$107,190.10), contributing to total losses of $12.7 million, while the remaining 1689 cases (89.0%) generated total profits of $1.8 million. Outlier patients experienced significantly worse outcomes than non-outliers, including longer median lengths of stay (14–16 days vs. 3 days), higher ICU admission rates (30–34.9% vs. 4.4–5.3%), and increased 90-day reoperation rates (47.8–51.7% vs. 5.4–6.3%). These findings underscore the limitations of flat-rate, two-sided risk BPMs, which penalize providers treating high-risk populations and may disincentivize the acceptance of complex cases.
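As a rough illustration of how a predicted outlier risk might be mapped onto a four-tier payment schedule, the sketch below flags outlier episodes at the reported threshold and assigns tiers from a risk score. The tier cut-points, the two middle tier names, and the rule of paying each tier its mean observed cost are assumptions made for this example; the text does not specify them.

```python
from collections import defaultdict

# Outlier total-cost threshold reported in the text (USD).
OUTLIER_THRESHOLD = 107_190.10

# Hypothetical tier names and risk cut-points; only "low" and "severe"
# are named in the text, and the actual cut-points are not given.
TIERS = ("low", "moderate", "high", "severe")
CUTOFFS = (0.25, 0.50, 0.75)


def is_outlier(total_cost, threshold=OUTLIER_THRESHOLD):
    """Return True when an episode's total cost meets or exceeds the threshold."""
    return total_cost >= threshold


def assign_tier(risk, cutoffs=CUTOFFS):
    """Map a predicted outlier risk in [0, 1] to one of four tiers."""
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must lie in [0, 1]")
    for tier, cutoff in zip(TIERS, cutoffs):
        if risk < cutoff:
            return tier
    return TIERS[-1]


def tier_payments(episodes):
    """Assumed rule: each tier's payment is the mean observed cost in that tier.

    `episodes` is an iterable of (predicted_risk, total_cost) pairs.
    """
    costs_by_tier = defaultdict(list)
    for risk, cost in episodes:
        costs_by_tier[assign_tier(risk)].append(cost)
    return {tier: sum(costs) / len(costs) for tier, costs in costs_by_tier.items()}


# Toy episodes, invented for illustration only (not study data).
toy_episodes = [(0.05, 15_000.0), (0.10, 21_000.0), (0.60, 70_000.0), (0.90, 130_000.0)]
payments = tier_payments(toy_episodes)
```

Under a flat-rate bundle every episode would receive the same payment, so the severe-tier episode in the toy data would be reimbursed well below its cost; tiering the payment is what removes that penalty.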

A stratified BPM that incorporates patient-level risk could address the limitations of flat-rate bundles in spine surgery¹⁹. Hines et al. reported that 42% of spine surgeons altered their care decisions under bundled payment plans compared to fee-for-service models⁷, highlighting the unintended consequences of uniform payment structures, which may discourage providers from taking on complex cases due to financial disincentives. By adjusting payments to reflect patient comorbidities and predicted risks, a stratified BPM could better encourage providers to perform necessary surgeries without fear of financial penalties¹⁹. Predictive frameworks that incorporate patient characteristics—such as demographics, preoperative comorbidities, and surgical approach—effectively stratify patients and align payments with their risk levels¹⁹. This approach, consistent with our work, ensures equitable reimbursement, upholds high standards of care, and incentivizes providers to treat higher-risk populations, advancing VBC. Furthermore, the limited impact of provider- and hospital-level factors on variations in spine surgery payments highlights the need to prioritize patient-level data as the foundation for risk-based BPMs. Kahn et al. argue that BPMs will fall short if they do not account for the complexity and diversity of patient populations, challenging the notion that cost variation can be meaningfully reduced through uniform provider behavior modifications alone²⁰. By incorporating detailed patient-level data, our PSPM aligns with this perspective, offering a direct step toward more equitable and sustainable BPMs that support care for complex patients.

BPMs have shown mixed results in spine surgery, with inconsistent effects on cost savings and patient outcomes. Previous studies have largely focused on Medicare-based models like BPCI and BPCI-A, which apply to a limited subset of patients¹⁰,²¹. In contrast, Issa et al. demonstrated the potential of private payer BPMs, reporting net losses initially but achieving profitability by year three, alongside reductions in readmissions, revisions, and postoperative length of stay (LOS)²². These improvements were attributed to better risk stratification and increased flexibility in negotiating with private insurers. Our study builds on this approach with a predictive framework that incorporates patient-specific factors and is adaptable across both private and public payer systems. By tailoring payments to individual risk profiles, our PSPM offers a scalable way to address the variability in spine surgery outcomes and to support sustainable reimbursement strategies across diverse payer systems.

Our work builds upon the foundational contributions of Karnuta et al., who developed a PSPM for patients undergoing dorsal and lumbar fusion using traditional ML methods¹⁸, and incorporates several key advances. First, we employ a multimodal ML approach that jointly reasons over structured data and free-text inputs processed with natural language processing (NLP), improving granularity and personalization over traditional severity indexes such as all-patients-refined (APR) risk scores. Instead of relying on APR classifications, our patient-level modeling leverages intrinsic patient data to reduce susceptibility to administrative gaming and to produce more personalized predictions. Second, we address heteroscedasticity in predicted probabilities through a power transformation, stabilizing the variance across the distribution of predictions to maintain equitable payment distributions across diverse risk strata. Third, we emphasize a lightweight, adaptable model that remains sustainable by leveraging institution-specific data and can be retrained annually on cumulative patient data to refine cost adjustments and preserve model performance over time. To evaluate longitudinal compatibility, we conducted an additional analysis in which the model was trained on patient data from 2018 to 2022 and tested independently on 2023 data. As shown in Supplementary Table 9, performance remained robust, supporting the feasibility of yearly iterative updates while adjusting for future inflation. This strategy allows the stakeholders involved in a patient's care to compare, after each episode, the accumulated costs and resource utilization against the projected or contracted cost of the episode, thereby enhancing the clinical and economic relevance of the PSPM to individual health systems.
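The second and third points can be sketched in a few lines. The example below applies a simple monotone power map to predicted probabilities and splits records by year for the train-on-2018-to-2022, test-on-2023 analysis. The exponent of 0.5, the function names, and the `year` field are placeholders chosen for illustration; the text does not give the exponent or the data schema actually used.

```python
def power_transform(probabilities, exponent=0.5):
    """Apply p -> p ** exponent to each probability.

    The map is monotone for any positive exponent, so patients keep their
    risk ordering; an exponent below 1 spreads out small probabilities.
    The default of 0.5 is an assumed placeholder, not the study's value.
    """
    if exponent <= 0:
        raise ValueError("exponent must be positive")
    transformed = []
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        transformed.append(p ** exponent)
    return transformed


def temporal_split(records, first_train_year=2018, test_year=2023):
    """Split records into a training window and a held-out test year.

    Each record is assumed to be a dict with an integer "year" key
    (a hypothetical schema for this sketch).
    """
    train = [r for r in records if first_train_year <= r["year"] < test_year]
    test = [r for r in records if r["year"] == test_year]
    return train, test


# Toy records, invented for illustration only.
toy_records = [{"year": y, "risk": 0.25} for y in (2017, 2018, 2020, 2022, 2023, 2023)]
train_set, test_set = temporal_split(toy_records)
stabilized = power_transform([r["risk"] for r in test_set])
```

Splitting by calendar year rather than at random keeps later episodes out of the training set, which is what makes the 2023 evaluation a fair stand-in for a yearly retraining cycle.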

This work yielded several notable findings on costs and profitability across spinal procedures. For example, anterior cervical discectomy and fusion (ACDF) demonstrated a favorable cost-to-profit ratio, making it one of the most economically viable spinal procedures. In contrast, lumbar fusion procedures such as transforaminal lumbar interbody fusion (TLIF) and posterior lumbar interbody fusion (PLIF) showed higher mean direct and variable costs than ACDF, which aligns with prior literature indicating that lumbar procedures are typically more resource-intensive due to longer operating times, increased complexity, and greater variability in surgical approach²⁰,²¹,²². Despite these higher costs, lumbar fusion procedures demonstrated relatively lower profitability, reflecting the growing concern about the financial strain these complex surgeries place on institutions, especially when performed on high-risk patients under BPMs. This discrepancy in profitability between cervical and lumbar procedures underscores the need for refined cost management strategies and ongoing development of PSPMs, particularly for high-complexity, high-cost surgeries like lumbar fusions, where profit margins are tighter.

Notably, all models in this study were trained on a mixed-income patient population that includes racially underserved groups, addressing the underrepresentation of such groups in prior orthopedic ML models. Our cohort predominantly included publicly insured (45.6% Medicare, 33.9% Medicaid) and racially heterogeneous (39.8% Other/Hispanic, 34.8% Black, 12.4% White) patients. A systematic review by Lans et al. revealed that only approximately 25% of studies considered race or ethnicity in their ML models. Furthermore, existing orthopedic ML models rarely account for broader social determinants of health (SDOH), such as socioeconomic status and insurance type, beyond basic factors like age and gender. However, SDOH, including race/ethnicity, socioeconomic status, and insurance, can significantly affect surgical outcomes, such as length of stay and reoperations. Our model incorporated SDOH by including age, gender, race/ethnicity, and insurance status as key input variables.

Another key strength of our models is their explainability. A major barrier to the clinical adoption of AI and ML is their “black box” nature, where the decision-making processes are often opaque, even to developers, let alone healthcare practitioners. This lack of transparency can undermine trust and raise accountability, regulatory, and ethical concerns²³,²⁴,²⁵,²⁶. In our models, the explainable AI (XAI) analysis revealed that factors such as specific MS-DRG codes, preoperative hemoglobin and platelet counts, BMI, documentation of scoliosis, and sepsis were among the most influential variables in predicting both total and variable costs (Fig. 3). For instance, a patient undergoing a complex spinal fusion (MS-DRG 453) with an elevated BMI and low preoperative hemoglobin would be flagged by the model as high-risk for outlier costs due to the combination of obesity-related surgical complexity and the increased perioperative resource utilization anticipated with anemia. Importantly, a clinician evaluating this same patient would likely recognize these same factors as red flags for adverse outcomes and elevated costs. By explicitly showing that it prioritizes clinically intuitive variables (such as procedure complexity, obesity, and anemia), the model fosters clinician and administrator trust by aligning with real-world decision-making. This transparency not only helps users understand why a patient is flagged as high-risk but also provides actionable information for perioperative optimization and financial planning.
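To make the idea of per-patient attribution concrete, the sketch below decomposes a logistic risk score into additive feature contributions relative to a reference patient, using the example of MS-DRG 453, elevated BMI, and low hemoglobin. It is a toy stand-in, not the study's model or its XAI method: the logistic form, every coefficient, and the reference values are hypothetical numbers chosen only to show the mechanics.

```python
import math

# Hypothetical coefficients and reference patient; NOT values from the study.
COEFFICIENTS = {"drg_453": 1.2, "hemoglobin": -0.30, "bmi": 0.05}
REFERENCE = {"drg_453": 0.0, "hemoglobin": 13.5, "bmi": 28.0}
INTERCEPT = -2.0


def explain(patient):
    """Return (outlier probability, contributions ranked by absolute size).

    For a linear-in-logit model, each feature's contribution is its
    coefficient times its deviation from the reference value, and the
    contributions sum to the change in log-odds versus the reference.
    """
    contributions = {
        name: COEFFICIENTS[name] * (patient[name] - REFERENCE[name])
        for name in COEFFICIENTS
    }
    reference_logit = INTERCEPT + sum(
        COEFFICIENTS[name] * REFERENCE[name] for name in COEFFICIENTS
    )
    logit = reference_logit + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return probability, ranked


# Toy patient: complex fusion (MS-DRG 453), BMI 38, hemoglobin 10 g/dL.
probability, ranked = explain({"drg_453": 1.0, "hemoglobin": 10.0, "bmi": 38.0})
top_features = [name for name, _ in ranked]
```

In this toy case all three factors push the score upward, with the procedure code contributing most, which mirrors the kind of ranked, clinically readable output the paragraph describes.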

Additionally, the incorporation of NLP captures information embedded within free-text clinical notes, such as procedural details and comorbidities, mirroring the natural process by which clinicians integrate information from multiple sources. In future work, we aim to further enhance model interpretability by incorporating imaging and physical examination data to better approximate the holistic approach clinicians use in patient evaluation.

This study is not without limitations. A minority of the features in our models, such as “mg,” are frequently repeated free-text terms and lack clinical relevance to the primary study outcomes. These features are most likely artifacts of the training data and could contribute to model overfitting.
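One simple way to screen for such artifacts, offered here only as a sketch and not as a step the study performed, is to drop tokens that appear in nearly every note, since a term present everywhere cannot separate outliers from non-outliers. The tokenizer, the 0.9 cut-off, and the toy notes below are assumptions made for illustration.

```python
import re
from collections import Counter


def document_frequency(notes):
    """Count, for each lowercase alphabetic token, how many notes contain it."""
    counts = Counter()
    for note in notes:
        # A set ensures each token is counted once per note.
        counts.update(set(re.findall(r"[a-z]+", note.lower())))
    return counts


def ubiquitous_tokens(notes, max_doc_fraction=0.9):
    """Return tokens found in more than `max_doc_fraction` of the notes."""
    if not notes:
        return []
    counts = document_frequency(notes)
    total = len(notes)
    return sorted(
        token for token, count in counts.items() if count / total > max_doc_fraction
    )


# Toy notes, invented for illustration only.
toy_notes = [
    "Metoprolol 25 mg daily. History of scoliosis.",
    "Aspirin 81 mg. Lumbar stenosis.",
    "Gabapentin 300 mg nightly. Prior sepsis.",
]
to_drop = ubiquitous_tokens(toy_notes)
```

On the toy notes only the dosage unit is flagged, while clinically meaningful terms such as "scoliosis" and "sepsis" are kept because they appear in a minority of notes.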

Additionally, we have not yet performed external validation of this model with patient data from outside institutions. NLP excels at identifying specific patterns in clinical note documentation, but this strength poses challenges for generalizability across institutions due to variations in documentation practices influenced by provider experience, patient demographics, and geographical location. However, we believe the model framework is well-suited for external adaptation, as it primarily relies on EHR variables (e.g., demographics, comorbidities, laboratory values) and cost components that are widely available and standardized across healthcare systems. Moreover, the NLP component employs generalizable preprocessing techniques, which can be retrained on external free-text corpora to account for local documentation patterns, maximizing the model’s text-processing flexibility. Importantly, the model’s modular design allows for straightforward recalibration of outlier cost thresholds and adaptation to institution-specific cost-accounting practices without requiring changes to the model’s core structure. To further address external generalizability, we plan to initiate a prospective study within our institution with full EHR integration and are actively pursuing external validation through collaborations with outside institutions. Future work might include predicting financial parameters as a regression problem with direct cost outputs, rather than a binary classification problem.
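Recalibrating the outlier threshold for a new institution could look like the sketch below, which derives the threshold from that site's own cost distribution and then rebuilds the binary labels. Defining the threshold as a nearest-rank percentile, and the choice of the 90th percentile, are assumptions for this example; the text reports the threshold value but not how it was derived.

```python
import math


def percentile_threshold(costs, percentile=90.0):
    """Nearest-rank percentile of a site's episode costs.

    The 90th percentile is an assumed default, not the study's definition.
    """
    if not costs:
        raise ValueError("costs must not be empty")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must lie in (0, 100]")
    ordered = sorted(costs)
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[rank - 1]


def label_outliers(costs, threshold):
    """Binary classification target: 1 when cost meets or exceeds the threshold."""
    return [int(cost >= threshold) for cost in costs]


# Toy cost distributions, invented for illustration only.
site_a_costs = [10_000.0 * k for k in range(1, 11)]
site_b_costs = [2 * cost for cost in site_a_costs]  # same case mix, higher prices

site_a_labels = label_outliers(site_a_costs, percentile_threshold(site_a_costs))
site_b_labels = label_outliers(site_b_costs, percentile_threshold(site_b_costs))
```

Because the threshold is recomputed per site, the two toy institutions end up with identical labels despite a twofold difference in prices, so the classifier itself does not need to change.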

Lastly, a limitation of the proposed Patient-Specific Payment Model (PSPM) is the exclusion of certain surgical factors known to significantly influence costs, such as the use of interbody cages, specialized bone grafts, and intraoperative neuromonitoring, because these data were unavailable in the current study⁹,¹³. Additionally, because this study was conducted at a single institution, regional variations in cost could not be assessed.

This study is the first to develop a multimodal ML algorithm for predicting patients likely to generate outlier total and variable costs in spine surgery. Our preoperative model maximizes clinical applicability by enabling targeted use of high-risk protocols. By integrating NLP, we increase the model’s adaptability, allowing it to incorporate larger datasets and capture uncoded diagnoses and free-text information. We also leveraged the model’s predictive capabilities to create a risk-stratified PSPM that aligns with VBC and improves BPMs through effective risk stratification and reduced reliance on DRGs. Future work will focus on prospective validation of our model within our institution and external validation to ensure generalizability.
