AUGS FPMRS Webinar: Clinical Prediction Models and ...
April14WebinarRecording
Video Transcription
…Falk, moderator for today's webinar. Today's webinar is Health Services Research at FPMRS, presented by Dr. Eric Jelovsek. He will present for 45 minutes, and the last 15 minutes of the webinar will be dedicated to Q&A.

Dr. Jelovsek is an Associate Professor of OB-GYN at the Duke University School of Medicine in Durham, North Carolina, where he currently serves as Vice Chair for Education and Director of the Women's Health Data Science Program. He received his MD from East Tennessee State University, completed his residency in OB-GYN at Duke University, and completed his fellowship in FPMRS at the Cleveland Clinic. He holds a Master's degree in Medical Education with Distinction from the University of Dundee and a Master's degree in Data Science from Northwestern University. Dr. Jelovsek's expertise lies in the development and validation of individualized, patient-centered prediction tools to improve patient and clinician decision-making across a variety of women's health conditions, including the risk of pelvic floor disorders after childbirth; prediction of prolapse recurrence and utility change after pelvic organ prolapse surgery; complications and urinary incontinence after pelvic organ prolapse surgery; urinary incontinence recurrence and complications after midurethral sling placement; risk of de novo stress urinary incontinence after surgery for pelvic organ prolapse; and transfusion during gynecologic surgery. He currently leads the clinical deployment of these tools in the EMR in the Department of OB-GYN at Duke. He serves as Co-Principal Investigator of the NIDDK Symptoms of Lower Urinary Tract Dysfunction Research Network (LURN) and is involved in unsupervised learning approaches such as clustering of lower urinary tract phenotypes. He also serves as an investigator in the NICHD Pelvic Floor Disorders Network, a mentor in the NIDDK Duke KURe Program, and a mentor in the NICHD AUGS/Duke UrogynCREST Program.

Before we begin, I'd like to review some housekeeping items. The webinar is being recorded and live-streamed. Please use the Q&A feature of the Zoom webinar to ask any questions, and use the chat feature if you have any technology issues. The AUGS staff will be monitoring the chat. And now we'll turn the stage over to Dr. Jelovsek.

All right, that's a mouthful. Thanks, Cynthia. Well, good evening, everyone, and welcome. I see some familiar names on the attendee list, and I'm glad you're willing to spend the evening talking about clinical prediction models. This should be fun. I'm going to provide an overview of clinical prediction models and decision-making. I'll assume that most of you have at least some interest in learning about clinical prediction models; maybe you have an idea about developing such a model, or maybe you're eager to hear about some of the controversies in clinical prediction modeling. Hopefully in this talk I'll touch on each of these. I'm going to build on the previous AUGS State of the Science lecture, so if you've heard that there will be a little repetition, which in this space is probably a good thing, and we've added some new material. So why don't we get started. I have no relevant disclosures for this talk.
The purpose of the lecture is to understand the important characteristics of clinical prediction models and where machine learning algorithms might play a role in clinical decision-making. I also hope you'll understand how to evaluate, and in particular critique, studies that claim to predict outcomes, so that we improve the quality of research and peer review in this space.

The first thing we're going to talk about is implicit versus explicit studies of prediction. Many medical papers are either implicitly or explicitly about prediction. Studies that examine prognostic factors, novel biomarkers, family history, or genetic features are often addressing their value with respect to predicting a future outcome of interest; a few examples are shown on this slide. In these types of studies, what we're used to seeing in the peer-reviewed literature are summary tables of odds ratios, risk ratios, p-values, and the like. More obvious, or explicit, examples of prediction are those that present a statistical prediction model, or some other algorithm or equation, that produces predictions for an individual patient. In these studies, what we want to see is the performance of the model and evidence that it predicts accurately in the population in which it is intended to be used. My hope for this talk is to emphasize that both implicit and explicit studies of prediction should be evaluated, at least in part, from a prediction science perspective when judging the quality of the study. In fact, most clinicians take implicit studies of prediction and translate them into explicit ones during clinical use, so it's important that they be evaluated that way. So the first question I would love for you to ask yourself when reading a study about risk factors is: should I be using a predictive science approach here? If so, then this talk is certainly for you.

One of the challenges with some traditional research studies, such as those aimed simply at identifying risk factors, is that they are not treated as clinical prediction studies when they really should be. For example, this is an excellent study of risk factors in over 54,000 women who underwent reconstructive surgery from 2010 to 2014 in the NSQIP database. The study concluded that transfusion after pelvic organ prolapse surgery is uncommon, and the variables associated with transfusion are listed in the table you can see. The authors concluded that recognition of these factors can help guide preoperative counseling regarding transfusion risk after pelvic organ prolapse surgery and individualize preoperative preparation. The problem here is that the answer to the research question does not really help the individual patient. The risk factors need to be combined in some fashion to create a model or tool that can provide individual predictions to patients. Reframing a question like this one into one aimed at building and validating a prediction model would really benefit individual patients. A more appropriate question might be: what is the most accurate model for predicting transfusion after reconstructive surgery for prolapse? Then we can comment on which risk factors are in that accurate model and the influence they might have.
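To make the reframing concrete, here is a minimal sketch, in Python with scikit-learn, of what "combining the risk factors into a model" looks like in code. The variables, coefficients, and data below are simulated stand-ins, not the NSQIP analysis itself; the point is only that the end product is an individual predicted probability rather than a table of odds ratios.

```python
# A minimal sketch (not the NSQIP analysis): combining risk factors into a
# logistic regression model so an individual predicted probability can be
# reported rather than a list of odds ratios. All data here are simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# Hypothetical risk factors for transfusion (simulated, illustrative only)
age = rng.normal(60, 12, n)
preop_hct = rng.normal(38, 4, n)          # preoperative hematocrit (%)
concomitant_hyst = rng.integers(0, 2, n)  # concomitant hysterectomy (0/1)

# Simulated outcome under an assumed relationship, just to have labels to fit
logit = -6.0 + 0.02 * age - 0.05 * (preop_hct - 38) + 0.8 * concomitant_hyst
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([age, preop_hct, concomitant_hyst])
model = LogisticRegression(max_iter=1000).fit(X, y)

# The clinically useful output: an individual patient's predicted probability
new_patient = np.array([[72, 33.0, 1]])   # age 72, hematocrit 33%, with hysterectomy
risk = model.predict_proba(new_patient)[0, 1]
print(f"Predicted probability of transfusion: {risk:.1%}")
```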
Another type of question that should be reframed concerns biomarkers, and a common one is: which biomarker has the most prognostic value? Fundamentally, whether the new marker is, say, the most important factor is not the most relevant clinical issue. The real question is: does the new marker contribute to our ability to predict the patient's outcome? What we really want to know is the patient's risk with and without the biomarker information, so having a model with and without the biomarker in it is useful for comparison, to see how much value is added. Shown here is another excellent systematic review, from the NIDDK's LURN network, looking at biomarkers. They identified multiple challenges in the literature in this area, and there are currently no individual biomarkers that are valid as diagnostic tools. Many of the studies in this review would have benefited from reframing the question through this predictive science lens, since an important question for clinicians, and definitely for individual patients, is: does knowing the value of any of these markers help predict what the patient's outcome is going to be? If it does, we should consider whether to routinely measure it, regardless of where it ranks on a list of other biomarkers, because it would directly help the patient by telling us the risk she faces.

So the first major point is this: it would be more directly beneficial to reframe implicit studies of prediction, those that identify and rank risk factors or prognostic factors, as questions about the construction and validation of a statistical prediction model that can be applied straightforwardly to the individual patient.

All right, let's switch gears a little and talk about prediction and how it relates to medical decision-making. At the foundation of statistical prediction lie the basic concepts of medical decision-making, and you have to understand both, so it's worth stepping back and re-emphasizing that clinical prediction is really at the heart of what we do every day in medical decision-making. For example, a surgeon choosing a surgical plan for next month is really picking the procedure we believe will have the best predicted outcome. If we agree on that principle, then it's also important to consider how we optimize decision-making. Optimal decision-making involves several things. It makes full use of any available data we have at hand. It is forward in time and in information flow; in other words, we're looking ahead at what's going to happen, not looking back like a diagnostic test, and there is a stochastic, or random, nature to what will happen in the future. It needs to use forward probabilities in that sense. And ultimately, when we make a decision, we're really applying a loss, cost, or utility function to the decision, for example one that minimizes the expected loss or maximizes the expected utility. This last point is very important when building prediction models.
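Circling back to the biomarker question for a moment, below is a minimal sketch of the "model with and without the biomarker" comparison described above, using simulated data and cross-validated AUC as the yardstick. As discussed later in the talk, a proper score such as the log loss, or a likelihood ratio test for nested models, would be an even better basis for this comparison.

```python
# A minimal sketch of the "with and without the biomarker" comparison:
# fit the same model twice and ask how much the biomarker adds to
# predictive performance. All data are simulated for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 2000
clinical = rng.normal(size=(n, 3))     # existing clinical predictors
biomarker = rng.normal(size=n)         # hypothetical new biomarker
logit = clinical @ np.array([0.6, -0.4, 0.3]) + 0.5 * biomarker
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_base = clinical
X_plus = np.column_stack([clinical, biomarker])

# Cross-validated AUC with and without the biomarker
auc_base = cross_val_score(LogisticRegression(max_iter=1000), X_base, y,
                           cv=5, scoring="roc_auc").mean()
auc_plus = cross_val_score(LogisticRegression(max_iter=1000), X_plus, y,
                           cv=5, scoring="roc_auc").mean()
print(f"AUC without biomarker: {auc_base:.3f}")
print(f"AUC with biomarker:    {auc_plus:.3f}")
```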
Very few outcomes in medicine, other than death, are definitive yes-or-no events. Many of the outcomes, especially in our field, are really on a continuum of "yes, maybe, a little bit," and we need to keep that in mind. Therefore, what we really need to predict is not whether someone will have an outcome or not, which is classification; what we want is the probability that one may have the outcome. That probabilistic prediction is then taken into account to help us minimize loss or maximize utility, for example by minimizing an adverse event or minimizing the cost of care. To really understand clinical prediction, clinicians need a firm grasp not only of decision-making and how it links to prediction, but also of probability itself. Unfortunately, clinicians generally are not taught enough about how to apply probability in practice, and this can limit their ability to link research findings to clinical practice.

That said, we unconsciously use probability all the time, but based on our training we sometimes shoot ourselves in the foot in an effort to simplify our clinical decisions. For example, many good researchers and clinicians in our field say we need a hard and fast rule so we know how to diagnose or treat patients, because the decision is either do it or don't do it: we need a hard cutoff in POP-Q stage, or specific POP-Q values, before we act, or a cutoff at the minimally important difference on the UDI scale, or a cutoff in the percent reduction in incontinence episodes. What you're hearing me do is list common outcome measures used in our research that are, in fact, artificial dichotomizations that in many ways don't reflect actual patient care; they're decisions made so that we can do the study. We also say that because we either treat or don't treat the patient, we don't want to consider the probability of disease; we want a simple classification rule. When we do our research this way, the clinician ends up influencing the statistician to use what ultimately become inefficient, arbitrary methods of categorization, stratification, or matching.

It's also important to distinguish between prediction and classification. Here we use a hypothetical surgical site infection model, which we have been evaluating here at Duke. Suppose we have a model that can predict the probability of a surgical site infection after urogynecologic surgery, and the predicted probability for a patient is, say, 10 percent. One possible decision is the choice to give some intervention after surgery, say additional antibiotics; again, this is hypothetical. We could just as easily consider other interventions, such as a wound vac for an abdominal case, education on preventing infection, or early follow-up in the clinic to prevent readmission. So we could make a number of different decisions based on that probability. What happens is that the care team is weighing a number of costs and utilities involved in the decision of whether to intervene, and those costs and utilities are not modeled by the prediction tool.
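As a toy illustration of the point just made, the sketch below combines a predicted probability with costs and utilities that live entirely outside the model, and contrasts that with a tool that has a decision threshold baked in (the "deterministic" case taken up next). All of the numbers are hypothetical assumptions chosen for illustration, not recommendations.

```python
# A minimal sketch: the model supplies only a probability; the decision
# (e.g., give additional antibiotics) comes from combining that probability
# with costs/utilities the model knows nothing about. Numbers are illustrative.

p_ssi = 0.10   # model's predicted probability of surgical site infection

# Hypothetical harms/costs on an arbitrary scale (assumptions, not data)
harm_untreated_infection = 100.0  # harm if infection occurs without intervention
harm_treated_infection = 40.0     # harm if infection occurs despite intervention
harm_of_intervention = 3.0        # cost/side effects of intervening in everyone

expected_harm_no_intervention = p_ssi * harm_untreated_infection
expected_harm_intervention = p_ssi * harm_treated_infection + harm_of_intervention

print(f"Expected harm, no intervention: {expected_harm_no_intervention:.1f}")
print(f"Expected harm, intervention:    {expected_harm_intervention:.1f}")
print("Intervene" if expected_harm_intervention < expected_harm_no_intervention
      else "Do not intervene")

# Contrast with a deterministic rule baked into the tool: "intervene if p > 5%".
# The fixed threshold silently encodes one particular cost ratio; if this were a
# frail patient for whom the intervention itself is riskier, the expected-harm
# calculation above could flip, but the thresholded rule would not.
print("Deterministic rule says:",
      "Intervene" if p_ssi > 0.05 else "Do not intervene")
```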
Those costs and utilities sit outside the realm of the prediction tool, and so this is what we would consider a probabilistic model: the prediction does not usurp the decision maker. In other words, the decision maker is allowed to incorporate outside information, information not included in the modeling process, into making the decision. That's a probabilistic model.

Now let's contrast this with the same patient and the same model, except that instead of simply providing a probability of surgical site infection, the tool also provides guidance on what to do. To do this, we need a threshold; let's hypothetically say a probability greater than five percent leads the tool to say "give antibiotics," and a probability of five percent or less leads it to say "don't give antibiotics." This is what we would call a deterministic model. At first glance this appeals to a lot of clinicians, and it sounds like what we want, but there are several reasons why this classification situation should probably be avoided. In many clinical decision contexts, classification represents a premature decision, because classification combines prediction and decision-making: it trumps, or usurps, the decision maker and specifies the cost of wrong decisions. One may think the decision maker is not really removed, but it actually creates an environment, and importantly documentation, of opposing views whenever the clinician ultimately decides to go against the tool. The classification rule also has to be reformulated whenever costs, utilities, or sampling criteria change, whereas predictions are separate from decisions and can be used by the decision maker.

Classification is best used when you have non-stochastic, deterministic outcomes that occur very frequently, and not when two individuals with identical inputs, identical surgeries, identical risk, can easily have very different outcomes. For the latter, our modeling should be probabilistic. This is a key point: classification should be reserved for situations where outcomes are distinct and the predictors are strong enough that, for essentially all subjects, the probability of one of the outcomes is near one.

All right, let's switch gears again and briefly go over the types of prediction models, because they can be presented in very different ways: identifying risk factors, counting risk factors, assigning integer points to risk factors, decision trees, nomograms, and models incorporated into an online calculator; and of course, what we currently have is our best clinical judgment. Identifying risk factors is common: researchers do a univariable or multivariable analysis, perhaps mistakenly use a stepwise variable selection method, and end up with a list of statistically significant predictors, say with p-values less than 0.05 or 0.1. These variables are typically presented as odds ratios, hazard ratios, or other coefficients, perhaps with their p-values. The most severe weakness of this approach, as we've already discussed with the NSQIP transfusion example, is that a list of risk factors is not actionable. One cannot easily calculate a patient's risk of an outcome from a list.
Furthermore, the predictive performance of a list cannot be rigorously tested. Given these limitations, a stronger approach should usually be taken: a better method is to assess how much predictive performance decreases when a risk factor is removed from a model containing all of the risk factors.

Some studies determine outcome risk by counting risk factors once they have been identified. This is an extremely crude way to determine risk, and it's honestly probably not very accurate when rigorously evaluated. First, a continuous variable has to be categorized in order to count it as present or absent. Second, by counting risk factors, they all get the same weight, which is unlikely to be optimal. Generally, this approach should be avoided.

A related approach, and maybe a modest improvement, is to assign integer points to the risk factors. This is the same as counting risk factors, except the factors no longer all get the same weight. For example, in the Caprini risk model for venous thromboembolism, an age between 41 and 60 earns one point, while an age between 61 and 74 earns two points. Similarly, minor surgery or an operative time under 45 minutes counts as one point, while more than 45 minutes counts as two points. We know as surgeons that nothing really happens between minute 44 and minute 45, but this is just how it's done, and what you get is a jump in the risk. Although this certainly overcomes the previous approach's assignment of the same weight to every factor, it's unlikely that the optimal weight is exactly one or two; the true weight might be 1.3 or 2.2 and so forth. The integer rounding effectively loses predictive information, which ultimately reduces accuracy, and the categorization of continuous variables, such as age or operative time here, is a further limitation of this approach. But you'll see it a lot.

Decision trees have long been popular, and we all know what a decision tree is. The patient's information is entered into a tree through a series of questions, such as the presence or absence of a factor, and ultimately the patient is classified into a risk group such as low, moderate, or high. They are very easy to use; they can be memorized or put on the back of a badge card. The problems are most pronounced in the high-risk group. A patient may barely qualify as high-risk or may have every single risk factor, and as a result some patients in the high-risk group are really at relatively low risk while others are at very high risk. So the decision tree tends to have high heterogeneity in the highest-risk group. The decision tree works reasonably well at a group level, if you're trying to triage everyone the same way and you're not worried about the tradeoff cost of treating or testing everyone.
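Circling back to the integer-point idea for a moment, here is a minimal sketch of why rounding weights to whole numbers and categorizing continuous variables loses information. The point cutoffs echo the Caprini-style values mentioned above, but the weights and the comparison model are illustrative assumptions only.

```python
# A minimal sketch (illustrative, not the actual Caprini model) of why integer
# point scores lose information: true weights are rarely whole numbers, and
# categorizing continuous variables creates artificial jumps in risk.
def integer_point_score(age, op_time_min):
    points = 0
    points += 1 if 41 <= age <= 60 else (2 if 61 <= age <= 74 else 0)
    points += 1 if op_time_min < 45 else 2   # illustrative cutoff from the talk
    return points

def continuous_linear_predictor(age, op_time_min):
    # Hypothetical non-integer weights applied to the un-categorized variables
    return 0.03 * age + 0.011 * op_time_min

# Two nearly identical patients: the point score jumps from 2 to 4, while the
# continuous linear predictor barely moves.
for age, op_time in [(60, 44), (61, 46)]:
    print(f"age {age}, op time {op_time} min: "
          f"points = {integer_point_score(age, op_time)}, "
          f"linear predictor = {continuous_linear_predictor(age, op_time):.2f}")
```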
For an individual patient, though, making a decision within the same risk group, for example, should I have surgery for my prostate cancer or should I watch or treat medically, that highest-risk group is quite heterogeneous. This is exactly what we see with some of the VTE decision trees: when you look at the risk categories, there's a significant amount of heterogeneity within the risk groups. For example, a person in the highest-risk category has a risk ranging anywhere from 40 to 80 percent for VTE, and that's a wide range. Should the patient with a 40 percent chance be treated differently than the one with an 80 percent chance? Maybe for this condition, no; for another condition, yes. If that's cancer, it's a big difference, so you have to interpret it in the context of the disease process you're talking about. In this study, the authors concluded that the Caprini model accurately identified the gynecologic oncology patients at highest risk of experiencing VTE, and they're correct, but that is because the model predicted that almost all of the patients would be at high risk; in fact, 92 percent were classified as high risk. One way to make your model perform well in one category is to train it to put everyone in it. So in essence this model has high classification accuracy at the highest risk, but it sacrifices calibration across the low- and moderate-risk groups.

Another example, from prostate cancer, shows what happens when you categorize patients into groups. Each X on the figure is a patient, stratified into one of three prostate cancer risk categories, and the y-axis shows the predicted probability from a model. Notice the considerable overlap of outcomes between the intermediate- and high-risk groups, suggesting that simply estimating disease recurrence by stratification may not be sufficient for predicting outcomes among individuals. Again, decent at a group level, but problematic for individuals: if you're predicted to be in the high-risk category, look at your actual probability range; it runs from zero to one. Clearly, something is sacrificed by putting people into the high-risk group.

Nomograms are highly attractive for displaying outcome risk in paper form, so if you're going to publish a paper, it's often very useful for readers to see one. They are graphical depictions of a regression model and can be produced for logistic, linear, and Cox regression models, among others. They overcome the challenges of many of the alternatives: they don't force identical integer weights on the factors, and they don't require continuous variables to be dichotomized or categorized. In fact, the effects of continuous variables can be represented very well in a nomogram, even when those effects are nonlinear. For these reasons, a nomogram is good for depicting a regression model and is really the preferred way to present a paper-based statistical prediction model. For practical reasons, the online calculator is highly attractive. Calculators are in widespread use in a lot of different areas, and as long as your model is well described and linked to the peer-reviewed literature, this should really be the preferred means of risk calculation for individual patients.
Risk calculators can handle very sophisticated equations and, at this point, are likely to provide more accurate predictions for an individual patient. They can be simple or complex: the online calculator on the left predicts one outcome, transfusion after gynecologic surgery; the calculator on the right contains 28 different models for 28 outcomes, including recurrence rates, complications, and health status after prolapse surgery. So they can be quite sophisticated.

Okay, let's take a couple more learning points that are worth highlighting. The first is that when you're developing a prediction model, the first thing to notice is that a prediction model begs for a comparison. It's often useful to compare a new statistical prediction model against the performance of something like clinical judgment for the same outcome. The reason is that a clinician, or a group of clinicians, may believe their judgment predicts the outcome of interest more accurately than a model, and a head-to-head comparison on neutral data is really the only way to know for sure which predicts better. Many studies have compared these types of models against individual human judgment, groups of clinicians, and experts, with each receiving an identical data set and providing predictions, and those studies overwhelmingly favor statistical prediction models. The figure on the upper right shows the discrimination comparison between the PFDN de novo SUI model, the PFDN investigators, and the stress test result alone. The lower right demonstrates how much heterogeneity there is in the expert predictions: each plot is one expert's predictions, and you can see they're all over the place. This is one of the reasons models usually outperform experts. The important take-home point is that any new statistical prediction model begs for this comparison, so the first question to ask is: what is being used currently? Is it an existing model, clinical judgment, a risk-grouping system from a guideline, or a one-size-fits-all estimate? This is one of the first questions we discuss when developing a new model: what do clinicians use now to make the same decision?

The second point is that you clearly want to know whether the model is performing as it should. Many metrics exist for scoring the performance of statistical prediction models, and they've been thoroughly reviewed elsewhere, so I won't go through all the details here, but you should be familiar with a few of them. Briefly, the prevailing school of thought is that accuracy is composed of two components: discrimination and calibration. Discrimination is measured by the area under the receiver operating characteristic curve, sometimes known as the C-index or concordance index. The concordance index is defined as the probability that a randomly drawn positive case will have a higher predicted probability than a randomly drawn negative control. I will note here that the C-index is good for describing the predictive discrimination of a single model, but it is not sensitive enough for comparing two models. That's an important note.
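Here is a minimal sketch of the concordance index computed directly from that definition, on toy predictions, along with scikit-learn's AUC for comparison; for a binary outcome the two coincide.

```python
# A minimal sketch of the concordance (C) index as defined above: the probability
# that a randomly drawn positive case receives a higher predicted probability
# than a randomly drawn negative case. Toy data only.
import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([0, 0, 1, 0, 1, 1, 0, 1])                    # observed outcomes
p = np.array([0.1, 0.3, 0.35, 0.2, 0.8, 0.6, 0.5, 0.4])   # predicted probabilities

pos, neg = p[y == 1], p[y == 0]
pairs = [(pp, nn) for pp in pos for nn in neg]
concordant = sum(pp > nn for pp, nn in pairs)
ties = sum(pp == nn for pp, nn in pairs)
c_index = (concordant + 0.5 * ties) / len(pairs)
print(f"C-index: {c_index:.3f}")

# For a binary outcome this equals the area under the ROC curve:
print(f"AUC:     {roc_auc_score(y, p):.3f}")
```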
For comparing models, you really need a more proper accuracy score, such as a likelihood ratio chi-square statistic, and there are several others you can use. It is better to use proper accuracy scores: the metrics I presented before are the commonly seen ones, and reviewers will almost always ask for them, but it's actually better to use a proper accuracy score. A proper accuracy scoring rule for prediction is a metric applied to the probability forecast itself. What you want is a continuous accuracy scoring rule, a metric that makes full use of the entire range of predicted probabilities and does not have large jumps in response to infinitesimal changes in predicted probability, which is what the C-index can do. The two most common proper accuracy scores are the Brier score and the logarithmic accuracy score; for continuous outcomes, the analogue is simply the mean absolute error. What you'll see us try to do is present all of those metrics in a report.

Calibration is typically assessed graphically with a plot of predicted probability versus observed proportion. In the right figure, the x-axis is the predicted probability from the model and the y-axis is the actual probability of the outcome, and a perfect model results in a 45-degree line. When the model is over-predicting, the line falls below the 45-degree line, which is the opposite of what you might expect, and when the model is under-predicting, the points fall above the 45-degree line. In this example, you can see the model is slightly over-predicting risk once you get to around a 50 to 60 percent predicted probability.

Systematic reviews of the literature have shown that calibration is assessed far less often than discrimination, and this is extremely problematic, because a poorly calibrated model can make predictions very misleading, especially for clinical use. This image demonstrates the performance of a single model at multiple sites across the country; we're looking at the calibration of the model at every site, each line represents a site, and the light gray line is perfect calibration. The model's overall area under the curve was over 0.8, which most people would agree indicates a reasonably good discriminatory model. But the figure shows that model performance differs a lot by site: for most sites the model over-predicts at low risk and under-predicts at high risk, and there are clearly several sites, which you can see by the lines, where the model is not performing well at all. Yet overall the C-statistic is above 0.8. So when you're reviewing a paper on a prediction model with a dichotomous outcome, you should demand the calibration plot, and even for a continuous outcome you should ask for predicted versus actual values. Demanding a calibration plot when you're evaluating a prediction model can be very informative about performance. Why? Because a poorly calibrated model makes the algorithm less useful than a competitor algorithm that may in fact have a lower area under the curve, or C-statistic, but is very well calibrated. These illustrations show different types of miscalibration: every one of these lines is a model with a C-statistic of 0.71, but they all have very different forms of calibration.
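A minimal sketch of these ideas in code, using simulated data: the Brier score and logarithmic loss as proper accuracy scores, and scikit-learn's calibration_curve to produce the kind of calibration plot just described. The simulated model deliberately over-predicts, so its curve falls below the 45-degree line.

```python
# A minimal sketch of proper scores and a calibration plot on simulated data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss, log_loss

rng = np.random.default_rng(2)
n = 5000
p_true = rng.uniform(0.01, 0.6, n)       # hypothetical true risks
y = rng.binomial(1, p_true)              # simulated outcomes
p_model = np.clip(p_true * 1.3, 0, 1)    # a deliberately over-predicting model

print(f"Brier score:      {brier_score_loss(y, p_model):.4f}")
print(f"Logarithmic loss: {log_loss(y, p_model):.4f}")

# Bin predicted probabilities and compare with observed event rates
obs, pred = calibration_curve(y, p_model, n_bins=10, strategy="quantile")
plt.plot(pred, obs, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("Predicted probability")
plt.ylabel("Observed proportion")
plt.legend()
plt.show()
```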
And this matters. The left figure in those illustrations shows general over- or under-estimation of predicted risk, as we discussed: the line below the diagonal is over-predicting and the line above is under-predicting. The right figure shows predicted risks that are too extreme or not extreme enough. These issues become very important when you, as a clinician, sit down to make a decision about a patient, because the fact that a model has an area under the curve of 0.71 tells you only part of the story; these different models are badly miscalibrated in certain regions of predicted risk. The implication is that when using predicted risks, half of the story is about calibration, not just discrimination.

Now, there are certainly some limitations to prediction model approaches. Studies have shown that many adults have difficulty interpreting percentages, so a better way to convey risk is by framing a prediction as the number out of 100 or 1,000 who will experience the event. Better still is to use an icon array, a basic version of which is shown here: 23 out of 100 women with these characteristics are predicted to require short-term use of a catheter for urinary retention after Botox. This is some work done by Whitney Hendrickson, one of our fellows, who built a prediction model from trial data for Botox, and it shows the predicted risk of needing a catheter. Second, prediction models have been criticized for not conveying uncertainty around the predicted probabilities, and we often hear people ask for a 95% confidence interval around the prediction, but confidence intervals are not really relevant to the individual patient. That being said, a model built from a small data set with few events will certainly be less reliable than one built from a larger sample. The other downside, of course, is that a computer might be needed; if a computer is unavailable, then a nomogram should be used rather than counting risk factors. An extra few seconds for a more accurate prediction from a computer seems worthwhile, but it is a tradeoff, particularly in a low-resource setting.

Here's an example of a model incorporated into our EHR. The two SmartPhrases on the left are called by the physician writing their note, and the model calculates the risk of transfusion. On the right, if you want to see what is driving the model, you can see the factors in the model and how the final score equation is calculated. Being transparent about models with the clinicians who use them in the clinical realm is an important part of deployment. Another part is ensuring that the model continues to perform as expected over time. Using data from the EHR, we can monitor the predictions and the outcomes for all patients over time and make adjustments, and adjustments will be needed: model drift is a very real phenomenon. Before models are deployed long-term in the electronic records of an institution, it's highly recommended to have a monitoring plan in place.
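Before turning to machine learning, here is a minimal sketch of the kind of over-time monitoring just described: scoring each month's predictions against observed outcomes and watching for drift. The data, drift pattern, and monthly grouping are simulated assumptions; in practice the predictions and outcomes would be pulled from the EHR.

```python
# A minimal sketch of model monitoring: score each month's predictions against
# observed outcomes and watch for drift. All data below are simulated.
import numpy as np
import pandas as pd
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(3)
months = pd.date_range("2020-01-01", periods=12, freq="MS")
rows = []
for i, month in enumerate(months):
    n = 300
    p = rng.uniform(0.02, 0.4, n)                       # the model's predictions
    y = rng.binomial(1, np.clip(p + 0.01 * i, 0, 1))    # outcomes with gradual drift
    rows.append({"month": month,
                 "brier": brier_score_loss(y, p),
                 "auc": roc_auc_score(y, p),
                 "mean_predicted": p.mean(),
                 "observed_rate": y.mean()})

report = pd.DataFrame(rows)
print(report.round(3))
# A persistent gap between mean_predicted and observed_rate (calibration in the
# large), a rising Brier score, or a falling AUC would trigger recalibration.
```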
All right, so what about other approaches, such as machine learning? Everyone's probably familiar with the latest trend of using more complex machine learning models in place of statistical models. The message is that more complex models do not necessarily mean better accuracy. Traditional regression models should really be the default approach; they're easier to interpret and understand, and more complex models should be compared against them. So far, there's a lack of evidence supporting most machine learning models over traditional regression, and traditional models are easier to interpret. A recent paper showed that boosted trees might be about as good as statistical prediction models in situations with many samples and a low number of predictors, which gives credence to continuing to use traditional statistical models as baseline comparisons when testing more advanced models. This is a recent systematic review of studies comparing logistic regression and machine learning models. The review found the evidence incomplete and unclear; as I mentioned before, calibration was seldom examined. But when you look at the areas under the curve, or C-statistics, comparing regression to machine learning, on average there was no difference among the comparisons at low risk of bias, which is exactly the type of comparison you want for a clinical prediction model.

All right, what do we mean by validation? In general, it's essential that a statistical prediction model be assessed for its ability to predict in future samples. The key is that the performance of the model must be evaluated on data that were not used to build the model. This can be done internally, using the same source of data used to build the model, or externally, using data from a different source. There are many variations on internal validation, such as temporal validation, which splits the cohort by time; cross-validation, which is shown in the lower-left figure; and bootstrapping, which tends to be the more robust approach in our setting. Essentially, for five-fold cross-validation, you split your sample into five groups, and each round has a training set and a test set. You build the model on the training portion, shown in blue, and test it on the held-out portion, shown in orange. Then you make another split, train again on the blue portion, and test on the orange portion, and you repeat the process through all of the folds. Finally, you average across the folds to get a performance estimate. That's five-fold cross-validation; you can do ten folds, or another number of folds, as well. Bootstrapping, in contrast, repeatedly resamples the data set with replacement: you rebuild the model on each bootstrap sample, evaluate it, for example on the observations left out of that sample or on the original data, and repeat many times. In essence, these approaches give you a training set and a test set from the same data, so the model is always evaluated on data it did not see during training.
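A minimal sketch of these validation ideas on simulated data: five-fold cross-validated AUC for a logistic regression and a boosted-tree model on the same data, plus a simple bootstrap in which the model is refit on samples drawn with replacement and scored on the left-out observations. A real comparison should also examine calibration and proper accuracy scores, as discussed above.

```python
# A minimal sketch of internal validation and a regression-vs-ML comparison.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.utils import resample

# Simulated cohort with a rare-ish outcome (~15% event rate)
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.85], random_state=0)

lr = LogisticRegression(max_iter=1000)
gbm = GradientBoostingClassifier(random_state=0)
print("Logistic regression, 5-fold CV AUC:",
      cross_val_score(lr, X, y, cv=5, scoring="roc_auc").mean().round(3))
print("Gradient boosting,   5-fold CV AUC:",
      cross_val_score(gbm, X, y, cv=5, scoring="roc_auc").mean().round(3))

# Bootstrap: refit on samples drawn with replacement, score on left-out rows
aucs = []
idx = np.arange(len(y))
for b in range(50):
    boot = resample(idx, replace=True, random_state=b)
    test = np.setdiff1d(idx, boot)
    model = LogisticRegression(max_iter=1000).fit(X[boot], y[boot])
    aucs.append(roc_auc_score(y[test], model.predict_proba(X[test])[:, 1]))
print("Bootstrap out-of-sample AUC:", np.mean(aucs).round(3))
```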
Finally, one of the major contributions in this space has been the TRIPOD statement, and you should know about it. It's a checklist of items that ought to be reported when publishing a model. Whenever I get a paper to review, I can't count how many times my first comment is that if the investigators had simply followed the reporting guidelines, their paper would be much, much better. So I encourage people to use it like any reporting guideline: it ensures you're reporting the right things and that you've done a sufficient job. It's not a strict requirement, but the guidelines were put together by experts in this area and are reasonably good to follow. There is a separate TRIPOD for abstracts, and it's important to understand that there are two different checklists; the TRIPOD for abstracts is a succinct guideline on what to report for models, so if you haven't seen them, take a look. The other reference is the 2016 American Joint Committee on Cancer checklist, which also has its own requirements for models.

So far, we've covered several important contemporary challenges in clinical prediction modeling and its relation to medical decision-making. We've talked about the importance of reframing risk studies through a predictive lens, and we've emphasized how to improve the quality of our reporting and what to look for in peer review of clinical prediction models. But there are two emerging, admittedly still somewhat fringe, areas worth mentioning: the relationship between prediction algorithms and regulation, and the relationship between clinical prediction tools and health disparities.

As an FYI, the FDA is actively addressing how it might regulate computerized decision aids. These algorithms, not necessarily in urogynecology but in other areas of medicine, are embedded in a lot of things we use: in imaging, in lab testing, and in devices such as EKGs and even neuromodulation devices. The FDA has issued some guidance documents in this area, which are listed here, and this is prompting some health systems to develop guidance and oversight around the use of algorithms in their systems. I'm not going to go into the details, but it's worth placing on your radar: if you have an interest in any of these areas, be familiar with these documents and with the people who can help you navigate them.

Now, the machine learning community, both within and outside medicine, has certainly become alert to the ways predictive algorithms can inadvertently introduce unfairness into decision-making, and it's important to be aware of this, since these issues can generate very emotionally charged discussions; we sometimes see them in the press, especially with COVID more recently. Central to this discussion, and maybe an important point to think about, is the distinction between two concepts: algorithmic fairness, that is, whether the model is being fair, and algorithmic bias. Fairness concerns apply when an algorithm is used to support what we call a polar decision.
In other words, one pole of the prediction leads to a decision that is generally more desired by an individual than the other, such as when predictions are used to allocate a scarce health resource, for example a COVID vaccine or an organ transplant, to a group of patients when everyone could benefit from it but only that group will receive it. That is, in essence, a polar decision. So understanding the specific decision context, and again I'm circling back to why it's important to understand clinical prediction and decision-making, because they are intricately linked and play a huge role in these discussions of bias and fairness in algorithms, understanding the decision context for a prediction model in healthcare is necessary to anticipate unfairness.

In the medical context, particularly the shared decision-making context we often use, patients and physicians usually share a common goal of an accurate prediction, so that we can balance benefits, risks, and harms for the individual. Think of our de novo SUI model for prolapse surgery: both the patient and the clinician have the same goal, a good shared decision-making conversation, and we both want the same thing, for the patient to be dry. Predictions supporting this type of context are described as non-polar, as shown in panel A of the figure. On the other hand, when one pole is associated with a clear benefit or a clear harm, the prediction is described as polar. In a polar prediction, the decision maker's interest, perhaps the clinician's but possibly another decision maker's, lies in making an efficient decision, and it is not necessarily aligned with the patient's interest in having a higher or lower prediction. A higher prediction might mean I'm eligible to receive a COVID vaccine and you're not, or it might drive the micro-allocation of transplanted organs; conversely, I might want a lower prediction when being screened for substance abuse. Positively polar predictions are those in which patients have an interest in being ranked high, in order to receive a service that may be available only to those who can benefit; that's panel B. A negatively polar prediction, by contrast, is used to target an intervention perceived as punitive or coercive, such as involuntary commitment or screening for child abuse, as shown in panel C. The bottom line is that these issues of fairness pertain specifically to predictions used in a decision context, and when a decision context induces predictive polarity, they are important to highlight and discuss.

So what do we do about it if we're building models? The jury is still out, but there is some early guidance, and the main thing is that we want to mitigate bias and unfairness in our evaluation process. Recent work has suggested that for models that will be used as decision aids for balancing harms and benefits, where the clinician or decision maker and the patient are aligned, as is mostly the case in urogynecology, we should, first, use a representative sample for model development, which we've known for a while, and second, evaluate performance overall and within each race or gender class, where that applies.
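A minimal sketch of that subgroup evaluation, on entirely simulated data with hypothetical group labels: fit the model once, then examine discrimination, the Brier score, and calibration in the large separately within each group.

```python
# A minimal sketch of subgroup evaluation: check performance within each group.
# All data and group labels are simulated for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n = 6000
X = rng.normal(size=(n, 4))
group = rng.choice(["A", "B"], size=n, p=[0.7, 0.3])   # hypothetical subgroups
logit = X @ np.array([0.8, -0.5, 0.4, 0.2]) - 1.0
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
    X, y, group, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_te = model.predict_proba(X_te)[:, 1]

for g in ["A", "B"]:
    m = g_te == g
    print(f"group {g}: AUC={roc_auc_score(y_te[m], p_te[m]):.3f}  "
          f"Brier={brier_score_loss(y_te[m], p_te[m]):.3f}  "
          f"mean predicted={p_te[m].mean():.3f}  observed={y_te[m].mean():.3f}")
```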
If there's no evidence of subgroup invalidity in that evaluation, then we consider labeling bias as a potential source of bias. If there is evidence of invalidity, say we built a model and it was poorly calibrated in a minority race, then we may want to consider interaction terms or a stratified model, and of course we would consider sampling bias, as we would anyway. For models that are used in rationing care, the approach is slightly different: one suggested approach is simply to restrict the inputs of the model and make it race-unaware, in other words, not to include race in the model, because a polar decision emerges from it. Other approaches suggest using different decision thresholds or applying fairness constraints. This is early work, and I suspect more information will be forthcoming across a variety of disciplines.

All right, I'll end the presentation here. This is probably about what the sky looks like here right now; this is the North Carolina Outer Banks. Thanks for your attention. I hope this was interesting, and I look forward to your questions.

Well, thank you. I think we have a little time left, maybe five or six minutes, for some questions. For anyone who has questions, please submit them in the Q&A, and I can see them. Actually, while we're waiting for people to put in questions, I have one, because you mentioned this for those people who may be interested in getting involved in this work: who should we be looking to partner with at our own institution, if we don't have somebody with your skill set, to try to do some of this?

Well, that's a good question. In my experience working with individuals at Duke, your biostatisticians may have some background, although don't be surprised if their knowledge base is not as deep as you might want; I've encountered a high degree of heterogeneity in biostatisticians' knowledge of prediction work. So the key is to talk to a lot of people, the chair, the leadership, and ask: who's really doing this work? Who's been doing it for a while? And contact them; that's pretty much the same as for any nuanced statistical project. Sometimes you can find people trained in informatics, computer science, or one of the engineering disciplines who can help. A lot of engineers are now doing this type of work outside of medicine, and they may have the experience; they'll definitely need some context around the clinical pieces, but they often have the knowledge and the programming skill set to do a lot of this, and the terminology is very familiar to them. So I would typically start with a biostatistician and clinicians who may have done some work in the space, and if you're not getting anywhere with that, reach out to other resources at your university, such as engineering, computer science, and informatics.

All right, thank you. We have a question here from Dr. Robeson: does the UrogynCREST Program address developing prediction modeling skills?
Is that something someone can get experience in via that route? Yeah, that's a good question, Elizabeth. The short answer is yes. In our first cohort of fellows, we had several people interested in that, and I mentored one, David Schein at Case Western, whose sole project was a prediction project, and he submitted that to AUGS, actually, so we'll see how that works out. So yes, if you have ideas about prediction modeling work, particularly using the large data sets that the UrogynCREST Program offers, I would encourage you to do that. And if you don't yet have ideas about what the UrogynCREST Program offers, I would encourage you to apply and think about it, because we can certainly help you in that process; that's no problem. Now, the depth and level at which you can do the prediction modeling work yourself really depends on your motivation to learn coding. If you're willing to learn how to code and put the work in, you can do it. The statistical knowledge, and even the basic statistical packages in SAS and Stata, are sufficient for some of it, but in-depth programming certainly helps in many of these areas. I know our fellows often want to do it, and they ultimately put the time in and learn to do some hard coding, which is a steep hill to climb in a three-year clinical fellowship, so we're proud of them when they get there. But in a two-year UrogynCREST program you could do it; you just have to be motivated, and we can help you accomplish it either way.

That sounds like a lot of hard work, for sure, doing that coding. So if there are no other questions, we're heading to the end of our hour. On behalf of AUGS, I'd like to thank Dr. Jelovsek and everyone for joining us today. The next FPMRS webinar will be Wednesday, May 12th at 7 p.m.; please look to the AUGS website to sign up.

Yeah, and I would just say, if anyone has any specific questions or project-related ideas, feel free to shoot me an email. I really don't mind helping people who want to learn a little bit about this; feel free to send it our way. Thank you. Thank you, everybody. Have a good evening. All right, good night.
Video Summary
Dr. Eric Jelovsek, an Associate Professor of OB-GYN at Duke University, presented a webinar on Health Services Research in FPMRS. He discussed the development and validation of patient-centered prediction tools for various women's health conditions, including pelvic floor disorders and urinary incontinence. Dr. Jelovsek highlighted the importance of clinical prediction models in improving patient and clinician decision-making and emphasized the need for accurate, well-calibrated models. He also discussed the challenges and limitations of different types of prediction models, such as risk factor identification, counting risk factors, decision trees, and nomograms. Dr. Jelovsek stressed the significance of validation in assessing the performance of prediction models and reviewed the TRIPOD statement, which provides guidelines for reporting a model's development and validation. He also touched on emerging issues related to the regulation of prediction algorithms and the potential for algorithmic bias and unfairness in decision-making. Dr. Jelovsek concluded the webinar by encouraging clinicians to consider incorporating prediction models into their practice to improve patient care and shared decision-making.
Keywords
Dr. Eric Jelovsek
patient-centered prediction tools
women's health conditions
clinical prediction models
validation
TRIPOD statement
algorithmic bias
decision-making
patient care