Opinions are sharply divided about secondary endpoints. Many believe that they are largely worthless (“hypothesis-generating only”). Meanwhile, others are willing to use them to guide management.
This post will attempt to create a rough framework for analyzing secondary endpoints. This is primarily intended as a springboard for debate, rather than a final answer to this thorny issue (one which has remained unresolved for decades).
Secondary endpoints aren't for hypothesis generation only
It is widely believed that secondary endpoints are good only for hypothesis generation. This belief can be disproven as follows. Safety endpoints are secondary endpoints. If a study shows, as a secondary endpoint, that a given intervention causes harm, that finding should be taken seriously (and it often affects management). Thus, the concept that secondary endpoints should never affect management is invalid.
The ugly truth is that data quality isn't an all-or-none phenomenon. Instead, there exists a continuous gradation between low-quality and high-quality evidence. The primary endpoint of a multicenter RCT is the highest quality, but this is a rare treat in critical care. In practice, we are often forced to guide our clinical practice with evidence of intermediate quality.
This post will attempt to flesh out what I will call a “major secondary endpoint.” This is undoubtedly inferior to a primary endpoint. However, a major secondary endpoint might provide a fair quality of evidence when interpreted judiciously.
Requirements of a major secondary endpoint
#1. The secondary endpoint must be pre-specified.
Large RCTs will publish a data-analysis plan before embarking on the study. Secondary endpoints should be pre-specified. This prevents post-hoc fishing for outcomes which can easily turn out to be positive by chance alone (1).
#2. It could conceivably have been the primary endpoint.
For most studies, there are a few endpoints that could be chosen as the primary endpoint. A major secondary endpoint should be one of these. That is, it is conceivable that the study could have been justified and published with that endpoint pre-identified as the primary endpoint, rather than a secondary endpoint.
#3. The entire study population must be included in the analysis.
A major secondary endpoint shouldn't originate from a subgroup analysis.
#4. Only 1-2 major secondary endpoints.
Major secondary outcomes should be limited to 1-2 outcomes, which are intimately related to the primary outcome and to the general hypothesis being tested. Limiting the number of major secondary endpoints ensures that they are truly of central importance.
#5. Major secondary endpoints shouldn't be more distal from the intervention than the primary endpoint.
The power to detect an effect decreases as we examine endpoints which are more distally removed from the intervention. To achieve sufficient power, a major secondary endpoint should be at an equivalent or greater proximity to the intervention compared to the primary endpoint. For example:
- If a study is powered to examine a primary endpoint of ventilator-free days, it probably won't be adequately powered to evaluate mortality.
- If a study is powered to examine a primary endpoint of mortality, it probably will be adequately powered to evaluate ventilator-free days.
#6. The original study should be large and high quality.
Secondary endpoints are only as good as the study in which they are examined. The better the study, the more robust the secondary endpoints will be. A major secondary endpoint from a large, rigorous, multicenter RCT is likely to be more convincing than the primary endpoint of a small, poorly designed, single-center RCT.
Statistical issues regarding type-I & type-II error
Secondary endpoints involve statistical pitfalls, but these may often be overcome.
Type I error (“false-positive result”)
Type-I error refers to the risk of incorrectly concluding that there is a treatment effect, when in fact none exists. It’s easy to run into problems with type-I error using secondary endpoints. If numerous secondary endpoints are investigated, then it becomes increasingly likely that one will be positive simply due to random chance.
There are various ways to correct for performing multiple statistical tests. A common strategy is the Bonferroni correction, where the target p-value for each individual test is set equal to 0.05 divided by the number of tests being run (e.g. if two tests are being performed, the p-value cutoff would be 0.05/2=0.025). Davis 1997 suggested that for secondary endpoints, the p-value cutoff should be 0.05/(k+1), where k is the number of secondary endpoints being evaluated (2). This would yield a cutoff of p<0.025 for one major secondary endpoint, or roughly p<0.017 for two; this post uses the rounder and more conservative p<0.01 for two (3).
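The arithmetic behind these cutoffs can be sketched in a few lines of Python. This is only an illustration of the two formulas quoted above; the function names are my own, and nothing here comes from a published statistical analysis plan. Note that the Davis formula with k=2 gives about 0.0167, which is slightly looser than the p<0.01 figure used in this post.

```python
def bonferroni_cutoff(n_tests, alpha=0.05):
    """Per-test p-value cutoff under a Bonferroni correction: alpha / n_tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def davis_cutoff(k_secondary, alpha=0.05):
    """Cutoff for each secondary endpoint as quoted in the post: alpha / (k + 1),
    where k is the number of secondary endpoints being evaluated."""
    if k_secondary < 1:
        raise ValueError("k_secondary must be at least 1")
    return alpha / (k_secondary + 1)


if __name__ == "__main__":
    print("Bonferroni, 2 tests:", bonferroni_cutoff(2))
    for k in (1, 2):
        print(f"Davis, k={k}: {davis_cutoff(k):.4f}")
```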
One potential problem arises here. Suppose the primary endpoint is judged using a p<0.05 cutoff and two major secondary endpoints are judged using a p<0.01 cutoff. If the intervention doesn't work, the combined probability of incorrectly concluding that it worked based on at least one outcome is roughly 7% (0.05 + 0.01 + 0.01 = 0.07), rather than 5%. A possible solution to this is as follows: If the primary endpoint is technically “positive” with a borderline level of significance (e.g. p=0.03-0.05), but this result is contradicted by secondary endpoints, then the result of the primary endpoint should be rejected (4). By allowing secondary endpoints to veto the primary endpoint, the risk of type-I error using the primary endpoint is reduced below 5%. This allows the combined type-I error of the primary plus major secondary endpoints to remain roughly 5% (5). Please note that this concept of using clinical judgement rather than a rigidly pre-specified statistical analysis plan is extremely unorthodox. However, it may at times reflect the way clinicians actually interpret studies.
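The 7% figure can be checked with a short calculation. The sketch below is an illustration under a simplifying assumption that I am adding: the three tests are treated as statistically independent, which endpoints measured in the same patients usually are not. The simple sum of the cutoffs is an upper bound that holds whatever the correlation, and the independent-test figure lands just under it. Function names are hypothetical.

```python
def familywise_error(cutoffs):
    """Chance of at least one false positive when no true effect exists,
    ASSUMING the tests are independent (a simplification)."""
    p_no_false_positive = 1.0
    for cutoff in cutoffs:
        if not 0.0 <= cutoff <= 1.0:
            raise ValueError("each cutoff must lie between 0 and 1")
        p_no_false_positive *= 1.0 - cutoff
    return 1.0 - p_no_false_positive


def familywise_upper_bound(cutoffs):
    """Sum of the cutoffs: an upper bound that does not require independence."""
    return min(1.0, sum(cutoffs))


if __name__ == "__main__":
    # Primary endpoint at 0.05, two major secondary endpoints at 0.01 each.
    cutoffs = [0.05, 0.01, 0.01]
    print(f"independent tests: {familywise_error(cutoffs):.4f}")
    print(f"upper bound:       {familywise_upper_bound(cutoffs):.4f}")
```

Either way, the combined false-positive risk sits near 7%, which is the excess over 5% that the veto rule is meant to claw back.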
Type II error (“false-negative result”)
The study isn’t designed to ensure that secondary endpoints are adequately powered. This creates the possibility that secondary endpoints could be underpowered, leading to type-II error. Focusing on more proximal secondary endpoints should tend to avoid this problem (see #5 above).
Whenever possible, the power of the secondary endpoint should also be evaluated directly. One simple way to gauge this is to look at how wide the confidence intervals are. If the confidence intervals are wide, then a broad range of results is possible and the test is underpowered. Alternatively, if the confidence interval is narrow and centered on an effect size of zero, then it can be more confidently concluded that the result is truly negative.
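To make the confidence-interval check concrete, here is a rough sketch that computes a 95% interval for an absolute risk difference. It uses the simple Wald (normal approximation) interval, which is my choice for illustration rather than anything specified in this post, and the event counts are invented, not taken from any trial.

```python
import math


def risk_difference_ci(events_a, n_a, events_b, n_b, z=1.96):
    """Wald 95% confidence interval for the risk difference (group A minus group B).
    Returns (difference, lower bound, upper bound)."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, diff - z * se, diff + z * se


if __name__ == "__main__":
    # Hypothetical data: identical 30% event rates in both arms, two trial sizes.
    for label, n in (("small trial", 100), ("large trial", 2000)):
        events = int(0.3 * n)
        diff, low, high = risk_difference_ci(events, n, events, n)
        print(f"{label}: difference {diff:+.3f}, 95% CI {low:+.3f} to {high:+.3f}")
```

With these made-up numbers, both trials show a difference of zero, but the small trial's interval spans roughly plus or minus 13 percentage points while the large trial's spans roughly plus or minus 3. Only the latter supports calling the result truly negative.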
- No consensus currently exists regarding how secondary endpoints in large clinical trials should be interpreted.
- Secondary endpoints don't have the same evidence quality as a primary endpoint, but they shouldn't necessarily be ignored.
- It might be possible to designate 1-2 secondary endpoints satisfying several criteria as “major secondary endpoints.” With cautious statistical interpretation, these endpoints may be able to provide a fair quality of evidence.
- There is a disconnect between the way statisticians design trials and the way clinicians interpret them. More work is needed to help us interpret secondary endpoints in a way which is rigorous, but also sensible.
Acknowledgement: Thanks to Dr. Gilman Allen for thoughtful comments on this post.
Notes
(1) Ideally, important secondary endpoints and co-primary endpoints would be identified a priori, along with a statistical plan for managing this multiplicity of endpoints. However, this often doesn't occur.
(2) A consensus article regarding the analysis of multiple endpoints in clinical trials noted that there are various approaches for analyzing secondary endpoints, with no clear consensus regarding which is best (Turk 2008). Three strategies were described, involving a total amount of alpha-error of 0.05, between 0.05-0.1, or 0.1. The method described here represents a middle ground among these various strategies. Please note that in common practice, the issue of multiple comparisons is often entirely overlooked when analyzing secondary endpoints.
(3) Of course, the truth is that all of these cutoffs are arbitrary (including the mighty p<0.05 cutoff itself). Ultimately, any results must be interpreted in the light of clinical judgement (accounting for the pre-test probability of the hypothesis and the p-value itself).
(4) The concept of secondary endpoints vetoing the primary endpoint may seem very strange, but this happens occasionally when secondary safety endpoints demonstrate harm due to an intervention. In such a situation, a substantial signal for harm might outweigh a mildly positive signal for benefit observed in the primary endpoint.
(5) One concept being employed here is called “alpha spreading.” Overall, the total likelihood of a false-positive conclusion should be kept at ~5%; this total is “alpha,” the risk of a false-positive conclusion. Traditionally, this entire 5% is spent on the primary endpoint. However, it may also be valid to spend only part of the 5% on the primary endpoint, reserving a bit for the secondary endpoints (D'Agostino 2000).
Thanks a lot for this post, as I tend to struggle with these issues and was pondering this exact thing yesterday while reading Tumlin's e-print of a post-hoc finding from ATHOS-III:
https://www.ncbi.nlm.nih.gov/m/pubmed/29509568/
Where would you categorize this paper according to the above? Biggest issue for me is the subgroup part. Best regards
Great question. I would classify this as hypothesis-generating only, for many reasons: 1) It's a subgroup analysis (which is more problematic than merely being a secondary endpoint). 2) It wasn’t published in the original study (secondary endpoints should be pre-planned and published in the original manuscript). 3) Running a Fisher Exact Test on the raw data shows that 28-day mortality has a p-value of 0.03 with a fragility index of 3. That’s not bad, but not something I’d consider significant as a secondary post-hoc subgroup outcome. If you consider the numerous post-hoc analyses that were probably performed, given the multiplicity…
ha! very good Josh. i think many of us “out in the field” would like to think “just give me the bottom line, without all the pee-stuff…”, but it is that P-value stuff, and the power of the studies that make a practical conclusion worthwhile. and, as Victor’s question below illustrates, we need to be very careful in evaluating primary and secondary results/conclusions. on a second note, i am very hopeful regarding the Angiotensin II med. we shall see. i have been very hopeful regarding presidential elections in the past twenty years, and that hasn’t always panned…
Thanks, I’m hopeful too that angiotensin-II will work. However, for now I don’t think we have enough data to justify spending a lot of money on it or using it on patients who can be hemodynamically supported by other agents.
Excellent post as always! We recently reviewed the ADRENAL trial in our biweekly Journal Club. I feel most attendings viewed the trial as a negative trial that proves that corticosteroids should not be routinely used in septic patients. I read with interest your analysis of the trial and I felt that, based on this post, an analysis of the secondary endpoints definitely justifies their use, especially given the elegant design of the trial. Will try to integrate some of these concepts in my teaching 🙂 thank you again!
Thanks. Yep, opinions on ADRENAL trial are rather split. I think ADRENAL is a good example of a massive, high-quality study where the secondary endpoints should be paid attention to. For example, you could argue for the use of median time to shock resolution and median time to ICU discharge as major secondary endpoints. More thoughts on ADRENAL here: https://emcrit.org/pulmcrit/adrenal/