| Attribute | Specification |
|---|---|
| Treatment | Experimental therapy vs standard-of-care control, both administered until disease progression, unacceptable toxicity, or withdrawal of consent. |
| Population | Patients with the target indication meeting protocol-defined eligibility. |
| Endpoint (variable) | Time from randomization to progression or death from any cause (PFS). |
| Intercurrent events | Treatment discontinuation handled via treatment-policy strategy: events occurring after discontinuation are included in the analysis. |
| Population-level summary | Hazard ratio (experimental vs control) at the end of follow-up. |
Bayesian Adaptive Phase II Oncology Trial
Operating characteristics, real-data validation, and a mock SAP
Executive summary
We simulated 120,000 Phase II oncology trials (10,000 per scenario × 2 designs) under six pre-specified scenarios spanning hazard ratios (HR) from harmful (1.15) to strongly beneficial (0.55). A Bayesian response-adaptive design with one event-driven interim futility look (triggered when 12 events accumulate, about 30% information time under the design alternative HR = 0.70) was compared against a fixed group-sequential O'Brien-Fleming design that applies the identical final-stage z-boundary. Empirical Type I error was 0.025 for the adaptive design and 0.024 for the fixed design (MCSE 0.0015 each), both consistent with the 0.025 one-sided target. The adaptive design stopped early for futility in 47% of trials under the harmful HR and 35% under the null (5–8% expected sample-size savings in those scenarios) and ceded at most 1.2 percentage points of power to the fixed design across the beneficial scenarios. A parallel survival analysis on n = 1,002 TCGA-BRCA patients exercised the same survival toolkit on real data: Cox PH and a Bayesian Weibull AFT directionally agree that hormone-receptor positive status is protective (HR ≈ 0.56–0.70) and that each decade of age increases hazard (HR ≈ 1.17–1.28), with a Schoenfeld test flagging a proportional-hazards violation that motivates the parametric AFT cross-check.
1. Introduction
Phase II oncology programs face a power–efficiency trade-off: a fixed design sized for adequate power enrolls its full sample even when the effect is harmful or null, exposing more patients than necessary to an ineffective experimental therapy. Bayesian adaptive designs with response-adaptive randomization (RAR) and interim futility looks are recognized in the FDA guidance Adaptive Designs for Clinical Trials of Drugs and Biologics (2019) as appropriate tools, provided that operating characteristics are demonstrated by simulation across the plausible scenarios.
This report quantifies the operating characteristics of one such design against a fixed alternative for a hypothetical time-to-progression endpoint, and demonstrates the same survival analytic toolkit on real breast-cancer data from TCGA-BRCA.
2. Estimand (ICH E9(R1))
The primary estimand follows the ICH E9(R1) five-attribute structure. A “treatment policy” strategy is used for the most operationally common intercurrent event (treatment discontinuation), which is consistent with the FDA’s most-frequent recommendation for exploratory Phase II oncology trials.
3. Trial designs compared
| Feature | Fixed design | Adaptive design |
|---|---|---|
| Max sample size | 120 (60 / arm) | 120 |
| Allocation | 1:1 throughout | 1:1 until interim; RAR after |
| Interim look | None | Event-driven: fires when 12 observed events accumulate (30% information under H1) |
| Interim futility rule | — | Stop if P(HR < 0.7 \| data) < 0.20 |
| Final test statistic | Cox PH z = -log_hr / se | Cox PH z = -log_hr / se |
| Decision boundary | OBF z = 1.969 (final stage) | OBF z = 1.969 (final stage; no efficacy stopping at interim) |
| RAR scheme | — | Thompson-style: alloc_treat = max(0.2, min(0.8, sqrt(P(treat better)))), refit every 20 enrollees post-interim |
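The RAR row above reads as a clipped square-root transform of the posterior probability that treatment is better. A minimal sketch in R, assuming that probability is already supplied by the interim model; the function name `alloc_treat_prob` and its argument names are hypothetical, not taken from the report's code:

```r
# Clipped square-root allocation rule from the design table.
# p_treat_better: posterior P(treatment better | data), assumed to be
# supplied by the interim Bayesian model (it is not computed here).
alloc_treat_prob <- function(p_treat_better, lower = 0.2, upper = 0.8) {
  stopifnot(all(p_treat_better >= 0 & p_treat_better <= 1))
  pmax(lower, pmin(upper, sqrt(p_treat_better)))
}

alloc_treat_prob(c(0.01, 0.25, 0.50, 0.90))
# approximately 0.200 0.500 0.707 0.800
```

One consequence of the square root is that equipoise (P = 0.5) already allocates about 71% of new patients to treatment, since sqrt(0.5) ≈ 0.707.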
The adaptive design’s interim Bayesian model is an exponential survival model with weakly informative priors centered on the baseline truth: λ_c ~ Gamma(2, 80) (E[λ_c] = 0.025/month, matching the data-generating control hazard) and log HR ~ N(0, 1).
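For intuition, the interim quantity P(HR < 0.7 | data) can be approximated without MCMC. The sketch below is an illustration under stated assumptions, not the report's implementation: it takes per-arm event counts and total exposure (patient-months) as sufficient statistics for the exponential model, integrates λ_c out analytically against its Gamma(2, 80) prior, and evaluates the remaining one-dimensional posterior for log HR on a grid. The function name `post_prob_hr_below` and the example inputs are hypothetical.

```r
# Grid approximation to P(HR < hr_cut | data) under an exponential model.
# d_c, d_t: observed events in control / treatment (illustrative inputs)
# T_c, T_t: total exposure in patient-months per arm (illustrative inputs)
# Priors: lambda_c ~ Gamma(a, b) (shape, rate); log HR ~ N(0, prior_sd^2)
post_prob_hr_below <- function(d_c, T_c, d_t, T_t, hr_cut = 0.7,
                               a = 2, b = 80, prior_sd = 1) {
  theta <- seq(-5, 5, length.out = 4001)  # grid on log HR
  # Integrating lambda_c out of the joint posterior leaves
  #   p(theta | data) ∝ exp(d_t * theta) *
  #     (b + T_c + exp(theta) * T_t)^-(a + d_c + d_t) * N(theta; 0, prior_sd)
  log_post <- d_t * theta -
    (a + d_c + d_t) * log(b + T_c + exp(theta) * T_t) +
    dnorm(theta, mean = 0, sd = prior_sd, log = TRUE)
  w <- exp(log_post - max(log_post))      # stabilise before normalising
  w <- w / sum(w)
  sum(w[theta < log(hr_cut)])
}

# Illustrative interim with 12 events in total and more events on treatment
p_interim <- post_prob_hr_below(d_c = 4, T_c = 200, d_t = 8, T_t = 200)
p_interim < 0.20  # compare against the 0.20 futility threshold
```

With only 12 events the N(0, 1) prior carries real weight, which is why the choice of prior scale matters for how often the futility rule fires.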
4. Simulation methods
Data-generating model. Patients enroll as a Poisson process at rate 12/month (exponential inter-arrival times with mean 1/12 month). Event times are exponential with control hazard 0.025/month (0.30/year) and treatment hazard equal to the control hazard multiplied by the scenario's true HR. Administrative censoring is applied at 24 months from study start.
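The data-generating model can be sketched in a few lines of base R. This is a minimal illustration for the fixed 1:1 design only; the function name `simulate_trial`, its argument names, and the use of a randomly permuted 1:1 allocation vector are assumptions, not the report's code.

```r
# One simulated trial under the stated data-generating model (fixed 1:1 design).
simulate_trial <- function(n = 120, hr_true = 0.70, accrual_rate = 12,
                           lambda_c = 0.025, t_admin = 24) {
  # Calendar entry times (months): cumulative exponential inter-arrival gaps
  arrival <- cumsum(rexp(n, rate = accrual_rate))
  # Randomly permuted 1:1 allocation (assumption; 0 = control, 1 = treatment)
  arm <- sample(rep(c(0L, 1L), length.out = n))
  # Exponential event times; treatment hazard is control hazard times HR
  hazard  <- lambda_c * ifelse(arm == 1L, hr_true, 1)
  t_event <- rexp(n, rate = hazard)
  # Administrative censoring at t_admin months from study start
  t_cens <- pmax(t_admin - arrival, 0)
  data.frame(arrival = arrival, arm = arm,
             time  = pmin(t_event, t_cens),
             event = as.integer(t_event <= t_cens))
}

set.seed(20260513)
trial <- simulate_trial()
table(arm = trial$arm, event = trial$event)
```

Follow-up time is measured from each patient's own entry, so later enrollees have shorter maximum follow-up.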
Scenarios. Six pre-specified scenarios span the relevant HR range:
| scenario | hr_true | description |
|---|---|---|
| harmful | 1.15 | Treatment harms (HR > 1) |
| null | 1.00 | No effect (null, alpha control) |
| mild_effect | 0.85 | Modest treatment benefit |
| moderate_effect | 0.75 | Moderate treatment benefit |
| strong_effect | 0.65 | Strong treatment benefit |
| very_strong_effect | 0.55 | Very strong treatment benefit |
Reproducibility. All randomness is seeded from CONFIG$simulation$seed (20260513), with per-simulation seeds derived as seed + sim_id * 10 + as.integer(factor(design)). Each (scenario × design) cell runs 10,000 simulations, stride-split across 10 parallel GitHub Actions matrix shards. Each shard runs 1,000 simulations and uses the same per-simulation seeds as the un-sharded design, so the union of the shards is byte-identical to a single large run. Within each shard, furrr::future_pmap() parallelizes over two worker processes; the full 120,000-simulation publish run completes in about 20 minutes of wall time.
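The seed formula and the stride split can be illustrated as follows. This is a sketch under assumptions: the helper names `derive_seed` and `shard_ids` are hypothetical, the factor levels are fixed explicitly so that a single design string maps to the same integer it would receive inside the full design column (adaptive = 1, fixed = 2), and the stride pattern (shard k takes sim_ids k, k + 10, k + 20, ...) is one plausible reading of "stride-split".

```r
base_seed <- 20260513
designs   <- c("adaptive", "fixed")

# Per-simulation seed, following the formula quoted in the text.
# Explicit levels avoid factor("fixed") collapsing to level 1 when the
# function is called on a single string.
derive_seed <- function(sim_id, design) {
  base_seed + sim_id * 10 + as.integer(factor(design, levels = designs))
}

# Assumed stride split: shard k of n_shards takes sim_ids k, k + n_shards, ...
shard_ids <- function(shard, n_sims = 10000, n_shards = 10) {
  as.integer(seq(from = shard, to = n_sims, by = n_shards))
}

# Every sim_id lands in exactly one shard, so the union reproduces the full run
all_ids <- sort(unlist(lapply(1:10, shard_ids)))
identical(all_ids, 1:10000)

derive_seed(1, "fixed")  # 20260513 + 10 + 2
```

Because the seed depends only on sim_id and design, a simulation's result does not depend on which shard executes it.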
5. Operating characteristics
| scenario | design | hr_true | Pr(reject) ± MCSE | 95% CI | Pr(futility) | E[N] ± MCSE | E[events] | Mean HR | Bias log-HR |
|---|---|---|---|---|---|---|---|---|---|
| harmful | adaptive | 1.15 | 0.007 ± 0.0009 | (0.006, 0.009) | 0.47 | 111.0 ± 0.13 | 30.7 | 1.311 | +0.077 |
| null | adaptive | 1.00 | 0.025 ± 0.0015 | (0.022, 0.028) | 0.35 | 113.7 ± 0.11 | 33.2 | 1.149 | +0.078 |
| mild_effect | adaptive | 0.85 | 0.072 ± 0.0026 | (0.067, 0.077) | 0.23 | 116.3 ± 0.09 | 35.1 | 0.973 | +0.067 |
| moderate_effect | adaptive | 0.75 | 0.138 ± 0.0035 | (0.132, 0.145) | 0.16 | 117.5 ± 0.07 | 35.5 | 0.851 | +0.052 |
| strong_effect | adaptive | 0.65 | 0.251 ± 0.0043 | (0.243, 0.260) | 0.10 | 118.6 ± 0.05 | 35.5 | 0.727 | +0.034 |
| very_strong_effect | adaptive | 0.55 | 0.408 ± 0.0049 | (0.398, 0.417) | 0.05 | 119.3 ± 0.04 | 34.8 | 0.604 | +0.013 |
| harmful | fixed | 1.15 | 0.007 ± 0.0009 | (0.006, 0.009) | — | 120.0 ± 0.00 | 47.6 | 1.201 | -0.001 |
| null | fixed | 1.00 | 0.024 ± 0.0015 | (0.021, 0.027) | — | 120.0 ± 0.00 | 45.1 | 1.044 | -0.004 |
| mild_effect | fixed | 0.85 | 0.072 ± 0.0026 | (0.067, 0.077) | — | 120.0 ± 0.00 | 42.3 | 0.888 | -0.007 |
| moderate_effect | fixed | 0.75 | 0.141 ± 0.0035 | (0.134, 0.148) | — | 120.0 ± 0.00 | 40.4 | 0.784 | -0.010 |
| strong_effect | fixed | 0.65 | 0.255 ± 0.0044 | (0.247, 0.264) | — | 120.0 ± 0.00 | 38.4 | 0.680 | -0.013 |
| very_strong_effect | fixed | 0.55 | 0.420 ± 0.0049 | (0.411, 0.430) | — | 120.0 ± 0.00 | 36.3 | 0.576 | -0.018 |
Monte Carlo standard errors (MCSE) are reported alongside point estimates so the precision of each operating characteristic is explicit. For binomial proportions (rejection rate, futility probability) MCSE = √(p(1-p)/n); for expected sample size MCSE = SD/√n. At n = 10,000 sims per cell, MCSE on Type I error around 0.02 is ≈ 0.0014, comfortably under the 0.005 threshold typically required for design-paper claims.
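The two MCSE formulas are short enough to state as code. The sketch below uses hypothetical helper names (`mcse_prop`, `mcse_mean`, `wald_ci`) and assumes the tabulated 95% intervals are Wald intervals built from the binomial MCSE; that assumption reproduces the printed interval for the adaptive null row.

```r
# MCSE of a simulated proportion (rejection rate, futility probability)
mcse_prop <- function(p, n_sims) sqrt(p * (1 - p) / n_sims)

# MCSE of a simulated mean (e.g. expected sample size) from per-sim values
mcse_mean <- function(x) sd(x) / sqrt(length(x))

# Wald interval around a simulated proportion (assumed interval type)
wald_ci <- function(p, n_sims, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  p + c(-1, 1) * z * mcse_prop(p, n_sims)
}

mcse_prop(0.02, 10000)           # 0.0014
round(wald_ci(0.025, 10000), 3)  # 0.022 0.028
```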
5.1 Power curve
5.2 Expected sample size
5.3 Probability of futility stop
5.4 Summary heatmap
5.5 Group-sequential boundary cross-validation
The R {rpact} design (k = 2 stages, O’Brien-Fleming alpha spending, one-sided α = 0.025, 80% power) yields the following stage-wise boundaries:
| stage | info_fraction | efficacy_z_boundary | futility_z_boundary | cumulative_alpha_spent | cumulative_beta_spent |
|---|---|---|---|---|---|
| 1 | 0.3 | 3.9286 | -0.5229 | 0.0000 | 0.0193 |
| 2 | 1.0 | 1.9602 | NA | 0.0250 | 0.2000 |
The companion sas/seqdesign.sas produces the same boundaries via PROC SEQDESIGN with identical alpha/beta-spending settings, verifying the design specification across the R and SAS implementations.
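The stage-1 efficacy boundary can also be checked by hand in base R. The sketch below assumes that "O'Brien-Fleming alpha spending" means the Lan-DeMets O'Brien-Fleming-type spending function, α(t) = 2(1 − Φ(z_{1−α/2} / √t)); the helper name `obf_spend` is hypothetical. The final-stage boundary depends on the joint distribution of the two test statistics and is not reproduced here.

```r
alpha <- 0.025  # one-sided significance level
t1    <- 0.3    # information fraction at the interim

# Lan-DeMets O'Brien-Fleming-type cumulative alpha spent at information t
obf_spend <- function(t, alpha) {
  2 * (1 - pnorm(qnorm(1 - alpha / 2) / sqrt(t)))
}

alpha_1 <- obf_spend(t1, alpha)             # alpha spent at stage 1
z_1     <- qnorm(alpha_1, lower.tail = FALSE)  # stage-1 efficacy z-boundary

c(alpha_spent = alpha_1, z_boundary = z_1)
# alpha_spent is about 4e-05 (0.0000 to four decimals); z_boundary is about 3.93
```

The tiny amount of alpha spent at 30% information is why the final-stage boundary stays so close to the unadjusted 1.96.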
6. Real-data case study: TCGA-BRCA
The same survival analytic pipeline (KM + Cox PH + Bayesian AFT) is exercised on overall survival in n = 1,002 TCGA-BRCA patients (97 events, median follow-up 487 days), stratified by hormone-receptor status. In this section the labels HR+ and HR- denote hormone-receptor status, while an unsigned HR remains the hazard ratio. The case study demonstrates that the toolkit works on real, messier data; it does not validate the simulator's data-generating model, since the endpoint (overall survival in breast cancer) and population differ substantially from the simulator's hypothetical time-to-progression oncology trial.
6.1 Kaplan-Meier
| stratum | n.start | events | median_days | median_ci | logrank_p |
|---|---|---|---|---|---|
| HR- | 215 | 31 | 3,063 | (2854, NA) | 0.0303 |
| HR+ | 787 | 66 | 3,736 | (3418, NA) | 0.0303 |
6.2 Cox proportional hazards
| term | HR | std.error | statistic | p.value | lower95 | upper95 |
|---|---|---|---|---|---|---|
| hr_statusHR+ | 0.555 | 0.2210 | -2.66 | 0.00779 | 0.36 | 0.856 |
| age_decade | 1.280 | 0.0769 | 3.19 | 0.00141 | 1.10 | 1.490 |
HR+ status reduces the hazard of death by ~44% (HR 0.56, 95% CI 0.36–0.86); each decade of age at diagnosis raises the hazard by ~28% (HR 1.28, 95% CI 1.10–1.49). The Schoenfeld residual test flags a violation of the PH assumption for hr_status (p = 0.013).
A stratified Cox model (stratifying on hr_status to relax the PH assumption) was fit as a sensitivity analysis and retains a significant age-decade effect (HR 1.29 per decade, p = 0.001).
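As a consistency check, the hazard ratios and 95% limits in the Cox table can be recovered from the reported standard errors and z-statistics alone, since log HR = z × SE. A minimal sketch using only the tabulated values (the object names are illustrative):

```r
# Reported standard errors and Wald z-statistics from the Cox table
se <- c(hr_status = 0.2210, age_decade = 0.0769)
z  <- c(hr_status = -2.66,  age_decade = 3.19)

log_hr <- z * se            # coefficient on the log-hazard scale
crit   <- qnorm(0.975)      # two-sided 95% critical value

round(cbind(HR      = exp(log_hr),
            lower95 = exp(log_hr - crit * se),
            upper95 = exp(log_hr + crit * se)), 3)
```

The result agrees with the table up to rounding of the inputs.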
6.3 Bayesian Weibull AFT — parametric cross-check
The PH violation motivates a parametric AFT model that does not require proportional hazards. A Weibull AFT was fit in Stan (4 chains × 2,000 iterations, max R̂ = 1.003, min bulk ESS = 1,598):
| variable | mean | median | sd | 2.5% | 97.5% | rhat | ess_bulk | ess_tail |
|---|---|---|---|---|---|---|---|---|
| intercept | 9.140 | 9.130 | 0.2880 | 8.6100 | 9.7400 | 1 | 1,720 | 1,560 |
| beta[1] | 0.342 | 0.345 | 0.1410 | 0.0616 | 0.6190 | 1 | 2,540 | 2,100 |
| beta[2] | -0.158 | -0.158 | 0.0477 | -0.2540 | -0.0652 | 1 | 1,710 | 1,910 |
| shape | 1.600 | 1.600 | 0.1060 | 1.4000 | 1.8100 | 1 | 2,210 | 2,440 |
| time_ratio[1] | 1.420 | 1.410 | 0.2020 | 1.0600 | 1.8600 | 1 | 2,540 | 2,100 |
| time_ratio[2] | 0.855 | 0.854 | 0.0407 | 0.7760 | 0.9370 | 1 | 1,710 | 1,910 |
The AFT time ratios can be moved to the hazard scale by inversion, HR ≈ 1 / time_ratio. For a Weibull model this equals the proportional-hazards HR only when shape = 1; in general the Weibull correspondence is HR = time_ratio^(-shape), so with shape ≈ 1.6 the inverted time ratio understates the effect on the hazard scale. On the 1/TR scale the Cox and Bayesian estimates agree directionally but differ in magnitude: HR+ vs HR- 0.56 (Cox, 95% CI 0.36–0.86) vs 0.70 (Bayes 1/TR, 95% CrI 0.54–0.94). Part of the gap reflects this scale difference, and part is the expected behavior when PH is violated: Cox estimates a time-averaged hazard ratio, while any AFT-derived HR holds only under the parametric assumption. Both methods agree that HR+ status is significantly protective and that each decade of age is significantly risk-amplifying. The 95% intervals overlap but the point estimates differ, so the agreement is informative rather than reassuring.
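For a Weibull AFT with event times distributed as Weibull(shape, scale = exp(intercept + x'β)), the hazard is proportional to scale^(-shape), so the hazard-scale effect of a covariate is exp(-shape × β), which coincides with 1 / time_ratio only when shape = 1. The sketch below compares the two conversions using the posterior means from the table. It assumes the parameterization just stated (the Stan model itself is not shown in this report) and uses plug-in posterior means, so the results are illustrative point values rather than posterior summaries; a full analysis would apply the transformation to each posterior draw.

```r
# Posterior means from the AFT table (plug-in values, not posterior draws)
shape <- 1.600
beta  <- c(hr_pos = 0.342, age_decade = -0.158)

time_ratio <- exp(beta)            # AFT time ratio
hr_inverse <- 1 / time_ratio       # the 1/TR quantity
hr_weibull <- exp(-shape * beta)   # Weibull PH hazard ratio, TR^(-shape)

round(rbind(time_ratio, hr_inverse, hr_weibull), 3)
```

Under these assumptions the shape-adjusted conversion sits noticeably closer to the Cox estimates than the plain inverse does.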
A posterior-predictive KM overlay confirms the Weibull fit visually:
R̂ histogram for all model parameters:
7. Discussion
The adaptive design's value is operational, not statistical. Across the beneficial scenarios, its power is at most 1.2 percentage points lower than the fixed design's. The real benefit is enrollment savings under harmful and null scenarios: a 47% probability of stopping for futility under HR = 1.15 and a 35% probability under the null mean the adaptive design spares enrollment in roughly four out of ten futile trials, a clinically and ethically relevant outcome the fixed design cannot deliver.
The interim is event-driven at 30% information under H1. Triggering the interim when 12 observed events accumulate (≈ 30% of expected events under HR = 0.70) places the analysis inside the practical event-accrual window for an n = 120 / 24-month trial. A 50% information target was considered but, at the chosen sample size and event rate, almost never accumulated before end-of-study, reducing the design to “fixed with a near-dead futility check.” Sensitivity analysis across alternative information fractions is on the roadmap (see Limitations).
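The 30% figure can be reproduced from the data-generating model. The sketch below computes the expected number of events by end of study under 1:1 allocation, approximating entry times as uniform over the 10-month accrual window implied by 120 patients at 12/month; that approximation and the function name `expected_events` are assumptions made for illustration.

```r
# Expected events by the administrative cut-off under 1:1 allocation.
# Entry times are approximated as uniform on (0, accrual_months).
expected_events <- function(hr, n_per_arm = 60, lambda_c = 0.025,
                            accrual_months = 10, t_admin = 24) {
  p_event <- function(h) {
    # Average of 1 - exp(-h * (t_admin - a)) over a ~ U(0, accrual_months)
    1 - (exp(-h * (t_admin - accrual_months)) - exp(-h * t_admin)) /
      (h * accrual_months)
  }
  n_per_arm * (p_event(lambda_c) + p_event(lambda_c * hr))
}

d_h1 <- expected_events(0.70)
c(expected_events = d_h1, info_fraction_at_12 = 12 / d_h1)
# roughly 39.5 expected events, so 12 events is about 30% information
```

The same function gives about 45 events under the null, in line with the E[events] column for the fixed design.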
Cox PH and Bayesian Weibull AFT directionally agree on TCGA-BRCA. Both methods find HR+ status significantly protective and each decade of age significantly risk-amplifying. Point estimates differ on the HR scale (Cox 0.56 vs Bayes 1/TR 0.70 for HR+ vs HR-) because 1/TR equals the Weibull hazard ratio only when shape = 1, and because Cox estimates a time-averaged HR under a violated PH assumption while the AFT estimate depends on the parametric form. The agreement is therefore informative, demonstrating that the qualitative conclusions are robust to the modeling family, but should not be over-interpreted as numerical concordance.
Bias in the adaptive HR estimator. Both designs produce biased log-HR estimates, but in opposite directions:
| scenario | design | hr_true | bias log-HR | mean HR_est | true HR |
|---|---|---|---|---|---|
| harmful | adaptive | 1.15 | +0.077 | 1.311 | 1.15 |
| null | adaptive | 1.00 | +0.078 | 1.149 | 1.00 |
| mild_effect | adaptive | 0.85 | +0.067 | 0.973 | 0.85 |
| moderate_effect | adaptive | 0.75 | +0.052 | 0.851 | 0.75 |
| strong_effect | adaptive | 0.65 | +0.034 | 0.727 | 0.65 |
| very_strong_effect | adaptive | 0.55 | +0.013 | 0.604 | 0.55 |
| harmful | fixed | 1.15 | -0.001 | 1.201 | 1.15 |
| null | fixed | 1.00 | -0.004 | 1.044 | 1.00 |
| mild_effect | fixed | 0.85 | -0.007 | 0.888 | 0.85 |
| moderate_effect | fixed | 0.75 | -0.010 | 0.784 | 0.75 |
| strong_effect | fixed | 0.65 | -0.013 | 0.680 | 0.65 |
| very_strong_effect | fixed | 0.55 | -0.018 | 0.576 | 0.55 |
The fixed design shows a small negative log-HR bias in every scenario (magnitude 0.001–0.018), growing slightly as the true effect strengthens. This is on the order of ordinary small-sample Cox bias and is benign at this scale.
The adaptive design shows a larger positive bias in log-HR (0.013–0.078) that decreases as the true effect strengthens. Two mechanisms contribute:
- Futility-stop reporting. When a trial stops at interim, the reported HR is the posterior median from the interim Bayesian fit (Cox PH on the few-event interim data is unstable; see R/03 comments). The posterior is informed by a N(0, 1) log-HR prior, moderately weak but non-trivial when only ~12 events have accumulated. The posterior median is therefore pulled toward HR = 1, regardless of the data's true direction. Under harmful HR this drags the distribution of reported HRs toward 1 (away from the truth of 1.15); under benefit, futility rarely fires, so the contribution is small.
- RAR allocation-imbalance under benefit. Post-interim randomization allocates more to the apparently-winning arm. Under a true benefit, this increases events in the treatment arm disproportionately to control, modestly inflating the Cox HR estimate vs the unbiased target. The effect is bounded by the 20/80 allocation caps but is still visible in the strong/very-strong scenarios.
The bias is small relative to the effect sizes being estimated: roughly 2% of |log(HR_true)| for the very-strong scenario (0.013 against |log 0.55| ≈ 0.60), growing to a larger share as the truth approaches HR = 1 (harmful and null scenarios). In a real submission this magnitude is reportable but not disqualifying; an IPTW-weighted sensitivity analysis would be the standard companion.
Regulatory framing. Per FDA Adaptive Designs for Clinical Trials of Drugs and Biologics (2019), §IV.A, an adaptive design submission needs (i) pre-specified rules, (ii) Type I error control demonstrated by simulation, and (iii) bias quantified in the effect estimator. This report addresses all three: the OBF boundary and futility rule are specified in advance, the empirical Type I error is 0.025 (95% CI 0.022–0.028, consistent with the 0.025 nominal level), and the table above is the bias characterization required by (iii).
8. Limitations and design choices
- Phase II screening design. Maximum n = 120 with 24-month follow-up is deliberately small for a Phase II go/no-go trial; rpact's getSampleSizeSurvival says n ≈ 791 would be needed for 80% power at HR = 0.70 under this alpha-spending. Power at smaller effect sizes (HR 0.75 / 0.85) is correspondingly modest. This is by design, not a misconfiguration; a confirmatory trial would scale up.
- Futility threshold (P(HR < 0.7 | data) < 0.20) is operator-defined. A formal sensitivity analysis across alternative thresholds is on the roadmap.
- Cox PH on adaptive-trial data does not adjust for RAR-induced allocation imbalance. The empirical Type I error (0.025, 95% CI 0.022–0.028) is consistent with the 0.025 nominal level, so this is not a regulatory dealbreaker, but in a real submission an IPTW-weighted sensitivity analysis would accompany the primary Cox PH.
- TCGA-BRCA is a toolkit validation, not a data-generating-model validation. Overall survival in breast cancer differs in endpoint, population, and hazard shape from the simulator’s hypothetical time-to-progression trial. The TCGA section demonstrates that the same Stan / KM / Cox / AFT pipeline works on real, messier data — not that the simulator’s exponential data-generating model matches breast cancer biology.
- Stan compilation in testthat::test_dir is fragile. Sourcing rstan-heavy files repeatedly in one R process triggered “parser failed badly” / C-stack errors; tests therefore run each file in its own Rscript subprocess (tests/testthat.R).
9. SAP excerpt
A standalone mock Statistical Analysis Plan section is in report/sap_section.qmd (rendered separately). It follows the standard ICH E9-aligned outline (objectives, estimand, hypotheses, sample size, primary analysis, missing-data handling, sensitivity analyses, safety).
10. Reproducibility
make sims # runs all 12,000 trial simulations (~100 s, 4 workers)
make tcga # fits KM, Cox, Bayes AFT on TCGA-BRCA (~30 s)
make report # renders this document and sap_section.qmd
make all # the lot
Random seeds are derived from CONFIG$simulation$seed = 20260513. CI runs a reduced (--n-sims 100) version of the pipeline on every push.
11. References
- ICH E9(R1). Statistical Principles for Clinical Trials, Addendum on Estimands and Sensitivity Analyses, 2019.
- FDA. Adaptive Designs for Clinical Trials of Drugs and Biologics, Guidance for Industry, November 2019.
- O’Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics 1979; 35: 549–556.
- Cox DR. Regression models and life-tables. J R Stat Soc B 1972; 34: 187–220.
- Wassmer G, Brannath W. Group Sequential and Confirmatory Adaptive Designs in Clinical Trials. Springer, 2016. (Background for {rpact}.)
- Stan Development Team. Stan Reference Manual, v2.32, 2023.
- TCGA-BRCA: The Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature 2012; 490: 61–70.