Algorithm Overview
COMBSS (Continuous Optimisation for Best Subset Selection) reformulates the discrete combinatorial problem of selecting the best \(k\) predictors as a continuous optimisation over the hypercube \([0,1]^p\) via a Boolean relaxation.
Frank-Wolfe method (method=’fw’)
The Frank-Wolfe homotopy algorithm (Algorithm 1 in the paper) operates as follows:
Initialise \(t = (k/p, \ldots, k/p)\) — centroid of the \(k\)-sparse simplex \(T_k\).
For \(i = 1, \ldots, N\) (homotopy loop):
Set \(\delta_i\) on a geometric schedule from \(\delta_{\min}\) to \(\delta_{\max}\).
Solve the ridge-penalised GLM inner problem at current \(t\).
Compute the Danskin gradient — no Hessian required.
Find the Frank-Wolfe vertex: \(s = \arg\min_{s \in T_k} \langle \nabla f(t), s \rangle\) (the \(k\) smallest gradient components).
Update: \(t \leftarrow (1 - \alpha) t + \alpha s\).
Select the \(k\) features with the largest final \(t\) values.
Key properties:
Gradient via Danskin’s envelope theorem: requires only a single ridge-penalised GLM solve per iteration — no Hessian computation.
Auto-calibrated penalty schedule: \(\delta_{\min}\) and \(\delta_{\max}\) are set from \(\lambda_{\max}(X^T X)\) estimated by power iteration, so no manual tuning is needed.
Scalable to \(p \gg n\) via the Woodbury identity (linear) or warm-started L-BFGS-B (logistic, multinomial).
Inner solvers by model type
Linear regression: closed-form solution with Woodbury identity when \(p > n\) for \(O(n^2 p)\) complexity instead of \(O(p^3)\).
Binary logistic regression: sklearn
LogisticRegressionwith L-BFGS-B via feature scaling to convert weighted ridge into standard ridge.Multinomial logistic regression: sklearn multinomial L-BFGS-B with baseline-category parameterisation. Gradient uses all \(C\) class coefficients to match the symmetric objective.
Original method: Adam with dynamic lambda grid
The original COMBSS method for linear regression (Moka et al. 2024) uses:
Adam optimiser to minimise the COMBSS objective function for a fixed penalty parameter \(\lambda\).
Dynamic lambda grid: starting from \(\lambda_{\max} = y^T y / n\), \(\lambda\) is halved until a subset of size \(\geq q\) is found. A second pass refines the grid by bisecting between adjacent lambdas that produced different subset sizes.
Validation: the best subset is selected from the grid by minimising MSE on a held-out validation set.
Note on the two lambdas
The parameter called “lambda” has different meanings in the two methods:
Original method: \(\lambda\) is the sparsity penalty in the COMBSS objective. A grid of \(\lambda\) values is searched, and each \(\lambda\) yields a different subset.
Frank-Wolfe method:
lam_ridgeis a ridge regularisation penalty on the coefficients in the inner solver. Sparsity is controlled by \(k\) (the model size), not by lambda. This parameter is typically 0 or small.
Intercept handling
In the Frank-Wolfe method, the intercept is handled internally and is not subject to selection. An intercept column is prepended automatically by the model classes.
In the original method, the intercept (if included) is treated the same as other features.