Algorithm Overview
==================

COMBSS (Continuous Optimisation for Best Subset Selection) reformulates the
discrete combinatorial problem of selecting the best :math:`k` predictors as a
continuous optimisation over the hypercube :math:`[0,1]^p` via a Boolean
relaxation.

Frank-Wolfe method (method='fw')
---------------------------------

The Frank-Wolfe homotopy algorithm (Algorithm 1 in the paper) operates as
follows:

1. **Initialise** :math:`t = (k/p, \ldots, k/p)` --- centroid of the
   :math:`k`-sparse simplex :math:`T_k`.

2. **For** :math:`i = 1, \ldots, N` (homotopy loop):

   - Set :math:`\delta_i` on a geometric schedule from :math:`\delta_{\min}`
     to :math:`\delta_{\max}`.
   - Solve the ridge-penalised GLM inner problem at current :math:`t`.
   - Compute the Danskin gradient --- no Hessian required.
   - Find the Frank-Wolfe vertex: :math:`s = \arg\min_{s \in T_k}
     \langle \nabla f(t), s \rangle` (the :math:`k` smallest gradient
     components).
   - Update: :math:`t \leftarrow (1 - \alpha) t + \alpha s`.

3. **Select** the :math:`k` features with the largest final :math:`t` values.

Key properties:

- **Gradient via Danskin's envelope theorem**: requires only a single
  ridge-penalised GLM solve per iteration --- no Hessian computation.
- **Auto-calibrated penalty schedule**: :math:`\delta_{\min}` and
  :math:`\delta_{\max}` are set from :math:`\lambda_{\max}(X^T X)` estimated
  by power iteration, so no manual tuning is needed.
- **Scalable to** :math:`p \gg n` via the Woodbury identity (linear) or
  warm-started L-BFGS-B (logistic, multinomial).

Inner solvers by model type
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- **Linear regression**: closed-form solution with Woodbury identity when
  :math:`p > n` for :math:`O(n^2 p)` complexity instead of :math:`O(p^3)`.
- **Binary logistic regression**: sklearn ``LogisticRegression`` with L-BFGS-B
  via feature scaling to convert weighted ridge into standard ridge.
- **Multinomial logistic regression**: sklearn multinomial L-BFGS-B with
  baseline-category parameterisation.  Gradient uses all :math:`C` class
  coefficients to match the symmetric objective.

Original method: Adam with dynamic lambda grid
-----------------------------------------------

The original COMBSS method for linear regression (Moka et al. 2024) uses:

1. **Adam optimiser** to minimise the COMBSS objective function for a fixed
   penalty parameter :math:`\lambda`.
2. **Dynamic lambda grid**: starting from
   :math:`\lambda_{\max} = y^T y / n`, :math:`\lambda` is halved until a
   subset of size :math:`\geq q` is found.  A second pass refines the grid by
   bisecting between adjacent lambdas that produced different subset sizes.
3. **Validation**: the best subset is selected from the grid by minimising MSE
   on a held-out validation set.

Note on the two lambdas
^^^^^^^^^^^^^^^^^^^^^^^^

The parameter called "lambda" has different meanings in the two methods:

- **Original method**: :math:`\lambda` is the **sparsity penalty** in the
  COMBSS objective.  A grid of :math:`\lambda` values is searched, and each
  :math:`\lambda` yields a different subset.
- **Frank-Wolfe method**: ``lam_ridge`` is a **ridge regularisation penalty** on the
  coefficients in the inner solver.  Sparsity is controlled by :math:`k`
  (the model size), not by lambda.  This parameter is typically 0 or small.

Intercept handling
------------------

In the Frank-Wolfe method, the intercept is handled internally and is **not** subject
to selection.  An intercept column is prepended automatically by the model
classes.

In the original method, the intercept (if included) is treated the same as
other features.