Mortality: Proportional hazards

A/E diagnostics are important but, if we have any mortality experience data, we should use it to develop a model, even if it’s nothing more than a simple how-much-heavier-or-lighter-is-the-mortality-of-this-population-than-average model. Otherwise, we’re not making full use of the available information.

There are lots of possible approaches, including complex parametric formulas designed to capture all typically observed effects. But I promised concision and so in this article I’ll expound what I think is simultaneously one of the most powerful general approaches and one of the simplest. And the beauty of it is: we’ve already done most of the work.

To make this concrete, let’s assume we have a mortality model \(\mu(\beta)\), where \(\beta\) is a vector of real parameters[1], and that \(\mu(\beta)\) is differentiable with respect to \(\beta\).

A standard way to estimate \(\beta\) is to choose the value that maximises the log-likelihood, i.e.

\[\hat\beta = \underset{\beta}{\arg\max}\, L(\beta)\tag{5}\]

where \(L\) is the log-likelihood and the parameter \(\beta\) is passed through to the mortality \(\mu(\beta)\).

We saw in the previous article that the log-likelihood is

\[L=\text{A}w\log\mu-\text{E}w\tag{4}\]

where \(w\ge0\) is an optional weighting variable.
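For concreteness, here is a minimal sketch of equation \((4)\) in code. It assumes the experience data have been discretised into person-time cells, each carrying an exposure time, a death indicator, a hazard \(\mu\) and an optional weight \(w\); the discretisation and all names are illustrative assumptions of mine, not part of the framework itself.

```python
import numpy as np

def log_likelihood(mu, exposure, died, w=None):
    """Equation (4): L = A w log(mu) - E w.

    A f sums f over the cells with an observed death; E f accumulates
    f * mu * exposure over all cells (piecewise-constant hazard assumed).
    died is a boolean mask over cells.
    """
    w = np.ones_like(mu) if w is None else w
    A_term = np.sum(w[died] * np.log(mu[died]))  # A w log(mu)
    E_term = np.sum(w * mu * exposure)           # E w
    return A_term - E_term
```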

We can try solving equation \((5)\) by setting the derivative of \(L\) to zero, resulting in the vector equation

\[\frac{\partial L}{\partial\beta}(\hat\beta) = 0\tag{6}\]

Recalling that there is an implicit \(\mu(\beta)\) in the \(\text{E}\) operator, we can re-express the derivative as

\[\frac{\partial L}{\partial\beta}=\text{A}wX-\text{E}wX\tag{7}\]

where

\[X=\frac{1}{\mu}\frac{\partial \mu}{\partial\beta}\tag{8}\]

The proportional hazards model

The form of equation \((7)\) raises the question: what if the vector \(X\) were an object in its own right, independent of \(\beta\), i.e. \(\partial X/ \partial\beta=0\)?

Equation \((8)\) would then be a simple first-order differential equation in \(\mu\), equivalent to \(\partial\log\mu/\partial\beta=X\), for which we can write down the solution as

\[\mu(\beta) = \mu^\text{ref}\exp\Big(\beta^\text{T}X\Big)\tag{9}\]

where

  • \(\mu^\text{ref}\) is a mortality that does not depend on \(\beta\), and
  • \(\beta^\text{T}X\) means the ‘dot’ product of vectors \(\beta\) and \(X\), i.e. \(\sum_j \beta_j X_j\),

and don’t forget that \(\mu\), \(\mu^\text{ref}\) and the components of the covariate vector \(X\) are all variables and therefore also have implicit fact (\(i\)) and time (\(t\)) arguments.

Equation \((9)\) is the well-known proportional hazards model[2], with the elements of \(X\) being the covariates and the elements of \(\beta\) the fitted covariate weights.
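In code, equation \((9)\) is a one-liner. Here \(X\) is assumed (my layout, for illustration) to be a matrix with one row per person-time cell and one column per covariate:

```python
import numpy as np

def mu_ph(beta, mu_ref, X):
    """Equation (9): mu = mu_ref * exp(beta^T X), vectorised over cells."""
    return mu_ref * np.exp(X @ beta)
```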

Some observations:

  • Let’s first note that we were led to equation \((9)\) simply by writing the log-likelihood in terms of the \(\text{A}\) and \(\text{E}\) operators and exploiting the symmetry of equation \((7)\).

  • The name ‘proportional hazards’ is not ideal because hazard[3] rates combine using addition, i.e. \(\mu=\sum_i\mu^{(i)}\), not multiplication, i.e. \(\mu=\prod_i e^{\beta_iX_i}\). So the covariates do not relate to individual hazard rates; instead they are component effects used to build a model. The ultimate justification is the effectiveness – power and tractability – of the proportional hazards approach.

  • Finally, this is part of a bigger picture in which linear \(\log\mu\)[4] models are ubiquitous, from Gompertz, arguably the world’s first realistic mortality model (see the worked example below), via Lee-Carter[5] to the CMI Mortality Projections Model.
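As a quick worked example (mine, not from the original): for a Gompertz model \(\mu=e^{\alpha+\gamma x}\) at age \(x\), with \(\beta=(\alpha,\gamma)\), equation \((8)\) gives

\[X=\frac{1}{\mu}\frac{\partial\mu}{\partial\beta}=\begin{pmatrix}1\\x\end{pmatrix}\]

which is indeed independent of \(\beta\), so Gompertz is precisely equation \((9)\) with \(\mu^\text{ref}\equiv1\) and covariates \(1\) and \(x\).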

We still need to solve equation \((6)\), which, in general, we have to do numerically. The good news is that, provided we use a proportional hazards model, we can write down the first and second derivatives of the log-likelihood in closed form. That in turn means we can use Newton–Raphson, which, in my experience, is robust[6] and beats most other numerical methods hands down[7]. The vector first derivative (from equation \((7)\) above) and the matrix second derivative are

\[\begin{aligned}
L'&=\text{A}wX-\text{E}wX
\\[1em]
L''&=-\text{E}wXX^\text{T}
\end{aligned}\]

where I have used \('\) to indicate \(\partial/\partial\beta\).

At risk of repetition, note (a) the concision and (b) that everything can be expressed in terms of \(\text{A}\) and \(\text{E}\), which means we’re re-using existing machinery for these calculations.
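Putting the pieces together, here is a sketch of the Newton–Raphson loop in the same illustrative discretisation as the earlier sketches. The initial damping of footnote [6] is omitted for brevity, and the function name and signature are my own assumptions:

```python
import numpy as np

def fit_ph(mu_ref, X, exposure, died, w=None, tol=1e-10, max_iter=25):
    """Maximise L over beta using the closed-form L' and L'' above."""
    n, k = X.shape
    w = np.ones(n) if w is None else w
    beta = np.zeros(k)
    for _ in range(max_iter):
        mu = mu_ref * np.exp(X @ beta)        # equation (9)
        e = w * mu * exposure                 # per-cell E weights
        grad = X[died].T @ w[died] - X.T @ e  # L'  = A wX - E wX
        hess = -(X * e[:, None]).T @ X        # L'' = -E w X X^T
        step = np.linalg.solve(hess, grad)    # Newton step (L'')^{-1} L'
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

If np.linalg.solve fails here because \(L''\) is singular, that is the identifiability diagnostic of footnote [7] making itself known.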

Information budget

Many expositions of the proportional hazards model do not include \(\mu^\text{ref}\), i.e. a given ‘background’ hazard rate, but for mortality analysis including one is often optimal. For instance:

  • If you’re analysing DB pension plan mortality experience over the last ten years, you probably don’t want to be trying to calibrate a mortality trends model[8] at the same time, which you can avoid by putting your pre-existing mortality trends assumption into \(\mu^\text{ref}\) (sketched at the end of this section).

  • I’d suggest going further and modelling variation from a reasonable default mortality (including trends) so that you inherit a priori sensible behaviour[9] from that default.

  • At the most extreme, if you have a postcode mortality model to hand, then use that as \(\mu^\text{ref}\)[10].

In general, you want to spend the information budget provided by the experience data on fitting the things you don’t yet know, rather than on refitting things you already do. The proportional hazards model makes this straightforward.
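As a toy illustration of this (my construction entirely): fold a base table and an assumed flat improvement rate into \(\mu^\text{ref}\), so the experience data is spent on \(\beta\) alone rather than on refitting trends:

```python
# Illustrative only: a base-table hazard with a flat 1.5% p.a. assumed
# improvement folded into mu_ref. In practice the trends assumption would
# come from, e.g., the CMI Mortality Projections Model.
def mu_ref_with_trend(mu_base_table, years_since_base, improvement=0.015):
    return mu_base_table * (1.0 - improvement) ** years_since_base
```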

One model to rule them all

In my experience, the proportional hazards model is all you need in practice. The richness available from the infinite range of possible covariates, the sheer tractability of the approach, the straightforwardness of using a prior model and the interpretability of the results provide enough firepower to tackle any real-world mortality modelling problem.

Insight 8. Proportional hazards models are probably all you need for mortality modelling

The proportional hazards model

\[\mu(\beta) = \mu^\text{ref}\exp\Big(\beta^\text{T}X\Big)\]

is

  • highly tractable, and
  • sufficiently powerful to cope with almost all practical mortality modelling problems.


There is a lot more to this, e.g. how mortality varies between populations, whether we require additional procedures to select covariates initially, and so on; these are questions I hope to answer in due course.

But, for now, let’s take stock:

  • With the proportional hazards model, we have an excellent framework for creating mortality models.
  • And, by maximising log-likelihood, we can calibrate those models with relative ease.

Next article: Suddenly AIC

The obvious next question is: how should we choose between different models? This will be the subject of the next article.


  1. Recall that, as a mortality, \(\mu(\beta)\) also has implicit fact (\(i\)) and time (\(t\)) arguments, but these are not shown to ease the notational burden. 

  2. It is sometimes called the ‘Cox proportional hazards model’, but this is usually in a context (e.g. medical research) where the objective is to estimate an impact independently from the absolute mortality (or ‘hazard’) rate, which is the opposite of what we’re doing here – we are trying to calculate the absolute hazard rate. 

  3. ‘Hazard’ is the general term for what actuaries often call a ‘decrement’. 

  4. When you think about it, it is a little odd that \(\log\mu\) should be the natural metric when \(\mu\) itself is already the log of something. There is a very good explanation for this, but that’s to come. 

  5. OK, Lee-Carter is bi-linear. 

  6. Some initial damping may be required. 

  7. First, it converges quadratically, and, second, the cost of inverting \(L''\) (sometimes cited as a reason not to use Newton–Raphson) is typically very small compared with the cost of calculating \(L'\) and \(L''\). Failure to converge is often because the model being calibrated has an identifiability issue, resulting in \(L''\) being non-invertible, which in itself is a handy diagnostic.

  8. Mortality trends are hard to discern over a period as short as ten years, and breaking them down into age, period and cohort components is even harder. The CMI uses 41 years for its Mortality Projections Model and it still struggles. Trends also typically include some element of judgement or consensus view, to which you likely want to adhere in your general mortality modelling. 

  9. An important example is high age mortality, which is nigh impossible to determine from an individual pension plan’s experience data. By the very nature of high ages, there is little data and that data is often unreliable – even if the plan is huge, systemic data risk remains. So, in practice, you will need some sort of prior model for high age mortality.

    This is one of the reasons I set up the CMI’s High Age Mortality Working Party. The high quality – and prize-winning – reports (WP100 and WP122) by Steve Bale and his working party colleagues are recommended reading.

    Incidentally, ‘high age’ in a pension plan longevity context typically means over age 90 or so. 

  10. In this case, you’ll also need a means of determining the weight to place on the experience compared with your postcode model, but that’s a separate subject.