Viterbi Algorithm, Part 2: Decoding

This is my second post describing the Viterbi algorithm. As before, our presentation follows Jurafsky and Martin closely, merely filling in some details omitted in the text. (See here for the first post.)

As a quick refresher, we observe a sequence of words or symbols $w^{(1)}, w^{(2)}, \ldots, w^{(m)},$ assumed to belong to a vocabulary $V = \{ w_1, w_2, \ldots, w_{|V|} \}.$ Corresponding to these observations are a sequence of (unobserved) states $q^{(1)}, \ldots, q^{(m)},$ assumed to belong to a set $Q =\{q_1, \ldots, q_N \}.$ In addition, there are two special states, $q_0$ and $q_F,$ representing the beginning and end of the sequence. For convenience, we will use $w_i^j$ to denote the word sequence $w^{(i)} \cdots w^{(j)}$ and $q_i^j$ to denote the state sequence $q^{(i)} \cdots q^{(j)}.$

A Hidden Markov Model (HMM) is characterized by a set of transition probabilities $A$ and a set of emission probabilities $B.$ The transition probabilities include $a_{ij},$ the probability of transitioning from state $q_i$ to $q_j;$ $a_{0j},$ the probability of transitioning from $q_0$ to $q_j;$ and $a_{iF},$ the probability of transitioning from $q_i$ to $q_F.$ The emission probabilities are denoted $b_i(w_k),$ the probability of emitting word $w_k$ from state $q_i.$ In what follows, when we say “given the HMM”, what we mean is “given the transition and emission probabilities”.
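To make the notation concrete, here is a sketch of a tiny, made-up HMM in NumPy: two states, a three-word vocabulary, with all numbers invented for illustration (transitions into $q_F$ are omitted for simplicity):

```python
import numpy as np

# A hypothetical two-state HMM with a three-symbol vocabulary.
# a[i, j] is the probability of moving from state q_i to q_j,
# a0[j] is the probability of starting in q_j, and b[j, k] is the
# probability of emitting word w_k from state q_j.
a = np.array([[0.6, 0.4],
              [0.5, 0.5]])
a0 = np.array([0.8, 0.2])
b = np.array([[0.2, 0.4, 0.4],
              [0.5, 0.4, 0.1]])

# Each row of a and b, and a0 itself, must be a probability distribution.
assert np.allclose(a.sum(axis=1), 1.0)
assert np.allclose(b.sum(axis=1), 1.0)
assert np.isclose(a0.sum(), 1.0)
```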

Decoding

As in the last post, we will make the following assumptions:

  • Word Independence: $P\{ w_1^m \; | \; q_1^m \} = \prod_{t=1}^m P\{ w^{(t)} \; | \; q^{(t)} \}.$ This assumption is two-fold: it says first that words $w^{(i)}$ and $w^{(j)}$ are conditionally independent given $q_1^m,$ and second that $w^{(t)}$ is conditionally independent of $q^{(i)}$ given $q^{(t)},$ $i \neq t.$
  • Markov Assumption: $P\{ q_1^m \; | \; A, B \} = \prod_{t=1}^m P \{ q^{(t)} \; | \; q^{(t-1)} \}.$ This says that a state is conditionally independent of earlier states given the preceding state.
  • Known Transition and Emission Probabilities: the notation makes it clear, but it’s worth emphasizing that in this section we will assume the HMM is known. In practice this would not be the case! Our next post will address the learning problem.

The most likely sequence of states is $$ \hat{q}_0^m = \operatorname{argmax}_{q_0^m} P \{ q_0^m \; | \; w_1^m, A, B \}. $$

Applying the definition of conditional probability, $$ \begin{align} P \{ q_0^m \; | \; w_1^m, A, B \} = \frac{P \{ w_1^m, q_0^m \; | \; A, B\} }{P \{ w_1^m \; | \; A, B \}}, \end{align} $$ and since the denominator does not depend on the optimization variable, we can pull it out. Therefore, $$ \hat{q}_0^m = \operatorname{argmax}_{q_0^m} P \{ w_1^m, q_0^m \; | \; A, B \}. $$

Solving this optimization problem is in principle straightforward: try every possible sequence, calculate the probability of each, and keep track of the best one. But this means examining $N^m$ sequences, which is exponential in the sequence length and quickly becomes infeasible.
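For concreteness, here is a sketch of that exhaustive search; the two-state transition and emission probabilities below are made up purely for illustration:

```python
import itertools

import numpy as np

def brute_force_decode(w1m, a, a0, b):
    """Score all N**m state sequences and keep the most probable one.

    This is the naive exponential search; it is only feasible for toys.
    """
    N = len(a0)
    m = len(w1m)
    best_prob, best_seq = -1.0, None
    for seq in itertools.product(range(N), repeat=m):
        # P{w_1^m, q_1^m} = a0 * product of transition and emission terms.
        prob = a0[seq[0]] * b[seq[0], w1m[0]]
        for t in range(1, m):
            prob *= a[seq[t-1], seq[t]] * b[seq[t], w1m[t]]
        if prob > best_prob:
            best_prob, best_seq = prob, list(seq)
    return best_seq, best_prob

# Made-up two-state HMM with a three-word vocabulary.
a = np.array([[0.6, 0.4], [0.5, 0.5]])
a0 = np.array([0.8, 0.2])
b = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
print(brute_force_decode([2, 0, 2], a, a0, b))  # best: [0, 1, 0], p ~0.0128
```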

The Value Function

Instead, we will use a dynamic programming algorithm. The central quantity is $$ \nu_t(j) = \max_{q_0^{t-1}} P \{ w_1^t, q_0^{t-1}, q^{(t)} = q_j \; | \; A, B \}, $$ which we will call the value function. It is not immediately obvious why we would care about this! Before we explain what we’ll do with it, we will show that it is possible to compute the value function recursively.

First, the base case: $$ \begin{align} \nu_1(j) &= P \{ w^{(1)}, q^{(0)}, q^{(1)} = q_j \; | \; A, B \} \\ &= P \{ w^{(1)} \; | \; q^{(1)} = q_j, A, B \} \cdot P \{ q^{(1)} = q_j \; | \; q^{(0)}, A, B \} \\ &= b_j \left( w^{(1)} \right) \cdot a_{0j}. \end{align} $$

Now we will show that $$ \nu_t(j) = \max_i \left\{ b_j \left( w^{(t)} \right) \cdot a_{ij} \cdot \nu_{t-1}(i) \right \}, \quad t > 1. $$

Starting from the definition: $$ \begin{align} \nu_t(j) &= \max_{q_0^{t-1}} P \{ w_1^t, q_0^{t-1}, q^{(t)} = q_j \; | \; A, B \} \\ &= \max_{q_0^{t-1}} P \{ w^{(t)} \; | \; w_1^{t-1}, q_0^{t-1}, q^{(t)} = q_j, A, B \} \\ &\phantom{=} \hspace{30pt} \cdot P \{ w_1^{t-1}, q_0^{t-1}, q^{(t)} = q_j \; | \; A, B \} \\ &= \max_{q_0^{t-1}} P \{ w^{(t)} \; | \; q^{(t)} = q_j, A, B \} \\ &\phantom{=} \hspace{30pt} \cdot P \{ w_1^{t-1}, q_0^{t-1}, q^{(t)} = q_j \; | \; A, B \} \\ &= \max_{q_0^{t-1}} \left \{ b_j \left( w^{(t)} \right) \cdot P \{ w_1^{t-1}, q_0^{t-1}, q^{(t)} = q_j \; | \; A, B \} \right\}, \end{align} $$ where we applied the definition of conditional probability to get the second line, followed by the Word Independence assumption, and lastly the definition of the word emission probability.

Next, we’ll pull the word emission probability out of the max operator since it does not depend on the optimization variables: $$ \begin{align} \nu_t(j) &= b_j \left( w^{(t)} \right) \cdot \max_{q_0^{t-1}} P \{ w_1^{t-1}, q_0^{t-1}, q^{(t)} = q_j \; | \; A, B \} \\ &= b_j \left( w^{(t)} \right) \cdot \max_{q_0^{t-2}, q_i} P \{ w_1^{t-1}, q_0^{t-2}, q^{(t-1)} = q_i, q^{(t)} = q_j \; | \; A, B \} \\ &= b_j \left( w^{(t)} \right) \cdot \max_{q_0^{t-2}, q_i} P \{ q^{(t)} = q_j \; | \; w_1^{t-1}, q_0^{t-2}, q^{(t-1)} = q_i, A, B \} \\ &\phantom{=} \hspace{70pt} \cdot P \{ w_1^{t-1}, q_0^{t-2}, q^{(t-1)} = q_i \; | \; A, B \} \\ &= b_j \left( w^{(t)} \right) \cdot \max_{q_0^{t-2}, q_i} P \{ q^{(t)} = q_j \; | \; q^{(t-1)} = q_i, A, B \} \\ &\phantom{=} \hspace{70pt} \cdot P \{ w_1^{t-1}, q_0^{t-2}, q^{(t-1)} = q_i \; | \; A, B \} \\ &= b_j \left( w^{(t)} \right) \cdot \max_{q_0^{t-2}, q_i} \left\{ a_{ij} \cdot P \{ w_1^{t-1}, q_0^{t-2}, q^{(t-1)} = q_i \; | \; A, B \} \right \}, \\ \end{align} $$ where we applied the definition of conditional probability followed by the Word Independence and Markov assumptions, and finally the definition of the transition probabilities.

Finally, we will use the property that optimization problems can be split up to complete the derivation: $$ \begin{align} \nu_t(j) &= b_j \left( w^{(t)} \right) \cdot \max_{q_i} a_{ij} \cdot \max_{q_0^{t-2}} P \{ w_1^{t-1}, q_0^{t-2}, q^{(t-1)} = q_i \; | \; A, B \} \\ &= b_j \left( w^{(t)} \right) \cdot \max_{i} \left\{ a_{ij} \cdot \nu_{t-1}(i) \right\}. \end{align} $$

This recursive property allows us to calculate the value function for all $t$ and $j.$
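As a numerical sanity check, the sketch below computes the value function both ways on a made-up toy HMM: once via the recursion just derived, and once directly from the definition by maximizing over all state prefixes. The two should agree exactly.

```python
import itertools

import numpy as np

# Made-up toy parameters, used only to check the recursion numerically.
a = np.array([[0.6, 0.4], [0.5, 0.5]])
a0 = np.array([0.8, 0.2])
b = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
w = [2, 0, 2]
m, N = len(w), len(a0)

# Value function via the recursion (0-indexed: nu[t, j] is nu_{t+1}(j)
# in the text's notation).
nu = np.zeros((m, N))
nu[0] = a0 * b[:, w[0]]
for t in range(1, m):
    for j in range(N):
        nu[t, j] = b[j, w[t]] * max(a[i, j] * nu[t-1, i] for i in range(N))

# The same quantity computed directly from the definition: a max of the
# joint probability over all state prefixes ending in state j.
for t in range(m):
    for j in range(N):
        best = 0.0
        for prefix in itertools.product(range(N), repeat=t):
            seq = list(prefix) + [j]
            p = a0[seq[0]] * b[seq[0], w[0]]
            for s in range(1, t + 1):
                p *= a[seq[s-1], seq[s]] * b[seq[s], w[s]]
            best = max(best, p)
        assert np.isclose(nu[t, j], best)
```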

The Optimal State Sequence

In deriving the decoding algorithm, we will need the following property of optimization problems. Let $(\hat{x}, \hat{y}) = \operatorname{argmax}_{(x, y)} f(x, y).$ Let $g(y) = f(\hat{x}, y).$ Then $\hat{y} = \operatorname{argmax}_y g(y).$ The key point here is that if we (somehow) know the optimal value of one of the joint optimization variables, we can substitute it in the objective, optimize over the remaining variables, and we’ll get the same answer as if we optimized jointly. Splitting up optimization problems works the other way, too: let $h(x) = \max_y f(x, y).$ Then $\hat{x} = \operatorname{argmax}_x h(x).$
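A tiny numeric illustration of this splitting property, on an arbitrary made-up objective tabulated as a grid:

```python
import numpy as np

# f[x, y] for x in {0, 1}, y in {0, 1, 2}; values are made up.
f = np.array([[1.0, 5.0, 2.0],
              [4.0, 3.0, 6.0]])

# Joint optimization over (x, y).
x_hat, y_hat = np.unravel_index(np.argmax(f), f.shape)

# Split version: h(x) = max_y f(x, y); optimizing h over x alone
# recovers x_hat.
h = f.max(axis=1)
assert np.argmax(h) == x_hat

# And the other direction: fixing x = x_hat and optimizing over y
# recovers y_hat.
g = f[x_hat, :]
assert np.argmax(g) == y_hat
```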

Recall that $\hat{q}_0^m = \operatorname{argmax}_{q_0^m} P \{ w_1^m, q_0^m \; | \; A, B \}.$

If we let $h(q_j) = \max_{q_0^{m-1}} P \{ w_1^m, q_0^{m-1}, q^{(m)} = q_j \; | \; A, B \} = \nu_m(j)$, then we see that $\hat{q}^{(m)} = \operatorname{argmax}_j \nu_m(j)$.

Now suppose that we knew $\hat{q}_{t+1}^m$, somehow. We will show that we can calculate $\hat{q}^{(t)}$. Since we do know $\hat{q}^{(m)}$, that will allow us to calculate the remaining states recursively.

First, let $g(q_0^t) = P \{ w_1^m, q_0^t, q_{t+1}^m = \hat{q}_{t+1}^m \; | \; A, B \}.$ Then $\hat{q}_0^t = \operatorname{argmax}_{q_0^t} g(q_0^t),$ and $$ \begin{align} \hat{q}^{(t)} &= \operatorname{argmax}_{q_j} \max_{q_0^{t-1}} P \{ w_1^m, q_0^{t-1}, q^{(t)} = q_j, q_{t+1}^m = \hat{q}_{t+1}^m \; | \; A, B \} \\ &= \operatorname{argmax}_{q_j} \max_{q_0^{t-1}} P \{ w_{t+1}^m \; | \; w_1^{t}, q_0^{t-1}, q^{(t)} = q_j, q_{t+1}^m = \hat{q}_{t+1}^m, A, B \} \\ &\phantom{=} \hspace{70pt} \cdot P \{ w_1^{t}, q_0^{t-1}, q^{(t)} = q_j, q_{t+1}^m = \hat{q}_{t+1}^m \; | \; A, B \} \\ &= \operatorname{argmax}_{q_j} \max_{q_0^{t-1}} P \{ w_{t+1}^m \; | \; q_{t+1}^m = \hat{q}_{t+1}^m, A, B \} \\ &\phantom{=} \hspace{70pt} \cdot P \{ w_1^{t}, q_0^{t-1}, q^{(t)} = q_j, q_{t+1}^m = \hat{q}_{t+1}^m \; | \; A, B \} \\ &= \operatorname{argmax}_{q_j} \max_{q_0^{t-1}} P \{ w_1^{t}, q_0^{t-1}, q^{(t)} = q_j, q_{t+1}^m = \hat{q}_{t+1}^m \; | \; A, B \}, \end{align} $$ where we applied the definition of conditional probability, then Word Independence to simplify the first factor, and finally dropped that factor since it does not involve the optimization variables.

Next we apply the definition of conditional probability, followed by the Markov Assumption, and the definition of the value function, finding: $$ \begin{align} \hat{q}^{(t)} &= \operatorname{argmax}_{q_j} \max_{q_0^{t-1}} P \{ w_1^{t} \; | \; q_0^{t-1}, q^{(t)} = q_j, q_{t+1}^m = \hat{q}_{t+1}^m, A, B \} \\ &\phantom{=} \hspace{70pt} \cdot P \{q_0^{t-1}, q_{t+1}^m = \hat{q}_{t+1}^m \; | \; q^{(t)} = q_j, A, B \} \\ &\phantom{=} \hspace{70pt} \cdot P \{q^{(t)} = q_j \; | \; A, B \} \\ &= \operatorname{argmax}_{q_j} \max_{q_0^{t-1}} P \{ w_1^{t} \; | \; q_0^{t-1}, q^{(t)} = q_j, A, B \} \\ &\phantom{=} \hspace{70pt} \cdot P \{q_0^{t-1} \; | \; q^{(t)} = q_j, A, B \} \\ &\phantom{=} \hspace{70pt} \cdot P \{q_{t+1}^m = \hat{q}_{t+1}^m \; | \; q^{(t)} = q_j, A, B \} \\ &\phantom{=} \hspace{70pt} \cdot P \{q^{(t)} = q_j \; | \; A, B \} \\ &= \operatorname{argmax}_{q_j} \max_{q_0^{t-1}} P \{ w_1^{t}, q_0^{t-1}, q^{(t)} = q_j \; | \; A, B \} \\ &\phantom{=} \hspace{70pt} \cdot P \{q_{t+1}^m = \hat{q}_{t+1}^m \; | \; q^{(t)} = q_j, A, B \} \\ &= \operatorname{argmax}_{q_j} \nu_t(j) \cdot P \{q_{t+1}^m = \hat{q}_{t+1}^m \; | \; q^{(t)} = q_j, A, B \}. \end{align} $$

Finally, recall that we know $\hat{q}_{t+1}^m$. So suppose $\hat{q}^{(t+1)} = q_{k_{t+1}}$. Then: $$ \begin{align} \hat{q}^{(t)} &= \operatorname{argmax}_{q_j} \nu_t(j) \cdot P \{q_{t+1}^m = \hat{q}_{t+1}^m \; | \; q^{(t)} = q_j, A, B \} \\ &= \operatorname{argmax}_{q_j} \nu_t(j) \cdot P \{q^{(t+1)} = q_{k_{t+1}} \; | \; q^{(t)} = q_j, A, B \} \\ &\phantom{=} \hspace{50pt} \cdot P \{q_{t+2}^m = \hat{q}_{t+2}^m \; | \; q^{(t+1)} = q_{k_{t+1}}, A, B \} \\ &= \operatorname{argmax}_j \nu_t(j) \cdot a_{jk_{t+1}}, \end{align} $$ where the final step drops the factor that does not involve $q_j$ and applies the definition of the transition probabilities.

Thus, once we have calculated the value function, we can compute the optimal state sequence, starting from the end and working our way backwards.

Putting it all Together

The Python code below calculates the value function and uses it to determine the optimal state sequence. As an implementation note, we work with the logarithm of the value function to avoid numeric underflow.

import numpy as np


def viterbi(w1m, a, a0, b):
    """
    Parameters
    ----------
     w1m : list of length m
         Sequence of word indices; w1m[t] is the index of the t-th word
     a : N x N array
         State transition probabilities
     a0 : N x 1 array
         Initial transition probabilities
     b : N x |V| array
         Word emission probabilities

    Returns
    -------
     q_hat : list of length m
         The optimal state sequence, as state indices
    """
    m = len(w1m)
    N = len(a0)

    # Calculate log(nu)
    log_nu = np.zeros((m, N))
    for j in range(N):
        log_nu[0, j] = np.log(b[j, w1m[0]]) + np.log(a0[j])

    for t in range(1, m):
        for j in range(N):
            for i in range(N):
                nu_candidate = (
                    np.log(b[j, w1m[t]])
                    + np.log(a[i, j])
                    + log_nu[t-1, i]
                )
                if i == 0 or nu_candidate > log_nu[t, j]:
                    log_nu[t, j] = nu_candidate

    # Calculate q_hat
    q_hat = [None] * m
    best_qval = None
    for j in range(N):
        if j == 0 or log_nu[m-1, j] > best_qval:
            q_hat[m-1] = j
            best_qval = log_nu[m-1, j]

    for t in range((m-2), -1, -1):
        best_qval = None
        for j in range(N):
            qval_candidate = log_nu[t, j] + np.log(a[j, q_hat[t+1]])
            if j == 0 or qval_candidate > best_qval:
                q_hat[t] = j
                best_qval = qval_candidate

    return q_hat
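As a cross-check, here is a sketch of the same algorithm with the inner loops vectorized over states, using explicit back pointers stored during the forward pass instead of recomputing the backtrace from the value function. The toy parameters are made up for illustration; both formulations should decode the same sequence.

```python
import numpy as np

def viterbi_vectorized(w1m, a, a0, b):
    """Log-space Viterbi with the state loops vectorized in NumPy."""
    m, N = len(w1m), len(a0)
    log_a, log_a0, log_b = np.log(a), np.log(a0), np.log(b)
    log_nu = np.empty((m, N))
    back = np.empty((m, N), dtype=int)  # back[t, j]: best predecessor of j
    log_nu[0] = log_a0 + log_b[:, w1m[0]]
    for t in range(1, m):
        # scores[i, j] = log_nu[t-1, i] + log a_{ij}
        scores = log_nu[t-1][:, None] + log_a
        back[t] = scores.argmax(axis=0)
        log_nu[t] = log_b[:, w1m[t]] + scores.max(axis=0)
    # Backtrace from the best final state.
    q_hat = [int(log_nu[m-1].argmax())]
    for t in range(m - 1, 0, -1):
        q_hat.append(int(back[t, q_hat[-1]]))
    return q_hat[::-1]

# Made-up two-state HMM with a three-word vocabulary.
a = np.array([[0.6, 0.4], [0.5, 0.5]])
a0 = np.array([0.8, 0.2])
b = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
print(viterbi_vectorized([2, 0, 2], a, a0, b))  # → [0, 1, 0]
```

Storing back pointers trades a little memory for avoiding the extra pass over the transition matrix during the backtrace; which variant is preferable depends on N and m.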

Bob Wilson
Data Scientist

The views expressed on this blog are Bob’s alone and do not necessarily reflect the positions of current or previous employers.
