COMP4804: Algorithms II
$\newcommand{\R}{\mathbb{R}}\DeclareMathOperator{\E}{\mathbf{E}}\DeclareMathOperator{\deg}{deg}$
Fingerprinting

Here we look at a family of algorithmic techniques collectively called fingerprinting.

Freivalds' Algorithm

Remember the obvious $\Theta(n^3)$ algorithm for multiplying two $n\times n$ matrices, $A$ and $B$:

Multiply(A, B, n)
  for i = 1 to n
    for j = 1 to n
      C[i][j] = 0
      for k = 1 to n
        C[i][j] = C[i][j] + A[i][k]*B[k][j]

Suppose, instead, that someone gives us three $n\times n$ matrices $A$, $B$, and $C$ and claims that \[ A \times B = C \enspace. \] How can we verify that this is true?

The first thing to remember is that, for fixed $a,b\in\R$ with $a\neq 0$, the equation $ax=b$ has exactly one solution for $x$ (namely $x=b/a$).

a non-trivial linear equation in one variable has one solution

Recall that the dot product of two vectors $v=(v_1,\ldots,v_n)$ and $w=(w_1,\ldots,w_n)$ is \[ v\cdot w = \sum_{i=1}^n v_i\times w_i \enspace . \]

Fingerprint Lemma: Let $v=(v_1,\ldots,v_n)$ and $w=(w_1,\ldots,w_n)$ be two real-valued vectors with $v\neq w$, and let $r=(r_1,\ldots,r_n)$ be a vector whose values are integers chosen uniformly and independently at random from $\{1,\ldots,K\}$. Then \[ \Pr\{r\cdot v = r\cdot w\} \le 1/K \enspace . \]

Proof: Since $v\neq w$, there is at least one index $i$ such that $v_i\neq w_i$. Now, rewrite $r\cdot v = r\cdot w$ as \[ r_i(v_i-w_i) = \sum_{j\neq i} r_j(w_j - v_j) \enspace . \] With everything on the right-hand side fixed, there is only one value of $r_i$ that satisfies this equation, namely \[ r_i' = \frac{\sum_{j\neq i} r_j(w_j - v_j)}{v_i-w_i} \enspace . \] But $r_i$ is chosen uniformly at random from among $K$ different values, so the probability that $r_i=r_i'$ is at most $1/K$. ∎
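To see the lemma in action, here is a small Python experiment (a sketch; the vectors, the value of $K$, and the trial count are arbitrary choices of mine):

```python
import random

random.seed(1)  # for reproducibility

def fingerprint(r, v):
    # The fingerprint of v with respect to r is the dot product r·v.
    return sum(ri * vi for ri, vi in zip(r, v))

v = [1, 2, 3]
w = [3, 2, 1]   # v != w; here r·v = r·w exactly when r[0] = r[2]
K = 1000
trials = 10000

collisions = 0
for _ in range(trials):
    r = [random.randint(1, K) for _ in range(len(v))]
    if fingerprint(r, v) == fingerprint(r, w):
        collisions += 1

# The lemma promises Pr{r·v = r·w} <= 1/K = 0.001.
print(collisions / trials)
```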

We can think of $r\cdot v$ and $r\cdot w$ as the fingerprints of $v$ and $w$; each is a small quantity (a single number), and the two fingerprints are almost surely different if $v$ and $w$ are different.

Remember that we can multiply an $n$-vector $r$ by an $n\times n$ matrix $A$ in $O(n^2)$ time. You can read the code right off the formula for vector-matrix multiplication: if $x = rA$, then \[ x_i = \sum_{j=1}^n r_j A_{j,i} \]

for i = 1 to n
  x[i] = 0
  for j = 1 to n
    x[i] = x[i] + r[j]*A[j][i]

The result of this multiplication is another $n$-vector. That means we can compute $r\times A\times B$ in $O(n^2)$ time: first compute $v=rA$ and then compute $vB=rAB$. We can also compute $rC$ in $O(n^2)$ time. Clearly, if $A\times B = C$, then $r\times A \times B = r\times C$.

Lemma: If $A\times B\neq C$ and $r$ is chosen as above, then \[ \Pr\{rAB = rC\} \le 1/K \enspace . \]

Proof: Let $X=AB$. Since $X\neq C$, there must be some indices $i$ and $j$ such that $X_{i,j} \neq C_{i,j}$. In particular, column $j$ of $X$ is not equal to column $j$ of $C$. But $(rX)_j$ is just the dot product of $r$ with column $j$ of $X$, and $(rC)_j$ is the dot product of $r$ with column $j$ of $C$. Therefore, by the Fingerprint Lemma, $\Pr\{(rAB)_j = (rC)_j\} \le 1/K$, and so $\Pr\{rAB = rC\} \le 1/K$. ∎

To summarize, we have an algorithm (Freivalds' Algorithm) that runs in $\Theta(n^2)$ time and, given three $n\times n$ matrices $A$, $B$, and $C$, will:

  1. always declare that $A\times B=C$ when this is true; and
  2. when $A\times B\neq C$, incorrectly declare that $A\times B=C$ with probability at most $1/K$.
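Here is a sketch of the whole verifier in Python (the function name and the test matrices are my own; this is an illustration, not a definitive implementation):

```python
import random

def freivalds(A, B, C, n, K=2**30):
    # One round of Freivalds' algorithm: O(n^2) time overall.
    r = [random.randint(1, K) for _ in range(n)]

    def vec_times_mat(v, M):
        # (vM)_i = sum_j v[j] * M[j][i], an O(n^2) vector-matrix product.
        return [sum(v[j] * M[j][i] for j in range(n)) for i in range(n)]

    rAB = vec_times_mat(vec_times_mat(r, A), B)  # r(AB) computed as (rA)B
    rC = vec_times_mat(r, C)
    return rAB == rC  # correct if AB = C; wrong w.p. at most 1/K otherwise

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[19, 22], [43, 50]]     # C = AB
D = [[19, 22], [43, 51]]     # D != AB

print(freivalds(A, B, C, 2))  # True
print(freivalds(A, B, D, 2))  # False (w.p. at least 1 - 1/K)
```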

String Matching

Suppose we have a little string $p=p_1,\ldots,p_{m}$ and a big body of text $t=t_1,\ldots,t_{n}$. Here $m<n$ and we want to look for occurrences of $p$ in $t$. More precisely, we say that an index $i$ is a match if $t_{i+j-1}=p_j$ for all $j\in\{1,\ldots,m\}$.

We want to find the first match or all matches. The obvious algorithm looks like this:

for i = 1 to n-m+1
    j = 1
    while j <= m and p[j] = t[i+j-1]
        j = j + 1
    if j > m then output i

This algorithm outputs all matches and runs in $O(nm)$ time, and on some instances it runs in $\Omega(nm)$ time, even when there are no matches to output. This happens, for example, when $p=aaaa\ldots aab$ and $t=aaaa\ldots a$.

Kalai's Algorithm

Note that we can think of $p$ and $t$ as sequences of integers in the range $\{0,\ldots,k-1\}$ where $k$ is the alphabet size. Notice that, if we have a random vector $r\in\{1,\ldots,K\}^m$ of length $m$ then, by the Fingerprint Lemma, if $i$ is not a match, then \[ \Pr\{ r\cdot p = r\cdot (t_{i},\ldots,t_{i+m-1}) \} \le 1/K \enspace . \]

We can compute $r\cdot p$ in $O(m)$ time. Now, if we can compute $r\cdot (t_{i},\ldots,t_{i+m-1})$ for all $i\in\{1,\ldots,n-m+1\}$, then we have an efficient string matching algorithm. This kind of calculation is called the convolution of $r$ and $t$. That is, we want the convolution vector: \[ \begin{array}{rcl}r_1t_1&+\cdots+&r_mt_m\\ r_1t_2&+\cdots+&r_mt_{m+1}\\ r_1t_3&+\cdots+&r_mt_{m+2}\\ \vdots & & \vdots \\ r_1t_{n-m}&+\cdots+&r_mt_{n-1}\\ r_1t_{n-m+1}&+\cdots+&r_mt_{n}\end{array} \] We can compute this convolution vector in $O(n\log n)$ time using the Fast Fourier Transform (which we may discuss later in the course).

This means that we have an algorithm that runs in $O(n\log n)$ time and returns a list of integers $L$ such that:

  1. If $i$ is a match, then $i$ is in $L$.
  2. If $i$ is not a match, then $\Pr\{i\in L\} \le 1/K$.

That's already impressive; what's even more impressive is that this algorithm can also easily handle wildcards: these are locations in $p$ that match any character. For example, we may have a pattern like "p?st" that matches "past", "post", and "pest". This extension is trivial: at any index where the pattern has a wildcard, we just set the corresponding value in $r$ to zero. Then this character will not contribute to the fingerprint $r\cdot p$ or to $r\cdot (t_i,\ldots,t_{i+m-1})$.

Theorem: Given a pattern string $p$ of length $m$, possibly with wildcards, and a text string $t$ of length $n$, there is an algorithm that runs in $O(n\log n)$ time and returns a list of integers $L$ such that:

  1. If $i$ is a match, then $i$ is in $L$.
  2. If $i$ is not a match, then $\Pr\{i\in L\} \le 1/K$.

Of course, if we take $K \ge m$, then we can even verify each value of $i\in L$ in $O(m)$ time to see if it really is a match. The expected amount of time we will spend on non-matches is then only $O(n)$, since each non-match lands in $L$ with probability at most $1/K\le 1/m$ and costs $O(m)$ to check. [Exercise: write this down carefully.]
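Here is a minimal Python sketch of the fingerprinting idea with wildcard support (the '?' wildcard symbol, the function name, and the example strings are my own choices). For clarity it computes each window's fingerprint directly, which takes $O(nm)$ time; the FFT-based convolution is what brings this down to $O(n\log n)$:

```python
import random

def kalai_candidates(p, t, K=2**20):
    # Fingerprint matching with wildcards: r[j] is zeroed wherever the
    # pattern has a '?', so those positions contribute nothing.
    m, n = len(p), len(t)
    r = [0 if p[j] == "?" else random.randint(1, K) for j in range(m)]
    fp = sum(r[j] * ord(p[j]) for j in range(m) if p[j] != "?")
    L = []
    for i in range(n - m + 1):
        # O(m) per window here; the FFT computes all windows in O(n log n).
        if sum(r[j] * ord(t[i + j]) for j in range(m)) == fp:
            L.append(i)
    return L

L = kalai_candidates("p?st", "past is not pest or post")
print(L)  # the true matches 0, 12, 20 always appear; false positives
          # occur with probability at most 1/K each
```

(Strings are 0-indexed here, unlike the 1-indexed pseudocode above.)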

The Rabin-Karp Algorithm

Here are my original hand-written notes on this topic. Here's the algorithm implemented in C.

Now we can think of $p$ and $t$ as sequences of integers in the range $\{0,\ldots,k-1\}$, where $k$ is the alphabet size. We can compress $p$ into a single integer \[ I(p) = \sum_{j=0}^{m-1} k^{m-j-1}\times p_j \enspace , \] indexing $p$ from $0$ in this section. Just think of $p$ as the base-$k$ representation of an integer. For example, if our alphabet is $0,1,2,3,4,5,6,7,8,9$ and $p="10393"$, then $I(p)=10393$. We can compute $I(p)$ in $O(m)$ time:

x = 1
I = 0
for j = 1 to m
    I += p[m-j]*x
    x = x*k

We can restate our problem as that of trying to find the indices $i$ where \[ I(p) = I(t_i,\ldots,t_{i+m-1}) \enspace . \]

The nice thing about this is that we only need to compute $I(p)$ once, and we have a nice relation between consecutive substrings of $t$: \[ I(t_{i+1},\ldots,t_{i+m}) = kI(t_i,\ldots,t_{i+m-1}) - t_{i}k^m + t_{i+m} \enspace . \]

Ip = I(p)
Ir = I(t[0...m-1])
km = k to the power m
for i = 0 to n-m
    if Ip == Ir then output i
    if i < n-m then Ir = k*Ir - t[i]*km + t[i+m]
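We can sanity-check the recurrence with a short Python snippet (the digits and window length are made up):

```python
def I(s, k):
    # Interpret the sequence s as a base-k number, high digit first.
    x = 0
    for c in s:
        x = x * k + c
    return x

k, m = 10, 4
t = [1, 0, 3, 9, 3, 7, 2]   # so I(t[0:4]) = 1039

Ir = I(t[0:m], k)
for i in range(len(t) - m):
    # I(t[i+1..i+m]) = k*I(t[i..i+m-1]) - t[i]*k^m + t[i+m]
    Ir = k * Ir - t[i] * k**m + t[i + m]
    assert Ir == I(t[i + 1:i + 1 + m], k)

print(Ir)  # 9372, the fingerprint of the last window [9, 3, 7, 2]
```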

Great! This looks like a linear-time algorithm. There's only one problem: the values of Ip and Ir are huge. We can't store them in a single machine register, so those multiplications and additions each take (at least) $\Omega(m)$ time. We need to make things smaller still.

Number Theory to the Rescue

To do this, we're going to pick a prime number $P$ and do all our arithmetic modulo $P$, so all the numbers we deal with are in the range $\{0,\ldots,P-1\}$. If $P$ is small enough to fit into a register, then the algorithm definitely runs in $O(n)$ time, but is it correct? Not always: just because \[ I(p)\bmod P = I(t_i,\ldots,t_{i+m-1}) \bmod P \] doesn't mean $I(p)=I(t_i,\ldots,t_{i+m-1})$.

We're going to make sure that it works most of the time. The statement \[ x \bmod P = y \bmod P \] really means $x=aP + b$ and $y=a'P + b$, for some integers $a$, $a'$, and $b$. This means \[ x-y = (a-a')P \enspace , \] so we have to worry about the case where $x-y$ is a multiple of $P$. Since $P$ is prime, this only happens when $P$ is a prime factor of $|x-y|$. How many prime factors does $x-y$ have? Each prime factor is at least $2$, so if $|x-y|$ has $q$ prime factors, then $|x-y| \ge 2^q$. This means $q \le \log_2|x-y|$, so $|x-y|$ has at most $\log_2|x-y|$ prime factors.
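A quick empirical check of that bound, in Python (trial division is fine at this scale):

```python
import math

def distinct_prime_factors(x):
    # Factor x by trial division: fine for the small numbers used here.
    factors = set()
    d = 2
    while d * d <= x:
        while x % d == 0:
            factors.add(d)
            x //= d
        d += 1
    if x > 1:
        factors.add(x)
    return factors

# Every x >= 2 has at most log2(x) prime factors, since each factor is >= 2.
for x in range(2, 10**4):
    assert len(distinct_prime_factors(x)) <= math.log2(x)

print(len(distinct_prime_factors(2 * 3 * 5 * 7 * 11)))  # 5
```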

Now our $x$'s and $y$'s are integers in the range $\{0,\ldots,k^m\}$, so there are only $m\log_2 k$ prime factors we have to worry about. Now we need a result from number theory:

Theorem (Prime Number Theorem): The number of primes less than $N$ is about $N/\ln N$.

Hey, that's a lot of primes! We can use this:

Lemma: If $P$ is a prime number chosen uniformly at random from the set of all primes less than $N$ and $I(p) \neq I(t_{i},\ldots,t_{i+m-1})$, then \[ \Pr\{I(p)\bmod P = I(t_{i},\ldots,t_{i+m-1})\bmod P \} \le \frac{m\ln N\log_2 k}{N} \enspace . \]

Proof: We've already given it. There are about $N/\ln N$ choices for $P$, and at most $m\log_2 k$ of them cause $I(p)\bmod P = I(t_{i},\ldots,t_{i+m-1})\bmod P$. ∎

How big do we need $N$ to be? If we take $N > 2Q m\log m\log k$, then the probability in the lemma becomes at most $1/Q$. Usually, we would take $N$ to be about as big as we can comfortably fit into a machine word, say $N=2^{32}$ or $N=2^{64}$. Here's the whole thing in C.
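Putting the pieces together, here is a Python sketch of Rabin-Karp with modular arithmetic. The fixed prime $P=2^{31}-1$ is my own illustrative choice (the analysis above calls for $P$ chosen at random), and the explicit verification step makes the output exact:

```python
def rabin_karp(p, t, P=2_147_483_647):
    # Rabin-Karp with all arithmetic mod P. P should really be a random
    # prime; here a fixed Mersenne prime is used for illustration only.
    m, n = len(p), len(t)
    k = 256                      # alphabet size: treat characters as bytes
    km = pow(k, m, P)            # k^m mod P
    Ip = 0
    for c in p:
        Ip = (Ip * k + ord(c)) % P
    Ir = 0
    for c in t[:m]:
        Ir = (Ir * k + ord(c)) % P
    matches = []
    for i in range(n - m + 1):
        # Verify candidates in O(m) time to rule out false positives.
        if Ip == Ir and t[i:i + m] == p:
            matches.append(i)
        if i < n - m:
            Ir = (k * Ir - ord(t[i]) * km + ord(t[i + m])) % P
    return matches

print(rabin_karp("aba", "ababababa"))  # [0, 2, 4, 6]
```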

Miller-Rabin Primality Testing

If we want to test whether a number $P$ is prime, we can use a bit of number theory. Write $P-1$ as $2^s d$ for an integer $s$ and odd $d$. Now, if $P$ is prime, then for every $a\in\{1,\ldots,P-1\}$, \[ a^d \bmod P = 1 \quad\text{or}\quad a^{2^r d} \bmod P = P-1 \text{ for some } r\in\{0,\ldots,s-1\} \enspace . \]

On the other hand, if $P$ is odd and not prime, then for at least three-quarters of the values of $a\in\{1,\ldots,P-1\}$, neither of these two conditions holds.

So the test is easy: we pick a random $a$ and check whether one of these conditions holds. When $P$ is not prime, we will correctly detect it with probability at least $3/4$. If we want to be more certain, then we can repeat the experiment $k$ times, and we will correctly detect composites with probability at least $1-4^{-k}$ using an algorithm that requires only $O(k\log P)$ arithmetic operations modulo $P$.
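A Python sketch of the test as described (the function name and the round count are my own):

```python
import random

def miller_rabin(P, rounds=20):
    # Returns False if P is certainly composite; returns True if P is
    # prime with probability at least 1 - 4^(-rounds).
    if P in (2, 3):
        return True
    if P < 2 or P % 2 == 0:
        return False
    # Write P - 1 = 2^s * d with d odd.
    s, d = 0, P - 1
    while d % 2 == 0:
        s, d = s + 1, d // 2
    for _ in range(rounds):
        a = random.randint(2, P - 2)
        x = pow(a, d, P)              # a^d mod P, by fast exponentiation
        if x == 1 or x == P - 1:
            continue                  # this a is consistent with P prime
        for _ in range(s - 1):
            x = x * x % P             # now x = a^(2^r * d) mod P
            if x == P - 1:
                break
        else:
            return False              # witness found: P is composite
    return True

print(miller_rabin(2_147_483_647))    # True: 2^31 - 1 is prime
print(miller_rabin(561))              # almost surely False (561 = 3*11*17)
```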

Now the Rabin-Karp string-matching algorithm wants a prime chosen uniformly at random from among $\{2,\ldots,N\}$. We can get such a random prime by repeatedly choosing a random number in $\{2,\ldots,N\}$ and testing if that number is prime using the Miller-Rabin Algorithm. The Prime Number Theorem says that we can expect to find a prime after about $\ln N$ guesses.
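A sketch of that rejection-sampling loop in Python (it inlines a condensed Miller-Rabin test so the snippet is self-contained; names are my own):

```python
import random

def is_probably_prime(P, rounds=30):
    # Condensed Miller-Rabin test; error probability at most 4^(-rounds).
    if P in (2, 3):
        return True
    if P < 2 or P % 2 == 0:
        return False
    s, d = 0, P - 1
    while d % 2 == 0:
        s, d = s + 1, d // 2
    for _ in range(rounds):
        x = pow(random.randint(2, P - 2), d, P)
        if x in (1, P - 1):
            continue
        for _ in range(s - 1):
            x = x * x % P
            if x == P - 1:
                break
        else:
            return False
    return True

def random_prime(N):
    # Rejection sampling: by the Prime Number Theorem we expect to test
    # about ln(N) candidates before hitting a prime.
    while True:
        P = random.randint(2, N)
        if is_probably_prime(P):
            return P

print(random_prime(2**32))
```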

Theorem: There is a randomized algorithm that does $O(k\log^2 N)$ modular arithmetic operations on integers of maximum value $N$ and that chooses a prime number uniformly at random from $\{2,\ldots,N\}$.

Note that we're counting modular arithmetic operations here rather than running time because, for really big values of $N$, like those used in public-key cryptography, each arithmetic operation takes more than constant time.