Here we look at a family of algorithmic techniques collectively called fingerprinting.
Freivalds' Algorithm
Remember the obvious $\Theta(n^3)$ algorithm for multiplying two $n\times n$ matrices, $A$ and $B$:
Multiply(A, B, n)
    for i = 1 to n
        for j = 1 to n
            C[i][j] = 0
            for k = 1 to n
                C[i][j] = C[i][j] + A[i][k]*B[k][j]
Suppose, instead, that someone gives us three $n\times n$ matrices $A$, $B$, and $C$ and claims that \[ A \times B = C \enspace. \] How can we verify that this claim is true?
- The obvious thing to do is to compute $A\times B$ ourselves and check if the result equals $C$.
- The obvious matrix-multiplication algorithm takes $\Theta(n^3)$ time.
- Even the fastest (very complicated) algorithms take $\Theta(n^\omega)$ time where $\omega\approx 2.3728639$
- We want to do this in optimal $\Theta(n^2)$ time
The first thing to remember is that, for fixed $a,b\in\R$, $a\neq 0$, the equation $ax=b$ has exactly one solution for $x$ (namely $x=b/a$).
Recall that the dot product of two vectors $v=(v_1,\ldots,v_n)$ and $w=(w_1,\ldots,w_n)$ is \[ v\cdot w = \sum_{i=1}^n v_i\times w_i \enspace . \]
Fingerprint Lemma: Let $v=(v_1,\ldots,v_n)$ and $w=(w_1,\ldots,w_n)$ be two real-valued vectors with $v\neq w$, and let $r=(r_1,\ldots,r_n)$ be a vector whose values are integers chosen uniformly and independently at random from $\{1,\ldots,K\}$. Then \[ \Pr\{r\cdot v = r\cdot w\} \le 1/K \enspace . \]
Proof: Since $v\neq w$, there is at least one index $i$ such that $v_i\neq w_i$. Now, rewrite $r\cdot v = r\cdot w$ as \[ r_i(v_i-w_i) = \sum_{j\neq i} r_j(w_j - v_j) \enspace . \] With everything on the right-hand side fixed, there is only one value of $r_i$ that satisfies this equation, namely \[ r_i' = \frac{\sum_{j\neq i} r_j(w_j - v_j)}{v_i-w_i} \enspace . \] But $r_i$ is chosen randomly from among $K$ different values, so the probability that $r_i=r_i'$ is at most $1/K$. ∎
We can think of $r\cdot v$ and $r\cdot w$ as the fingerprints of $v$ and $w$; each is a small quantity (a single number), and the two fingerprints are almost surely different if $v$ and $w$ are different.
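To make this concrete, here is a minimal C sketch of fingerprint comparison. (The fixed choice K = 1000 and the use of rand() are placeholders for illustration; a careful implementation would use a better source of randomness.)

#include <stdio.h>
#include <stdlib.h>

#define K 1000   /* unequal vectors get equal fingerprints w.p. at most 1/K */

/* Compare v and w (length n) via the random fingerprints r.v and r.w.
 * Returns 1 if the fingerprints agree, 0 otherwise.  Note that both
 * fingerprints must be computed with the same random vector r. */
int fingerprints_agree(const long long *v, const long long *w, int n)
{
    long long rv = 0, rw = 0;
    for (int i = 0; i < n; i++) {
        long long r = 1 + rand() % K;   /* r_i uniform in {1,...,K} */
        rv += r * v[i];
        rw += r * w[i];
    }
    return rv == rw;
}

int main(void)
{
    long long v[] = {1, 2, 3, 4}, w[] = {1, 2, 3, 5};
    puts(fingerprints_agree(v, w, 4) ? "probably equal" : "definitely different");
    return 0;
}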
Remember that we can multiply an $n$-vector $r$ by an $n\times n$ matrix $A$ in $O(n^2)$ time. You can read the code right off the formula for vector-matrix multiplication: If $x = rA$, then \[ x_i = \sum_{j=1}^n r_j A_{j,i} \]
for i = 1 to n
    x[i] = 0
    for j = 1 to n
        x[i] = x[i] + r[j]*A[j][i]
The result of this multiplication is another $n$-vector. That means we can compute $r\times A\times B$ in $O(n^2)$ time; first compute $v=rA$ and then compute $vB=rAB$. We can also compute $rC$ in $O(n^2)$ time. Clearly, if $A\times B = C$, then $r\times A \times B = r\times C$.
Lemma: If $A\times B\neq C$, and $r$ is chosen as above, then \[ \Pr\{rAB = rC\} \le 1/K \enspace . \]
Proof: Let $X=AB$. Then, since $X\neq C$, there must be some indices $i$ and $j$ such that $X_{i,j} \neq C_{i,j}$. In particular, column $j$ of $X$ is not equal to column $j$ of $C$. But $(rX)_j$ is just the dot product of $r$ and column $j$ of $X$, and $(rC)_j$ is the dot product of $r$ and column $j$ of $C$. Therefore, by the Fingerprint Lemma, $\Pr\{(rAB)_j = (rC)_j\} \le 1/K$; and $rAB = rC$ can hold only if these two entries agree. ∎
To summarize, we have an algorithm (Freivalds' Algorithm) that runs in $\Theta(n^2)$ time and, given three $n\times n$ matrices $A$, $B$, and $C$, will:
- always return true if $A\times B=C$;
- return false with probability at least $1-1/K$ if $A\times B\neq C$.
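Here is a hedged C sketch of the whole verification procedure. It uses plain long long arithmetic and ignores overflow, which a careful implementation would avoid by doing the arithmetic modulo a large prime; the row-major matrix layout is just a convention chosen for this sketch.

#include <stdlib.h>

#define K 1000

/* Freivalds' check in O(n^2) time: pick a random vector r and test
 * whether rAB == rC.  Returns 1 ("probably equal") or 0 ("different").
 * Matrices are n x n in row-major order: A[i*n + j] is A_{i,j}. */
int freivalds(const long long *A, const long long *B, const long long *C, int n)
{
    long long *r   = malloc(n * sizeof(long long));
    long long *rA  = calloc(n, sizeof(long long));   /* rA  = r*A   */
    long long *rAB = calloc(n, sizeof(long long));   /* rAB = r*A*B */
    long long *rC  = calloc(n, sizeof(long long));   /* rC  = r*C   */
    int equal = 1;

    for (int i = 0; i < n; i++)
        r[i] = 1 + rand() % K;                       /* r_i in {1,...,K} */
    for (int i = 0; i < n; i++)                      /* (rA)_j = sum_i r_i A_{i,j} */
        for (int j = 0; j < n; j++)
            rA[j] += r[i] * A[i*n + j];
    for (int i = 0; i < n; i++)                      /* rAB = (rA)*B */
        for (int j = 0; j < n; j++)
            rAB[j] += rA[i] * B[i*n + j];
    for (int i = 0; i < n; i++)                      /* rC = r*C */
        for (int j = 0; j < n; j++)
            rC[j] += r[i] * C[i*n + j];
    for (int j = 0; j < n; j++)
        if (rAB[j] != rC[j]) equal = 0;

    free(r); free(rA); free(rAB); free(rC);
    return equal;
}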
String Matching
Suppose we have a little string $p=p_1,\ldots,p_{m}$ and a big body of text $t=t_1,\ldots,t_{n}$. Here $m<n$ and we want to look for occurrences of $p$ in $t$. More precisely,
- A match is an index $i$ such that $p_j = t_{i+j-1}$ for all $j\in\{1,\ldots,m\}$
We want to find the first match or all matches. The obvious algorithm looks like this:
for i = 1 to n-m+1
    j = 1
    while j <= m and p[j] == t[i+j-1]
        j = j + 1
    if j > m then output i
This algorithm outputs all matches and runs in $O(nm)$ time, and on some instances it runs in $\Omega(nm)$ time, even when there are no matches to output. This happens, for example, when $p=aaaa\ldots aab$ and $t=aaaa\ldots a$.
Kalai's Algorithm
Note that we can think of $p$ and $t$ as sequences of integers in the range $\{0,\ldots,k-1\}$, where $k$ is the alphabet size. Notice that, by the Fingerprint Lemma, if $r\in\{1,\ldots,K\}^m$ is a random vector of length $m$ and $i$ is not a match, then \[ \Pr\{ r\cdot p = r\cdot (t_{i},\ldots,t_{i+m-1}) \} \le 1/K \enspace . \]
We can compute $r\cdot p$ in $O(m)$ time. Now, if we can compute $r\cdot (t_{i},\ldots,t_{i+m-1})$ for all $i\in\{1,\ldots,n-m+1\}$, then we have an efficient string-matching algorithm. This kind of calculation is called the convolution of $r$ and $t$. That is, we want the convolution vector: \[ \begin{array}{rcl}r_1t_1&+\cdots+&r_mt_m,\\ r_1t_2&+\cdots+&r_mt_{m+1}, \\ r_1t_3&+\cdots+&r_mt_{m+2},\\ \vdots &\ddots & \vdots \\ r_1t_{n-m}&+\cdots+&r_mt_{n-1},\\ r_1t_{n-m+1}&+\cdots+&r_mt_{n}\end{array} \] We can compute this convolution vector in $O(n\log n)$ time using the Fast Fourier Transform (which we may discuss later in the course).
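To see exactly what is being computed, here is the naive $O(nm)$-time version of this convolution as a C sketch (0-based arrays); this double loop is precisely what the FFT replaces.

/* conv[i] holds r_1 t_{i+1} + ... + r_m t_{i+m} in the 1-based notation
 * above, i.e., the fingerprint of the length-m window of t starting at
 * 0-based position i.  This naive version takes O(nm) time; the FFT
 * computes the same n-m+1 values in O(n log n) time. */
void convolve_naive(const long long *r, int m,
                    const long long *t, int n, long long *conv)
{
    for (int i = 0; i + m <= n; i++) {
        conv[i] = 0;
        for (int j = 0; j < m; j++)
            conv[i] += r[j] * t[i + j];
    }
}

Kalai's algorithm runs this convolution once (via the FFT) and compares each entry of the result to $r\cdot p$.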
This means that we have an algorithm that runs in $O(n\log n)$ time and returns a list of integers $L$ such that:
- If $i$ is a match, then $i$ is in $L$.
- If $i$ is not a match, then $\Pr\{i\in L\} \le 1/K$.
That's already impressive; what's even more impressive is that this algorithm can also easily handle wildcards: locations in $p$ that match any character. For example, we may have a pattern like "p?st" that matches "past", "post", and "pest". This extension is trivial: at any index where the pattern has a wildcard, we just set the corresponding value in $r$ to zero. Then that character contributes nothing to the fingerprint $r\cdot p$ or to $r\cdot (t_i,\ldots,t_{i+m-1})$.
Theorem: Given a pattern string $p$ of length $m$, possibly with wildcards, and a text string $t$ of length $n$, there is an algorithm that runs in $O(n\log n)$ time and returns a list of integers $L$ such that:
1. If $i$ is a match, then $i$ is in $L$.
2. If $i$ is not a match, then $\Pr\{i\in L\} \le 1/K$.
Of course, if we take $K \ge m$, then we can even verify each value of $i$ in $L$ in $O(m)$ time to see if it really is a match, as in the sketch below. The expected amount of time we will spend on non-matches is only $O(n)$. [Exercise: write this down carefully.]
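For instance, here is a sketch of the $O(m)$-time check for a single candidate index, using 0-based C arrays and '?' as the wildcard character from the example above.

/* Return 1 iff pattern p (length m, where '?' matches any character)
 * occurs in text t starting at 0-based index i.  Takes O(m) time. */
int is_match(const char *t, const char *p, int i, int m)
{
    for (int j = 0; j < m; j++)
        if (p[j] != '?' && p[j] != t[i + j])
            return 0;
    return 1;
}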
The Rabin-Karp Algorithm
As before, we can think of $p$ and $t$ as sequences of integers in the range $\{0,\ldots,k-1\}$, where $k$ is the alphabet size. We can compress $p$ down to a single integer \[ I(p) = \sum_{j=0}^{m-1} k^{m-j-1}\times p_j \] Just think of $p$ as the base-$k$ representation of an integer. For example, if our alphabet is $0,1,2,3,4,5,6,7,8,9$ and $p="10393"$, then $I(p)=10393$. We can compute $I(p)$ in $O(m)$ time:
x = 1
I = 0
for j = 1 to m
    I = I + p[m-j]*x
    x = x*k
We can restate our problem as that of trying to find the indices $i$ where \[ I(p) = I(t_i,\ldots,t_{i+m-1}) \enspace . \]
The nice thing about this is that we only need to compute $I(p)$ once, and we have a nice relation between consecutive length-$m$ substrings of $t$: \[ I(t_{i+1},\ldots,t_{i+m}) = kI(t_i,\ldots,t_{i+m-1}) - t_{i}k^m + t_{i+m} \]
Ip = I(p)
Ir = I(t[0...m-1])
km = k to the power m
for i = 0 to n-m
    if Ip == Ir then output i
    if i < n-m then Ir = k*Ir - t[i]*km + t[i+m]
Great! This looks like a linear-time algorithm. There's only one problem: the values of Ip and Ir are huge. We can't store them in a single machine register, so those multiplications and additions each take (at least) $\Omega(m)$ time. We need to make things smaller still.
Number Theory to the Rescue
To do this, we're going to pick a prime number $P$ and do all our arithmetic modulo $P$, so all the numbers we deal with are in the range $\{0,\ldots,P-1\}$. If $P$ is small enough to fit into a register, then the algorithm definitely runs in $O(n)$ time, but is it correct? Not always: just because \[ I(p)\bmod P = I(t_i,\ldots,t_{i+m-1}) \bmod P \] doesn't mean $I(p)=I(t_i,\ldots,t_{i+m-1})$.
We're going to make sure that it works most of the time. The statement \[ x \bmod P = y \bmod P \] really means $x=aP + b$ and $y=a'P + b$, for some integers $a$, $a'$ and $b$. This means \[ x-y = (a-a')P \enspace . \] So we have to worry about the case where $x-y$ is a nonzero multiple of $P$, which only happens when $P$ is a prime factor of $|x-y|$. How many prime factors can $|x-y|$ have? Each prime factor is at least $2$, so if $|x-y|$ has $q$ distinct prime factors, then $|x-y| \ge 2^q$. This means $q \le \log_2|x-y|$. So $|x-y|$ has at most $\log_2|x-y|$ distinct prime factors.
Now our $x$'s and $y$'s are integers in the range $\{0,\ldots,k^m\}$, so there are at most $\log_2 k^m = m\log_2 k$ prime factors we have to worry about. Now we need a result from number theory:
Theorem: (Prime Number Theorem) The number of primes less than $N$ is about $N/\ln N$.
Hey, that's a lot of primes! We can use this:
Lemma: If $P$ is a prime number chosen uniformly at random from the set of all primes less than $N$ and $I(p) \neq I(t_{i},\ldots,t_{i+m-1})$, then \[ \Pr\{I(p)\bmod P = I(t_{i},\ldots,t_{i+m-1})\bmod P \} \le \frac{m\ln N\log_2 k}{N} \enspace . \]
Proof: We've already given it. We have about $N/\ln N$ choices for $P$, and at most $m\log_2 k$ of them cause $I(p)\bmod P = I(t_{i},\ldots,t_{i+m-1})\bmod P$. ∎
How big do we need $N$ to be? If we take $N$ large enough that $N/\ln N \ge Q m\log_2 k$, then the probability in the lemma becomes at most $1/Q$. Usually, we would take $N$ to be about as big as we could comfortably fit into a machine word, say $N=2^{32}$ or $N=2^{64}$.
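Here is a hedged C sketch of the modular version of the algorithm, with 0-based indexing as in the pseudocode above. It assumes $kP$ fits in a long long; a careful implementation would also verify reported candidates, or accept the small error probability.

#include <stdio.h>

/* Rabin-Karp: report every i with I(p) = I(t[i..i+m-1]) modulo P.
 * Every real match is reported; a non-match is reported only with
 * small probability when P is a random prime.  Uses O(n + m)
 * arithmetic operations. */
void rabin_karp(const long long *t, int n, const long long *p, int m,
                long long k, long long P)
{
    long long Ip = 0, Ir = 0, km = 1;
    for (int j = 0; j < m; j++) {
        Ip = (Ip * k + p[j]) % P;    /* I(p) mod P, by Horner's rule */
        Ir = (Ir * k + t[j]) % P;    /* I(t[0..m-1]) mod P */
        km = (km * k) % P;           /* k^m mod P */
    }
    for (int i = 0; i <= n - m; i++) {
        if (Ip == Ir)
            printf("possible match at %d\n", i);
        if (i < n - m)               /* slide the window by one */
            Ir = ((Ir * k - t[i] * km + t[i + m]) % P + P) % P;
    }
}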
Miller-Rabin Primality Testing
If we want to test if a number $P$ is prime, we can use a bit of number theory. Write $P-1$ as $2^s d$ for integer $s$ and odd $d$. Now, if $P$ is prime, then for every $a\in\{1,\ldots,P-1\}$,
- $a^d \equiv 1\pmod P$; or
- $a^{2^r d} \equiv -1 \pmod P$, for some $r\in\{0,\ldots,s-1\}$.
On the other hand, if $P$ is odd and not prime, then for at least three-quarters of the values of $a\in\{1,\ldots,P-1\}$, both of the following hold:
- $a^d \not\equiv 1\pmod P$; and
- $a^{2^r d} \not\equiv -1 \pmod P$, for all $r\in\{0,\ldots,s-1\}$.
So the test is easy: we pick a random $a$ and check if the first or second case holds. If $P$ is not prime, we will correctly detect this with probability at least $3/4$. If we want to be more certain, then we can repeat the experiment $k$ times; we will then correctly detect composites with probability at least $1-4^{-k}$, using an algorithm that requires only $O(k\log P)$ arithmetic operations modulo $P$.
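Here is a hedged C sketch of one round of the test. It relies on the gcc/clang __uint128_t extension for overflow-free 64-bit modular multiplication, uses rand() as a placeholder source of randomness, and draws $a$ from $\{2,\ldots,P-2\}$, a standard (slightly narrower) convention.

#include <stdint.h>
#include <stdlib.h>

/* (a*b) mod m without overflow, via 128-bit intermediates (gcc/clang). */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
{
    return (uint64_t)((__uint128_t)a * b % m);
}

/* a^d mod m by repeated squaring: O(log d) multiplications. */
static uint64_t powmod(uint64_t a, uint64_t d, uint64_t m)
{
    uint64_t result = 1;
    for (a %= m; d > 0; d >>= 1) {
        if (d & 1) result = mulmod(result, a, m);
        a = mulmod(a, a, m);
    }
    return result;
}

/* One round of Miller-Rabin for odd P > 3.  Returns 0 only if P is
 * certainly composite; returns 1 if P passes this round (P is prime,
 * or we were unlucky, which happens with probability at most 1/4). */
int miller_rabin_round(uint64_t P)
{
    uint64_t d = P - 1;
    int s = 0;
    while (d % 2 == 0) { d /= 2; s++; }      /* P - 1 = 2^s * d, d odd */

    uint64_t a = 2 + rand() % (P - 3);       /* random a in {2,...,P-2} */
    uint64_t x = powmod(a, d, P);
    if (x == 1 || x == P - 1) return 1;      /* a^d = 1 or -1 (mod P) */
    for (int r = 1; r < s; r++) {
        x = mulmod(x, x, P);                 /* x = a^(2^r * d) mod P */
        if (x == P - 1) return 1;            /* found a^(2^r d) = -1 */
    }
    return 0;                                /* a witnesses compositeness */
}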
Now the Rabin-Karp string-matching algorithm wants a prime chosen uniformly at random from among the primes in $\{2,\ldots,N\}$. We can get such a random prime by repeatedly choosing a random number in $\{2,\ldots,N\}$ and testing if that number is prime using the Miller-Rabin Algorithm. The Prime Number Theorem says that we can expect to find a prime after about $\ln N$ guesses.
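Putting the pieces together gives a sketch like the following. It reuses miller_rabin_round from the block above; the round count of 30 is an arbitrary choice (error probability at most $4^{-30}$ per accepted candidate), and random_in_range is a hypothetical uniform sampler.

/* Pick a (probable) prime uniformly at random from {2,...,N} by
 * rejection sampling.  By the Prime Number Theorem we expect to
 * test about ln N candidates before finding a prime.
 * Assumes miller_rabin_round from the sketch above is in scope. */
uint64_t random_prime(uint64_t N)
{
    for (;;) {
        uint64_t P = random_in_range(2, N);  /* hypothetical uniform sampler */
        if (P == 2 || P == 3) return P;      /* small cases the test skips */
        if (P % 2 == 0) continue;            /* even numbers > 2 aren't prime */
        int passed = 1;
        for (int i = 0; i < 30 && passed; i++)
            if (!miller_rabin_round(P)) passed = 0;
        if (passed) return P;
    }
}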
Theorem: There is a randomized algorithm that performs an expected $O(k\log^2 N)$ modular arithmetic operations on integers of maximum value $N$ and that chooses a prime number uniformly at random from among the primes in $\{2,\ldots,N\}$.
Note that we're counting modular arithmetic operations here rather than running time because, for really big values of $N$, like those used in public-key cryptography, each arithmetic operation takes more than constant time.