COMP4804: Algorithms II
$\newcommand{\R}{\mathbb{R}}\DeclareMathOperator{\E}{\mathbf{E}}\DeclareMathOperator{\deg}{deg}$
Fingerprinting

Here we look at a family of algorithmic techniques collectively called fingerprinting.

Freivalds' Algorithm

Remember the obvious $\Theta(n^3)$ algorithm for multiplying two $n\times n$ matrices, $A$ and $B$:

Multiply(A, B, n)
  for i = 1 to n
    for j = 1 to n
      C[i][j] = 0
      for k = 1 to n
        C[i][j] = C[i][j] + A[i][k]*B[k][j]

Suppose, instead, that someone gives us three $n\times n$ matrices $A$, $B$, and $C$ and claims that \[ A \times B = C \enspace. \] How can we verify that this is true?

The first thing to remember is that, for fixed $a,b\in\R$, $a\neq 0$, the equation $ax=b$ has exactly one solution for $x$ (namely $x=b/a$).

a non-trivial linear equation in one variable has exactly one solution

Recall that the dot product of two vectors $v=(v_1,\ldots,v_n)$ and $w=(w_1,\ldots,w_n)$ is \[ v\cdot w = \sum_{i=1}^n v_i\times w_i \enspace . \]

Fingerprint Lemma: Let $v=(v_1,\ldots,v_n)$ and $w=(w_1,\ldots,w_n)$ be two real-valued vectors with $v\neq w$, and let $r=(r_1,\ldots,r_n)$ be a vector whose values are integers chosen uniformly and independently at random from $\{1,\ldots,K\}$. Then \[ \Pr\{r\cdot v = r\cdot w\} \le 1/K \enspace . \]

Proof: Since $v\neq w$, there is at least one index $i$ such that $v_i\neq w_i$. Now, rewrite $r\cdot v = r\cdot w$ as \[ r_i(v_i-w_i) = \sum_{j\neq i} r_j(w_j - v_j) \enspace . \] With everything on the right-hand side fixed, there is only one value of $r_i$ that satisfies this equation, namely \[ r_i' = \frac{\sum_{j\neq i} r_j(w_j - v_j)}{v_i-w_i} \enspace . \] But $r_i$ is chosen uniformly at random from among $K$ different values, so the probability that $r_i=r_i'$ is at most $1/K$. ∎

We can think of $r\cdot v$ and $r\cdot w$ as the fingerprints of $v$ and $w$; they are small quantities (a single number each) that are almost surely different if $v$ and $w$ are different.
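As a quick illustration, here is a minimal C sketch of the lemma in action; the vectors and the choice $K=1000$ are made up for the example:

#include <stdio.h>
#include <stdlib.h>

/* Dot product of two real n-vectors. */
double dot(const double *a, const double *b, int n) {
    double s = 0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

int main(void) {
    enum { N = 5, K = 1000 };
    double v[N] = {1, 2, 3, 4, 5};
    double w[N] = {1, 2, 3, 4, 6};  /* differs from v in the last entry */
    double r[N];
    for (int i = 0; i < N; i++)
        r[i] = 1 + rand() % K;      /* uniform in {1,...,K} (demo only) */
    /* By the Fingerprint Lemma, these agree with probability <= 1/K. */
    printf("r.v = %g, r.w = %g\n", dot(r, v, N), dot(r, w, N));
    return 0;
}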

Remember that we can multiply an $n$-vector $r$ by an $n\times n$ matrix $A$ in $O(n^2)$ time. You can read the code right off the formula for vector-matrix multiplication: If $x = rA$, then \[ x_i = \sum_{j=1}^n r_j A_{j,i} \]

for i = 1 to n
  x[i] = 0
  for j = 1 to n
    x[i] = x[i] + r[j]*A[j][i]

The result of this multiplication is another $n$-vector. That means we can compute $r\times A\times B$ in $O(n^2)$ time; first compute $v=rA$ and then compute $vB=rAB$. We can also compute $rC$ in $O(n^2)$ time. Clearly, if $A\times B = C$, then $r\times A \times B = r\times C$.

Lemma: If $A\times B\neq C$, and $r$ is chosen as above, then \[ \Pr\{rAB = rC\} \le 1/K \]

Proof: Let $X=AB$. Then, since $X\neq C$, there must be some indices $i$ and $j$ such that $X_{i,j} \neq C_{i,j}$. In particular, column $j$ of $X$ is not equal to column $j$ of $C$. But $(rX)_j$ is just the dot product of $r$ with column $j$ of $X$, and $(rC)_j$ is the dot product of $r$ with column $j$ of $C$. Therefore, by the Fingerprint Lemma, \[ \Pr\{rAB = rC\} \le \Pr\{(rAB)_j = (rC)_j\} \le 1/K \enspace . \] ∎

To summarize, we have an algorithm (Freivalds' Algorithm) that runs in $\Theta(n^2)$ time and, given three $n\times n$ matrices $A$, $B$, and $C$, will:

  1. always answer "yes" if $A\times B = C$; and
  2. answer "yes" with probability at most $1/K$ if $A\times B \neq C$.
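Here is a minimal C sketch of how such a check might be implemented (the row-major matrix layout and the parameter $K$ are assumptions made for this example):

#include <stdlib.h>

/* Freivalds' check: returns 1 if rAB == rC for a random r, 0 otherwise.
 * Matrices are n x n, row-major: A[i*n + j] holds A_{i,j}.
 * If AB == C this always returns 1; if AB != C it returns 1 with
 * probability at most 1/K. Comparisons are exact for integer-valued
 * matrices; general real inputs would need a tolerance. */
int freivalds(const double *A, const double *B, const double *C, int n, int K) {
    double *r = malloc(n * sizeof *r);
    double *v = calloc(n, sizeof *v);   /* v = rA   */
    double *x = calloc(n, sizeof *x);   /* x = rAB  */
    double *y = calloc(n, sizeof *y);   /* y = rC   */
    for (int i = 0; i < n; i++)
        r[i] = 1 + rand() % K;          /* uniform in {1,...,K} */
    for (int i = 0; i < n; i++)         /* v = rA in O(n^2) time */
        for (int j = 0; j < n; j++)
            v[j] += r[i] * A[i*n + j];
    for (int i = 0; i < n; i++)         /* x = vB and y = rC in O(n^2) time */
        for (int j = 0; j < n; j++) {
            x[j] += v[i] * B[i*n + j];
            y[j] += r[i] * C[i*n + j];
        }
    int ok = 1;
    for (int j = 0; j < n; j++)
        if (x[j] != y[j]) ok = 0;
    free(r); free(v); free(x); free(y);
    return ok;
}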

String Matching

Suppose we have a little string $p=p_1,\ldots,p_{m}$ and a big body of text $t=t_1,\ldots,t_{n}$. Here $m<n$ and we want to look for occurrences of $p$ in $t$. More precisely, we say that an index $i$ is a match if $p_j = t_{i+j-1}$ for every $j\in\{1,\ldots,m\}$.

We want to find the first match or all matches. The obvious algorithm looks like this:

for i = 1 to n-m+1
    j = 1
    while j <= m and p[j] = t[i+j-1]
        j = j + 1
    if j > m then output i

This algorithm outputs all matches and runs in $O(nm)$ time, and on some instances it runs in $\Omega(nm)$ time, even when there are no matches to output. This happens, for example, when $p=aaaa\ldots aab$ and $t=aaaa\ldots a$.

Kalai's Algorithm

Note that we can think of $p$ and $t$ as sequences of integers in the range $\{0,\ldots,k-1\}$ where $k$ is the alphabet size. Notice that, if we have a random vector $r\in\{1,\ldots,K\}^m$ of length $m$, then, by the Fingerprint Lemma, if $i$ is not a match, then \[ \Pr\{ r\cdot p = r\cdot (t_{i},\ldots,t_{i+m-1}) \} \le 1/K \enspace . \]

We can compute $r\cdot p$ in $O(m)$ time. Now, if we can compute $r\cdot (t_{i},\ldots,t_{i+m-1})$ for all $i\in\{1,\ldots,n-m+1\}$, then we have an efficient string-matching algorithm. This kind of calculation is called the convolution of $r$ and $t$. That is, we want the convolution vector: \[ \begin{array}{rcl}r_1t_1&+\cdots+&r_mt_m\\ r_1t_2&+\cdots+&r_mt_{m+1}\\ r_1t_3&+\cdots+&r_mt_{m+2}\\ \vdots &\ddots & \vdots \\ r_1t_{n-m}&+\cdots+&r_mt_{n-1}\\ r_1t_{n-m+1}&+\cdots+&r_mt_{n}\end{array} \] We can compute this convolution vector in $O(n\log n)$ time using the Fast Fourier Transform (which we may discuss later in the course).
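To make this concrete, here is a C sketch of the fingerprinting step; for simplicity it computes the convolution naively in $O(nm)$ time, where the FFT would bring the total down to $O(n\log n)$. Using ASCII codes as the integer alphabet, and the name kalai_match, are choices made for this example:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Report every index i (1-based) whose fingerprint matches the pattern's.
 * Every real match is reported; a non-match is reported with probability
 * at most 1/K. A wildcard '?' in p gets r[j] = 0, so it never contributes. */
void kalai_match(const char *p, const char *t, int K) {
    int m = strlen(p), n = strlen(t);
    long long *r = malloc(m * sizeof *r);
    long long fp = 0;                       /* fingerprint r.p */
    for (int j = 0; j < m; j++) {
        r[j] = (p[j] == '?') ? 0 : 1 + rand() % K;
        fp += r[j] * p[j];
    }
    for (int i = 0; i + m <= n; i++) {      /* naive convolution: O(nm) */
        long long ft = 0;                   /* r.(t_i,...,t_{i+m-1}) */
        for (int j = 0; j < m; j++)
            ft += r[j] * t[i + j];
        if (ft == fp)
            printf("possible match at %d\n", i + 1);
    }
    free(r);
}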

This means that we have an algorithm that runs in $O(n\log n)$ time and returns a list of integers $L$ such that:

  1. If $i$ is a match, then $i$ is in $L$.
  2. If $i$ is not a match, then $\Pr\{i\in L\} \le 1/K$.

That's already impressive; what's even more impressive is that this algorithm can also easily handle wildcards: these are locations in $p$ that match any character. For example, we may have a pattern like "p?st" that matches "past", "post", and "pest". This extension is trivial: at any index where the pattern has a wildcard, we just set the corresponding value in $r$ to zero. Then this character does not contribute to the fingerprint $r\cdot p$ or to $r\cdot (t_i,\ldots,t_{i+m-1})$.

Theorem: Given a pattern string $p$ of length $m$, possibly with wildcards, and a text string $t$ of length $n$, there is an algorithm that runs in $O(n\log n)$ time and returns a list of integers $L$ such that:

  1. If $i$ is a match, then $i$ is in $L$.
  2. If $i$ is not a match, then $\Pr\{i\in L\} \le 1/K$.

Of course, if we take $K \ge m$, then we can verify each value of $i$ in $L$ in $O(m)$ time to see if it really is a match. The expected amount of time spent verifying non-matches is then only $O(n)$. [Exercise: write this down carefully.]
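For concreteness, a sketch of the $O(m)$ verification of a single candidate index, following the wildcard convention from the example above:

/* O(m) check of a single candidate index i (1-based): is it a real match?
 * '?' in p is treated as a wildcard, as in the sketch above. */
int verify(const char *p, int m, const char *t, int i) {
    for (int j = 0; j < m; j++)
        if (p[j] != '?' && p[j] != t[i - 1 + j])
            return 0;
    return 1;
}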

The Rabin-Karp Algorithm


Now we can think of $p$ and $t$ as sequences of integers in the range $\{0,\ldots,k-1\}$ where $k$ is the alphabet size. We can compress $p$ down to a single integer \[ I(p) = \sum_{j=0}^{m-1} k^{m-j-1}\times p_j \enspace , \] indexing $p$ from $0$ here. Just think of $p$ as the base-$k$ representation of an integer. For example, if our alphabet is $0,1,2,3,4,5,6,7,8,9$ and $p="10393"$, then $I(p)=10393$. We can compute $I(p)$ in $O(m)$ time:

x = 1   // invariant: x = k^(j-1) at the start of iteration j
I = 0
for j = 1 to m
    I += p[m-j]*x
    x = x*k

We can restate our problem as that of trying to find the indices $i$ where \[ I(p) = I(t_i,\ldots,t_{i+m-1}) \enspace . \]

The nice thing about this is that we only need to compute $I(p)$ once, and we have a nice relation between consecutive substrings of $t$: \[ I(t_{i+1},\ldots,t_{i+m}) = kI(t_i,\ldots,t_{i+m-1}) - t_{i}k^m + t_{i+m} \enspace . \] Multiplying by $k$ shifts every digit left, subtracting $t_ik^m$ removes the leading digit, and adding $t_{i+m}$ appends the new trailing digit.

Ip = I(p)
Ir = I(t[0..m-1])
km = k to the power m
for i = 0 to n-m
    if Ip == Ir then output i
    if i < n-m then Ir = k*Ir - t[i]*km + t[i+m]   // slide the window right by one

Great! This looks like a linear-time algorithm. There's only one problem: the values of Ip and Ir are huge. We can't store them in a single machine register, so those multiplications and additions each take (at least) $\Omega(m)$ time. We need to make things smaller still.

Number Theory to the Rescue

To do this, we're going to pick a prime number $P$ and do all our arithmetic modulo $P$, so all the numbers we deal with are in the range $\{0,\ldots,P-1\}$. If $P$ is small enough to fit into a register, then the algorithm definitely runs in $O(n)$ time, but is it correct? Not always: just because \[ I(p)\bmod P = I(t_i,\ldots,t_{i+m-1}) \bmod P \] doesn't mean $I(p)=I(t_i,\ldots,t_{i+m-1})$.

We're going to make sure that it works with high probability. The statement \[ x \bmod P = y \bmod P \] really means $x=aP + b$ and $y=a'P + b$, for some integers $a$, $a'$, and $b$. This means \[ x-y = (a-a')P \enspace , \] so we have to worry about the case where $x-y$ is a non-zero multiple of $P$. This only happens when $P$ is a prime factor of $|x-y|$. How many prime factors does $x-y$ have? Each prime factor is at least $2$, so if $|x-y|$ has $q$ distinct prime factors, then $|x-y| \ge 2^q$. This means $q \le \log_2|x-y|$. So $|x-y|$ has at most $\log_2|x-y|$ distinct prime factors.

Now our $x$'s and $y$'s are integers in the range $\{0,\ldots,k^m\}$, so there are only $m\log_2 k$ prime factors we have to worry about. Now we need a result from number theory:

Theorem: (Prime Number Theorem) The number of primes less than $N$ is about $N/\ln N$.

Hey, that's a lot of primes! We can use this:

Lemma: If $P$ is a prime number chosen uniformly at random from the set of all primes less than $N$ and $I(p) \neq I(t_{i},\ldots,t_{i+m-1})$, then \[ \Pr\{I(p)\bmod P = I(t_{i},\ldots,t_{i+m-1})\bmod P \} \le \frac{m\ln N\log_2 k}{N} \]

Proof: We've essentially already given it: there are about $N/\ln N$ choices for $P$, and at most $m\log_2 k$ of them cause $I(p)\bmod P = I(t_{i},\ldots,t_{i+m-1})\bmod P$. ∎

How big do we need $N$ to be? If we take $N$ large enough that $N/\ln N \ge Q m\log_2 k$, then the probability in the lemma becomes at most $1/Q$. Usually, we would take $N$ to be about as big as we can comfortably fit into a machine word: $N=2^{32}$ or $N=2^{64}$.
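Here is a sketch of how the whole algorithm might look in C; it assumes the prime $P$ is less than $2^{32}$ (so every intermediate product fits in 64 bits), that characters are bytes (so $k=256$), and that candidates are verified explicitly to filter out fingerprint collisions:

#include <stdio.h>
#include <string.h>

typedef unsigned long long u64;

/* Rabin-Karp: print all indices i (0-based) where p occurs in t.
 * All arithmetic is modulo a prime P < 2^32. Fingerprint collisions
 * are filtered out by an explicit O(m) comparison. */
void rabin_karp(const char *p, const char *t, u64 P) {
    const u64 k = 256;                 /* alphabet: bytes */
    int m = strlen(p), n = strlen(t);
    if (m > n) return;
    u64 Ip = 0, Ir = 0, km = 1;        /* km = k^m mod P */
    for (int j = 0; j < m; j++) {
        Ip = (k * Ip + (unsigned char)p[j]) % P;
        Ir = (k * Ir + (unsigned char)t[j]) % P;
        km = (k * km) % P;
    }
    for (int i = 0; i + m <= n; i++) {
        /* Fingerprints agree: verify to rule out a collision. */
        if (Ip == Ir && memcmp(p, t + i, m) == 0)
            printf("match at %d\n", i);
        if (i + m < n)                 /* slide the window */
            Ir = (k*Ir + P*k - km * (unsigned char)t[i] % P
                  + (unsigned char)t[i + m]) % P;
    }
}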

Miller-Rabin Primality Testing

If we want to test whether a number $P$ is prime, we can use a bit of number theory. Write $P-1$ as $2^s d$ for an integer $s$ and odd $d$. Now, if $P$ is prime, then for every $a\in\{1,\ldots,P-1\}$, either \[ a^{d} \equiv 1 \pmod{P} \] or \[ a^{2^r d} \equiv -1 \pmod{P} \enspace \text{for some $r\in\{0,\ldots,s-1\}$.} \]

On the other hand, if $P$ is odd and not prime, then for at least three-quarters of the values of $a\in\{1,\ldots,P-1\}$, neither of these two conditions holds.

So the test is easy: we pick a random $a$ and check whether one of the two conditions above holds. If $P$ is not prime, we will correctly detect this with probability at least $3/4$. If we want to be more certain, we can repeat the experiment $k$ times, and then we correctly detect composites with probability at least $1-4^{-k}$, using an algorithm that requires only $O(k\log P)$ arithmetic operations modulo $P$.
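Here is a hedged C sketch of this test; unsigned __int128 (a GCC/Clang extension) is used for overflow-free modular multiplication, and rand() stands in for a proper source of randomness:

#include <stdlib.h>

typedef unsigned long long u64;

/* (a*b) mod P without overflow, for P < 2^63. */
static u64 mulmod(u64 a, u64 b, u64 P) { return (unsigned __int128)a * b % P; }

static u64 powmod(u64 a, u64 d, u64 P) {       /* a^d mod P */
    u64 r = 1;
    for (; d; d >>= 1, a = mulmod(a, a, P))
        if (d & 1) r = mulmod(r, a, P);
    return r;
}

/* One round of Miller-Rabin: returns 0 if a witnesses that P is composite. */
static int miller_rabin_round(u64 P, u64 a) {
    u64 d = P - 1;
    int s = 0;
    while (d % 2 == 0) { d /= 2; s++; }        /* P - 1 = 2^s * d, d odd */
    u64 x = powmod(a, d, P);
    if (x == 1 || x == P - 1) return 1;        /* a^d = +-1 (mod P) */
    for (int r = 1; r < s; r++) {
        x = mulmod(x, x, P);
        if (x == P - 1) return 1;              /* a^(2^r d) = -1 (mod P) */
    }
    return 0;                                  /* neither case: composite */
}

/* Repeat k rounds: composites slip through with probability <= 4^(-k). */
int is_probably_prime(u64 P, int k) {
    if (P < 4) return P == 2 || P == 3;
    if (P % 2 == 0) return 0;
    while (k--) {
        u64 a = 1 + rand() % (P - 1);          /* random a in {1,...,P-1} */
        if (!miller_rabin_round(P, a)) return 0;
    }
    return 1;
}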

Now the Rabin-Karp string-matching algorithm wants a prime chosen uniformly at random from among $\{2,\ldots,N\}$. We can get such a random prime by repeatedly choosing a random number in $\{2,\ldots,N\}$ and testing if that number is prime using the Miller-Rabin Algorithm. The Prime Number Theorem says that we can expect to find a prime after about $\ln N$ guesses.
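The rejection-sampling loop might then look like this (again a sketch; it reuses is_probably_prime from the previous block):

/* Choose a (probable) prime uniformly at random from {2,...,N} by
 * rejection sampling. By the Prime Number Theorem, about ln N
 * attempts are expected. */
u64 random_prime(u64 N, int k) {
    for (;;) {
        u64 P = 2 + rand() % (N - 1);   /* uniform in {2,...,N} (demo) */
        if (is_probably_prime(P, k))
            return P;
    }
}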

Theorem: There is a randomized algorithm that performs an expected $O(k\log^2 N)$ modular arithmetic operations on integers of maximum value $N$ and that chooses a prime number uniformly at random from $\{2,\ldots,N\}$ (with error probability at most $4^{-k}$).

Note that we're counting modular arithmetic operations here, rather than running time, because for really big values of $N$, like those used in public-key cryptography, each arithmetic operation takes more than constant time.