This web site provides a list of errors discovered by readers in

Computing Patterns in Strings

by

Bill Smyth.

Suggestions for improvement are also included, and further
reader participation is welcomed.

(1) Esko Ukkonen points out that reference [U92a] would better have been

On-line construction of suffix trees,
Algorithmica 14 (1995) 249-260.

(2) Kangmin Fan finds two typos in Chapter 12:

p. 333 On the third last line, the displayed equation should read f_6[5-1] = b.
p. 352 On line 5, the end of the displayed material should read (w_4 - w_5).

And on p. 350, conjecture (3) should say "string x[1..n]" so that a length is specified; while at the bottom of p. 355, the factor w_3 should be abababa.

(3) On page 91, in line 2 of the second paragraph under "Freedom from Backtracking", the range should be i-k+1..i.

On p. 98, instead of ">" in equation (4.8), there should be "<".

On p. 100, Exercise 4.2.8, a stronger result actually holds: the factor "alpha" can be removed from both the time and space complexity expressions.

(4) In 2003 four papers were published [KS03,KSPP03,KA03,SKPP03] that collectively seem to establish the superiority of the suffix array over the suffix tree as a data structure. Thus, if I were writing Chapter 5 today instead of in 2000/2001, I believe I would take a completely different approach: presenting suffix arrays as the main data structure, constructible and searchable in linear time, with suffix tree algorithms mentioned mainly for their historical interest. It is striking that all four of the papers mentioned above were inspired by Farach's linear-time suffix tree construction algorithm [F97], but while Farach's algorithm is very complicated, at least three of the new suffix array algorithms turn out to be relatively simple -- in fact, the main algorithm in [KS03] takes scarcely more than a page to explain. Of the four new algorithms, no fewer than three [KS03,KA03,KSPP03] are linear-time suffix array construction algorithms, while the fourth [SKPP03] permits search of a suffix array in time linear in the pattern length.

Also of great interest in this context is [AKO04], in which it is shown that "every algorithm that uses a suffix tree as data structure can systematically be replaced with an algorithm that uses an enhanced suffix array and solves the same problem in the same time complexity".
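
To make the idea concrete, here is a small Python sketch (my own code, not any of the algorithms cited below): the suffix array of x lists the starting positions of the suffixes of x in lexicographic order, and a pattern can then be located by binary search over that order. The construction shown is the naive O(n^2 log n) one; the papers below achieve linear time.

# Naive illustration only -- not the linear-time constructions cited below.
def suffix_array(x):
    """Starting positions of the suffixes of x, in lexicographic order."""
    return sorted(range(len(x)), key=lambda i: x[i:])

def find_occurrence(x, sa, p):
    """Return the start of one occurrence of p in x, or -1, by binary search over sa."""
    lo, hi = 0, len(sa)
    while lo < hi:                               # find the first suffix whose prefix >= p
        mid = (lo + hi) // 2
        if x[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sa) and x[sa[lo]:sa[lo] + len(p)] == p:
        return sa[lo]
    return -1

x = "abracadabra"
sa = suffix_array(x)                             # [10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
print(find_occurrence(x, sa, "bra"))             # 8 (one occurrence; another starts at 1)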

[AKO04] Mohamed Ibrahim Abouelhoda, Stefan Kurtz & Enno Ohlebusch, Replacing suffix
trees with enhanced suffix arrays, J. Discrete Algs. 2 (2004) 53-86.
[F97] Martin Farach, Optimal suffix tree construction with large alphabets,
Proc. 38th Annual IEEE Symp. Foundations of Computer Science (1997) 137-143.
[KS03] Juha Karkkainen & Peter Sanders, Simple linear work suffix array construction,
Proc. 30th Internat. Colloq. Automata, Languages & Programming (2003) 943-955.
[KSPP03] Dong Kyue Kim, Jeong Seop Sim, Heejin Park & Kunsoo Park, Linear-time
construction of suffix arrays, Proc. Fourteenth Annual Symp. Combinatorial
Pattern Matching (2003) 186-199.
[KA03] Pang Ko & Srinivas Aluru, Space efficient linear time construction of suffix
arrays, Proc. Fourteenth Annual Symp. Combinatorial Pattern Matching (2003)
200-210.
[SKPP03] Jeong Seop Sim, Dong Kyue Kim, Heejin Park & Kunsoo Park, Linear-time
search in suffix arrays, Proc. Fourteenth Australasian Workshop on
Combinatorial Algorithms (2003) 139-146.

(5) There is a caveat to this apparent advantage of suffix arrays, however. Experimental evidence has been presented [ARSTY04] that the new linear-time algorithms provide little or no advantage over algorithms such as Sadakane's [Sa98], which have supralinear worst-case time but are nevertheless linear in cases of practical interest. On the other hand, it may be that careful implementation of the new algorithms [LP04] can make enough difference to be worthwhile. A recent study [PST05a] indicates, however, that the recursion required by the linear-time algorithms may be the determining and unavoidable factor in keeping them slower in practice than the direct algorithms based on the Burrows-Wheeler transformation [BW94]. An even more recent survey [PST06] provides an overview of all suffix array construction algorithms known as of June 2006, including asymptotic time requirements, actual space requirements (in bytes), and comparisons of time and space usage on strings that typically arise in practice. As discussed further in [ARSTY04], it also seems that the use of suffix arrays (and of course also suffix trees) for pattern matching is not efficient except in cases where many patterns need to be matched: the methods developed in [SKPP03] require preprocessing that is much less efficient than the bit-map processing of agrep.

[BW94] M. Burrows & D. J. Wheeler, A Block-Sorting Lossless Data Compression
Algorithm, Digital Systems Research Report 124 (1994).
[ARSTY04] Antonitio, P. J. Ryan, W. F. Smyth, Andrew Turpin & Xiaoyang Yu, New suffix
array algorithms -- linear but not fast?, Proc. Fifteenth Australasian
Workshop Combinatorial Algorithms (2004) 148-156.
[LP04] Sunglim Lee & Kunsoo Park, Efficient implementation of suffix array construction
algorithms, Proc. Fifteenth Australasian Workshop Combinatorial Algorithms
(2004) 64-72.
[PST05a] Simon Puglisi, W. F. Smyth & Andrew Turpin, The performance of linear time
suffix sorting algorithms, Proc. Data Compression Conference (2005) 358-367.
[PST06] Simon Puglisi, W. F. Smyth & Andrew Turpin, A taxonomy of suffix array
construction algorithms, ACM Computing Surveys, to appear (2006).
[Sa98] Kunihiko Sadakane, A fast algorithm for making suffix arrays and for Burrows-
Wheeler transformation, Proc. IEEE Data Compression Conference (1998) 129-138.

(6) In the displayed text on page 396, of course "approxi-mate" should be "approximate".

(7) On page 264, the summary of distance/LCS algorithms could have included mention of the "obverse" to the LCS problem introduced in

[RT98] Anatoly R. Rubinov & Vadim G. Timkovsky, String noninclusion optimization
problems, SIAM J. Discrete Mathematics 11-3 (1998) 456-467.

(8) Thierry Lecroq notes that on page 126, in the second paragraph of the proof of Theorem 5.2.6, "represenation" should be "representation".

(9) At the time of publication of the book, I was unaware of a paper [AKO02] that described an efficient and simple algorithm for computing all the repeats in a given string in time linear in the string length. Unlike the method described in Subsection 13.2.2, it requires only the construction of a single enhanced suffix array and thus constitutes a major improvement.

[AKO02] Mohamed Ibrahim Abouelhoda, Stefan Kurtz & Enno Ohlebusch, The enhanced
suffix array and its application to genome analysis, Proc. Second
Workshop on Algorithms in Bioinformatics, LNCS 2452 (2002) 449-463.

(10) Gerth Stolting Brodal points out that the binary search, over which so much fuss was made on page 150, in fact does not work! I was trying to make some use of the property that no two elements in the array are equal, but got it wrong. Jon Bentley [B88] would rightly be ashamed of me. I believe the easiest way to fix Algorithms 5.3.2 & 5.3.3 is to make the following changes to both of them (a corrected sketch appears after the [B88] reference below):

* in the first line, change L := 1; R := n to L := 0; R := n + 1
* in the final line, change L = M to R - L = 1

My apologies to all right-thinking computer scientists.

[B88] Jon Bentley, More Programming Pearls, Addison-Wesley (1988), esp. Column 3.
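
To make the corrected invariants concrete, here is a small Python sketch written in the spirit of the two changes above; it is my own code and naming, not the book's Algorithms 5.3.2 & 5.3.3, and it assumes the elements are pairwise distinct.

def binary_search(a, key):
    """Return the 1-based position of key in the sorted list a of distinct items, or 0."""
    n = len(a)
    L, R = 0, n + 1                 # first line of the fix: L := 0; R := n + 1
    while R - L > 1:                # final line of the fix: stop when R - L = 1
        M = (L + R) // 2            # 1 <= M <= n, so a[M] always exists
        if a[M - 1] < key:          # a is 0-based in Python; the algorithm is 1-based
            L = M
        else:
            R = M
    # on exit, conceptually a[L] < key <= a[R], with a[0] = -infinity, a[n+1] = +infinity
    if R <= n and a[R - 1] == key:
        return R
    return 0

print(binary_search([2, 3, 5, 7, 11], 7))   # 4
print(binary_search([2, 3, 5, 7, 11], 6))   # 0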

(11) On page 351 I have three times (in lines 5, 12 & 14) used u_R when I meant to use r_R. Ten pages later, in the table at the middle of page 361, the strings x, beta, gamma should be set in boldface.

(12) Interesting extensions of the approximate periods algorithms discussed in Section 13.4 have been made available for inspection and use at

http://www.uncg.edu/mat/pattern/

(13) Exercise 1.2.11 has been solved! Shu Wang located the following interesting paper:

[FHS04] Abraham Flaxman, Aram W. Harrow & Gregory B. Sorkin, Strings with maximally
many distinct subsequences and substrings, Electronic J. Combinatorics 11
(2004).

Also in Chapter 1, Grzegorz Herman remarks that in the list of rules given at the top of page 4, the use of "unique" in rules (1) & (2) is misleading, since it suggests that p(x) (respectively, s(x)) cannot equal p(y) (respectively, s(y)) for any y <> x. This was not my intention, since rules (3) & (4) together eliminate this possibility. It is therefore desirable to replace "a unique" in rules (1) & (2) with "exactly one".

Nick James questions the need for rule (0) and, in general, the references to "element with label x" rather than just "element x". The only answer I could give was that I defined it as I did because I am a computer scientist! I know that a string (and a tree, and a graph) can be "unlabelled", but they cannot be unlabelled in a computer -- in order to traverse the structure, each element in it must be somehow identifiable. As stated, these rules are my way of defining "concatenation" (mentioned in the next section but not redefined).

In Section 1.3 there is a gap in the formulation of Exercise 1.3.6. In order to solve this exercise, one needs a lemma that does not appear explicitly anywhere:

If b and b' are both borders of x with |b| < |b'|, then b is a border of b'.

This lemma should be introduced as part (b) of Exercise 1.3.4.
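
The lemma can be illustrated with a small Python sketch (my own code and names, not the book's): every nonempty proper border of x is obtained by repeatedly taking the longest border of the previous one, so in particular a shorter border b is a border of any longer border b'.

def border_array(x):
    """beta[i] = length of the longest proper border of x[0..i] (standard KMP-style computation)."""
    beta = [0] * len(x)
    b = 0
    for i in range(1, len(x)):
        while b > 0 and x[i] != x[b]:
            b = beta[b - 1]
        if x[i] == x[b]:
            b += 1
        beta[i] = b
    return beta

def all_borders(x):
    """All nonempty proper borders of x, longest first."""
    beta, borders = border_array(x), []
    b = beta[-1] if x else 0
    while b > 0:
        borders.append(x[:b])
        b = beta[b - 1]
    return borders

print(all_borders("abacaba"))   # ['aba', 'a'] -- and 'a' is indeed a border of 'aba'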

(14) We now turn to a collection of errors and recommendations for improvement kindly contributed by Thomas Mailund & Christian Pedersen of BRICS, an institute of the Universities of Aarhus & Aalborg, Denmark. Starting in October 2004, Mailund & Pedersen are giving a course, "String Algorithms", based on my book. They have combed carefully through Chapters 5-9 and found many interesting things. I deal with their comments below, chapter by chapter (occasionally adding one or two comments by others as they occur in page order):

(14.1) Chapter 5

Page 112, Exercise 5.1.1
The use of "amortized" here is intended to refer to the fact that, although k
steps will be required to access the k-th sibling individually, nevertheless all
m children of any parent can be accessed in m steps, one step per child.
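
As a small illustration (my own sketch and field names, not the book's representation), in a left-child / right-sibling structure reaching the k-th child alone follows k sibling links, but walking through all m children follows only m links in total:

class Node:
    def __init__(self, label):
        self.label = label
        self.first_child = None    # leftmost child
        self.next_sibling = None   # next child of the same parent

def children(parent):
    """Yield every child of parent, one sibling link per child -- m steps for m children."""
    child = parent.first_child
    while child is not None:
        yield child
        child = child.next_sibling

root = Node("root")
root.first_child = Node("a")
root.first_child.next_sibling = Node("b")
print([c.label for c in children(root)])   # ['a', 'b']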

Page 116
In the displayed material two-thirds down the page, I overzealously refer to a
"nonempty" letter -- let us hope that every letter is nonempty! Also, the search
described in the last paragraph should read "searching for a prefix uv of some
suffix of x".

Page 118
Toward the bottom of the page, two exceptions are listed to the "normal" case, in
which s(parent(head(i))) is used as the starting point in the search for head(i+1).
The first exception is easy: x[i] = lambda, where for all j < i, x[j] <> lambda.
The second exception is not mysterious, but perhaps requires a clearer explanation:
this is the case where k consecutive lambdas have occurred to the left of position
i in x and now occur again as head(i). We may suppose that the original lambdas were
followed on the right by a letter mu <> lambda; thus there exists an internal node
that corresponds to k-1 consecutive lambdas, and it is this node that should be used
as the starting point in the search for head(i+1). In the pseudocode given for
Algorithm 5.2.1, this case corresponds to the assignment w <-- v[1..|v|-1].

Page 119
The explanation related to the case in which w was not already a node,
given just above the pseudocode for Algorithm 5.2.1, should be clarified:

(1) Change the second-last sentence to end with "a contradiction to the assumption
that head(i+1) <> w".
(2) Add the following sentence to the end of the paragraph: "In other words, by
creating the node w, we also create head(i+1)."

Page 120
Mailund & Pedersen point out that the proof of Theorem 5.2.3 is incomplete: the
quantity m defined in the second paragraph could possibly result from order n
scans of a path of length order n, and so possibly be order n squared. They propose
deleting the last sentence of the paragraph, then adding a new paragraph as follows:

Our problem then is to bound m. First observe that by the way the
tree is constructed, the successor s(u) of every node u, with the
exception of head(i), must already exist in the tree. Furthermore,
all nodes on the path from the root to u have successor nodes that
must lie on the path from the root to s(u). Thus the depth of u in
the tree can be no more than one plus the depth of s(u) -- "one plus"
because possibly the top node on the path from the root to u has the
root as successor. Thus a move from u to s(u) decreases the depth
by at most 1, and so, in the algorithm the move s(parent(head(i)))
decreases the depth by at most 2. Since fastscan only increases
depth, the total of all iterations can at most decrease the depth by
2n. Thus m < 2n, and the total contribution of fastscan to Algorithm
McC is O(3nB).

(14.2) Chapter 6

Page 159
In the first line below the bottom displayed equation on this page, the words

"a proper prefix of itself"

should of course be

"a proper suffix of itself".

Pages 161 & 163
In the fourth last line of Algorithm Dvl and the sixth last line of Algorithm Dvl*,
j - i + 1 should be j - i.

Page 167
Mailund & Pedersen observe that the "minimum" and "maximum" suffixes introduced at
the beginning of Section 6.2 have not been defined. Perhaps an effective way to make
the meaning clear here is simply to insert after "minimum"

(lexicographically least)

and after "maximum"

(lexicographically greatest).
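
For concreteness, a naive Python illustration of the intended definitions (my own code, not an algorithm from the book):

def min_max_suffixes(x):
    """The lexicographically least and greatest suffixes of x."""
    suffixes = [x[i:] for i in range(len(x))]
    return min(suffixes), max(suffixes)

print(min_max_suffixes("banana"))   # ('a', 'nana')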

Also, at the beginning of the proof of Lemma 6.2.1, CFL is set in italics when it
should not be.

Page 170
At the end of the third line of the second paragraph, "that" should be "the", so that the phrase reads "in the hope".

The word "prefix" in Lemma 6.2.4 should be replaced by "proper prefix"; alternatively, the
lemma should begin with "For all distinct strings". A related point: in the first line
following this lemma, "arbitrary" should be replaced by "distinct".

The word "suffixes" occurs four times in the paragraph following Lemma 6.2.4 when what is
really meant is "proper suffixes".

Page 171
Lemma 6.2.6 did not appear explicitly in Duval's paper [D83]. I introduced it because,
rightly or wrongly, I thought there was a gap in the original proof. However, as Mailund
& Pedersen observe, the proof of Lemma 6.2.6 does not deal correctly with case (a), where
u = x''x'. It appears that completing the proof requires a somewhat lengthier and more
detailed analysis (suggestions for shortening it are welcome!). (In the revised proof of (a)
presented below, I cope with the inadequacies of HTML by using "<" in place of $<_R$.)

(a) u = x''x', where x'' is a nonempty proper suffix of x.

(1) If x'' is not a prefix of x, then x'' < x[1..|x''|], and so u = x''x' < xx', as
required.
(2) If x'' is a prefix of x, then in order to determine whether or not u < xx', we
need to perform successive comparisons of x[t] with x'[t-|x''|] = x[t-|x''|]
for t = |x''|+1,|x''|+2,...,|u|-|x''|. If for some t, x[t] < x[t-|x''|], then
u < xx', as required; while if x[t-|x''|] < x[t], then x < x[t..|x|], a contradiction.

Thus in order to avoid both of these cases, it must be true that x[t] = x[t-|x''|]
for all t. Since |u| < |xx'|, it must therefore be true that u is a proper prefix
of xx', hence that u < xx', as required.

Page 172
The inequality displayed in the proof of Lemma 6.2.9 doesn't make sense. One way to restore
order is to insert in its place the following text (again using "<" in place of $<_R$):

"... so for the suffix s = w^{r-1}w'(mu)x' of x, we find
x = w^{r-1}w'(lambda)w''w'x' < w^{r-1}w'(mu)x',"

Page 177
The last line of Algorithm 6.3.1 should read i0 <-- i0 + max{ell,1}.

(14.3) Chapter 7

Page 188
Near the end of the paragraph preceding Theorem 7.2.1, a possible ambiguity is avoided by
changing "increments i by at least one" to "increments i by at least one for each execution
of the while loop".

(14.4) Chapter 8

Page 208
At the end of inference (2) and also four lines from the bottom of the page, the condition

x[i'] = p[m]

is insufficient and should be replaced by

x[i'-m+1..i'] = p.

(14.5) Chapter 9

Page 249
In the inequality following "Then we find that" at the middle of the page, "<=" can be
replaced by "<".

Page 253
The proof of Lemma 9.4.2 goes astray in the last four lines, which should be replaced
by the following text:

"it follows that |LCS(x1[1..i],x2[1..j[i-1,h]])| = h, hence that
j[i-1,h] <= j[i,h]. However, by Lemma 9.4.1, j[i-1,h] >= j[i,h],
and so j[i-1,h] = j[i,h]. But this is a contradiction, since
by assumption and the definition of j*, j[i,h] < j* <= j[i-1,h].
Therefore the assumption j[i,h] < j* is false and so j[i,h] = j*,
as required."

Also on this page there are a couple of typos: at the end of the paragraph following Lemma 9.4.2,
"average-case" should be "average case"; then, three lines from the bottom of the page,
MATCHLIST is set in italics when it should not be.

Page 261
Shu Wang notes that the second for loop in the function trial_distance should begin "for j".