The Remarkable “Curvature Blindness” Illusion

[h/t Danny Yee's blog]

*For all my social networking posts, see my Google+ page*

Labels:
graphics

Mind. Blown.

[h/t Danny Yee's blog]

*For all my social networking posts, see
my Google+ page
*

The Remarkable “Curvature Blindness” Illusion

[h/t Danny Yee's blog]

Labels:
graphics,
mathematics,
python

Python’s generator expressions (discussed in two previous posts) are very useful for stream programming. But sometimes, you want some state to be preserved between calls. That is where we need full Python generators.

A generator expression is actually a special case of a generator that can be written inline. The generator expression

*generator* starts again directly after the previous *recurrence relations*, where the current value is expressed in terms of previous values (kept as remembered state).

##
Running total

The

$$ T_N = \sum_{i=1}^N x_i$$ We can instead write this sum as a recurrence relation: $$ T_0 = 0 ; T_n = T_{n-1} + x_n$$ Expanding this out explicitly we get $$T_0 = 0 ; T_1 = T_0 + x_1 = 0 + x_1 = x_1 ; T_2 = T_1 + x_2 = x_1 + x_2 ; \ldots$$ +++T_0+++ is the initial state, which is observed directly. The recurrence term +++T_n+++ tells us what needs to be remembered of the previous state(s), and how to use that to generate the current value.

##
Factorial

Similarly we can write the factorial as an explicit product:$$N! = F_N = \prod_{i=1}^N i$$
And we can write it as a recurrence relation:
$$ F_1 = 1; F_n = n F_{n-1} $$
+++F_1+++ is the initial state, which should be the first output of the generator. We can

*after* the

##
Fibonacci numbers

The perennially popular Fibonacci numbers are naturally defined using a recurrence relation, involving two previous states:$$F_1 = F_2 = 1 ; F_n = F_{n-1} + F_{n-2}$$
This demonstrates how we can store more state than just the result of the previous

##
Logistic map

Difference equations, discrete-time analogues of differential equations, are a form of recurrence relation: the value of the state at the next time step is defined in terms of its value at previous timesteps.

The logistic map is a famous difference equation, exhibiting a range of periodic and chaotic behaviours, depending on the value of its paramenter +++r+++. $$x_0 \in (0,1) ; x_{n+1} = r x_n(1-x_n)$$

## Faster +++\pi+++

Many series for generating +++\pi+++ converge very slowly. One that converges extremely quickly is:$$
\pi = \sum_{i=0}^\infty \frac{(i!)^2 \, 2^{i+1}}{(2i+1)!}$$We could use generators for each component of the term to code this us as:

Let $$F_{k-1} = \frac{((k-1)!)^2 \, 2^{k}}{(2k-1)!}$$ Then $$\begin{eqnarray} F_{k} &=& \frac{((k)!)^2 \, 2^{k+1}}{(2k+1)!} \ &=& \frac{((k-1)! k)^2 \, 2 \times 2^{k}}{(2k-1)!(2k)(2k+1)} \ &=& \frac{((k-1)!)^2 \, 2^{k} 2k^2}{(2k-1)! 2k(2k+1)} \ &=& F_{k-1} \frac{k}{2k+1} \end{eqnarray}$$ So $$F_0 = 2 ; F_{n} = F_{n-1} \frac{n}{2n+1}$$

###
Sieving for primes

No discussion of generators would be complete without including a prime number generator.
The standard algorithm is quite straightforward:

*even more* fun than generator expressions!

And in fact, there’s*yet more* fun to be had with generators, because it is possible to send values to them on each call, too; but that’s another story for another time.

A generator expression is actually a special case of a generator that can be written inline. The generator expression

`(<expr> for i in <iter>)`

is shorthand for the generator:def some_gen(): for i in <iter>: yield <expr>For example,

`(n*n for n in count(1))`

can be equivalently written as:```
from itertools import *
def squares():
for n in count(1):
yield n*n
print_for(squares())
```

```
1, 4, 9, 16, 25, 36, 49, 64, 81, 100, ...
```

The `yield`

statement acts somewhat like a `return`

in providing its value. But the crucial difference is what happens on the next call. An ordinary function starts off again from the top; a `yield`

. This becomes important if the generator includes some internal state: this state is maintained between calls. That is, we can have memory of previous state on the next call. This is particularly useful for `accumulate()`

generator provides a running total of its iterator argument. We can write this as an explicit sum:$$ T_N = \sum_{i=1}^N x_i$$ We can instead write this sum as a recurrence relation: $$ T_0 = 0 ; T_n = T_{n-1} + x_n$$ Expanding this out explicitly we get $$T_0 = 0 ; T_1 = T_0 + x_1 = 0 + x_1 = x_1 ; T_2 = T_1 + x_2 = x_1 + x_2 ; \ldots$$ +++T_0+++ is the initial state, which is observed directly. The recurrence term +++T_n+++ tells us what needs to be remembered of the previous state(s), and how to use that to generate the current value.

```
def running_total(gen):
tot = 0
for x in gen:
tot += x
yield tot
print_for(repeat(3))
print_for(running_total(repeat(3)))
print_for(running_total(count(1)))
```

```
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
3, 6, 9, 12, 15, 18, 21, 24, 27, 30, ...
1, 3, 6, 10, 15, 21, 28, 36, 45, 55, ...
```

`yield`

this directly on the first call, then (on the next call) go into a loop `yield`

ing the following values.```
def fact():
f = 1
yield f
for n in count(2):
f *= n
yield f
print_for(fact())
```

```
1, 2, 6, 24, 120, 720, 5040, 40320, 362880, 3628800, ...
```

We can remove the special case, and instead calculate the next value `yield`

in the loop. The subsequent call then picks up at that calculation.```
def fact():
f = 1
for n in count(2):
yield f
f *= n
print_for(fact())
```

```
1, 2, 6, 24, 120, 720, 5040, 40320, 362880, 3628800, ...
```

`yield`

. ```
def fib(start=(1,1)):
a,b = start
while True:
yield a
a,b = b,a+b
print_for(fib(),20)
print_for(fib((0,3)),20)
```

```
1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, ...
0, 3, 3, 6, 9, 15, 24, 39, 63, 102, 165, 267, 432, 699, 1131, 1830, 2961, 4791, 7752,
12543, ...
```

The logistic map is a famous difference equation, exhibiting a range of periodic and chaotic behaviours, depending on the value of its paramenter +++r+++. $$x_0 \in (0,1) ; x_{n+1} = r x_n(1-x_n)$$

```
def logistic_map(r=4):
x = 0.1
while True:
yield x
x = r * x * (1-x)
print_for(logistic_map())
print_for(logistic_map(2))
```

```
0.1, 0.36000000000000004, 0.9216, 0.28901376000000006, 0.8219392261226498,
0.5854205387341974, 0.970813326249438, 0.11333924730376121, 0.4019738492975123,
0.9615634951138128, ...
0.1, 0.18000000000000002, 0.2952, 0.41611392, 0.4859262511644672, 0.49960385918742867,
0.49999968614491325, 0.49999999999980305, 0.5, 0.5, ...
```

We can plot these values to see the chaotic (+++r=4+++) and periodic (+++r=2+++) behaviours over time. (The `%matplotlib inline`

“magic” allows the plot to display in a Jupyter notebook.)```
%matplotlib inline
import matplotlib.pyplot as py
py.plot(list(islice(logistic_map(),200)))
py.plot(list(islice(logistic_map(3.5),200)))
```

We can also produce the more usual plot, with the parameter +++r+++ running along the +++x+++-axis, highlighting the various areas of periodicity and chaos.```
from numpy import arange
start = 2.8
stop = 4
step = (stop-start)*0.002
skip = int(1/step) # no of initial vals to skip (converged to attractor)
npts = 150 # no of values to plot per value of lambda
for r in arange(start,stop,step):
yl = logistic_map(r)
list(islice(yl,skip)) # consume first few items
py.scatter(list(islice(repeat(r,npts),npts)), list(islice(yl,npts)), marker='.', s=1)
py.xlim(start,stop)
py.ylim(0,1)
```

(Here I have used `s=1`

to get a small point, rather than use a comma marker to get a pixel, because there is currently a bug in the use of pixels in scatter plots.)```
import operator
def pi_term():
fact = accumulate(count(1), operator.mul)
factsq = (i*i for i in fact)
twoi1 = (2**(i+1) for i in count(1))
def fact2i1():
i,f = 3,6
while True:
yield f
f = f * (i+1) * (i+2)
i += 2
yield 2 # the i=0 term (needed because of 0! = 1 issues)
for i in map(lambda x,y,z: x*y/z, factsq, twoi1, fact2i1()):
yield i
print_for(accumulate(pi_term()),40)
```

```
2, 2.6666666666666665, 2.933333333333333, 3.0476190476190474, 3.098412698412698,
3.121500721500721, 3.132156732156732, 3.1371295371295367, 3.1394696806461506,
3.140578169680336, 3.141106021601377, 3.1413584725201353, 3.1414796489611394,
3.1415379931734746, 3.1415661593449467, 3.1415797881375944, 3.1415863960370602,
3.141589605588229, 3.1415911669915006, 3.1415919276751456, 3.1415922987403384,
3.141592479958223, 3.1415925685536337, 3.1415926119088344, 3.1415926331440347,
3.1415926435534467, 3.141592648659951, 3.14159265116678, 3.141592652398205,
3.1415926530034817, 3.1415926533011587, 3.141592653447635, 3.141592653519746,
3.1415926535552634, 3.141592653572765, 3.1415926535813923, 3.141592653585647,
3.141592653587746, 3.141592653588782, 3.141592653589293, ...
```

However, these terms get large, and take a long time to calculate.
```
import timeit
%time [ i for i in islice(pi_term(),10000) if i < 0 ];
```

```
Wall time: 10.9 s
```

We can instead write the terms as a recurrence relation.Let $$F_{k-1} = \frac{((k-1)!)^2 \, 2^{k}}{(2k-1)!}$$ Then $$\begin{eqnarray} F_{k} &=& \frac{((k)!)^2 \, 2^{k+1}}{(2k+1)!} \ &=& \frac{((k-1)! k)^2 \, 2 \times 2^{k}}{(2k-1)!(2k)(2k+1)} \ &=& \frac{((k-1)!)^2 \, 2^{k} 2k^2}{(2k-1)! 2k(2k+1)} \ &=& F_{k-1} \frac{k}{2k+1} \end{eqnarray}$$ So $$F_0 = 2 ; F_{n} = F_{n-1} \frac{n}{2n+1}$$

```
import math
def pi_term_rec():
pt = 2
for n in count(1):
yield pt
pt = pt * n/(2*n+1)
print_for(pi_term_rec(),30)
print_for(accumulate(pi_term_rec()),30)
print(math.pi)
```

We can see how quickly the terms shrink.
```
2, 0.6666666666666666, 0.26666666666666666, 0.1142857142857143, 0.0507936507936508,
0.02308802308802309, 0.010656010656010658, 0.004972804972804974, 0.002340143516614105,
0.0011084890341856288, 0.0005278519210407756, 0.0002524509187586318,
0.00012117644100414327, 5.8344212335328244e-05, 2.816617147222743e-05,
1.3628792647851983e-05, 6.607899465625204e-06, 3.209551169017956e-06,
1.561403271414141e-06, 7.606836450479149e-07, 3.710651927063e-07,
1.8121788481005348e-07, 8.85954103515817e-08, 4.335520081034849e-08,
2.123520039690538e-08, 1.0409411959267343e-08, 5.1065039800179424e-09,
2.5068292265542627e-09, 1.2314248832196377e-09, 6.052766375147372e-10, ...
2, 2.6666666666666665, 2.933333333333333, 3.0476190476190474, 3.098412698412698,
3.121500721500721, 3.132156732156732, 3.1371295371295367, 3.1394696806461506,
3.140578169680336, 3.141106021601377, 3.1413584725201353, 3.1414796489611394,
3.1415379931734746, 3.1415661593449467, 3.1415797881375944, 3.1415863960370602,
3.141589605588229, 3.1415911669915006, 3.1415919276751456, 3.1415922987403384,
3.141592479958223, 3.1415925685536337, 3.1415926119088344, 3.1415926331440347,
3.1415926435534467, 3.141592648659951, 3.14159265116678, 3.141592652398205,
3.1415926530034817, ...
3.141592653589793
```

Not only is this code simpler, it is also much faster:
```
%time [ i for i in islice(pi_term_rec(),10000) if i < 0 ]
```

```
Wall time: 3.98 ms
```

- Generate consecutive whole numbers, and test each for divisibility. If the current number isn’t divisible by anything, yield a new prime.
- Only test for divisibility by primes, and only up to the square root of the number being tested; this requires keeping a record of the primes found so far.
- Optimisation: treat 2 as a special case, and generate and test only the odd numbers.

```
def primessqrt():
primessofar = []
yield 2
for n in count(3,2): # check odd numbers only, starting with 3
sqrtn = int(math.sqrt(n))
testprimes = takewhile(lambda i: i<=sqrtn, primessofar)
isprime = all(n % p for p in testprimes) # n % p == 0 if n is divisible
if isprime: # if new prime, add to list and yield, else continue
yield n
primessofar.append(n)
print_while(primessqrt(), 200)
```

```
2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83,
89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179,
181, 191, 193, 197, 199, ...
```

This algorithm is often referred to as the Sieve of Eratosthenes, but the true sieve is somewhat different in its operation. The sieve doesn’t require any square roots or divisions: it uses addition only, moving a marker through the numbers, striking out the multiples. Using generators we can do this lazily, using a dictionary to store which marks have struck out a particular value (for example, when we get to 15, the dictionary entry will show that it has been struck out by 3).

- Generate consecutive whole numbers, and check whether each one’s dictionary entry has any markers.
- If there are no markers, yield a new prime +++p+++; start a new marker to strike out multiples of +++p+++ (start it at +++p^2+++, for the same reason the previous algorithm only needs to test values up to +++\sqrt n+++).
- If there are markers (such as for 15), the value isn’t prime; move each marker on to the next value it strikes out (so for 15, move the 3 marker on 6 places to strike out 21: each marker is moved on by twice its value, since we are optimising by not considering even values)
- Delete the current dictionary entry (so that the dictionary grows only as the number of primes found so far, rather than as the number of values checked, speeding up dictionary access).

```
from collections import defaultdict
def primesieve():
yield 2
sieve = defaultdict(set) # dict of n:{divisor} elems
for n in count(3,2): # check odd numbers only
if sieve[n] : # there are divisors, so not prime
for d in sieve[n]: # move the sieve markers on
sieve[n+d].add(d)
else: #set empty, so prime
sieve[n*n].add(2*n)
yield n
# remove current dict item
del sieve[n]
print_for(primesieve(), 100)
```

```
2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83,
89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179,
181, 191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241, 251, 257, 263, 269, 271, 277,
281, 283, 293, 307, 311, 313, 317, 331, 337, 347, 349, 353, 359, 367, 373, 379, 383, 389,
397, 401, 409, 419, 421, 431, 433, 439, 443, 449, 457, 461, 463, 467, 479, 487, 491, 499,
503, 509, 521, 523, 541, ...
```

And this “true” sieve is faster than the divisor version.
```
%time [ i for i in islice(primessqrt(),100000) if i < 0 ]
%time [ i for i in islice(primesieve(),100000) if i < 0 ]
```

```
Wall time: 4.69 s
Wall time: 735 ms
```

So full Python generators are And in fact, there’s

One of the tasks of an academic is examining research students. This involves reading a thesis, then discussing it with the student in a viva. In the UK the viva is held behind closed doors, and usually involves only two examiners, one internal (same university as the student) and one external. The scheme is different on the continent: the defence is public, following a lecture, and there may be multiple “opponents”, or even an entire panel, of examiners.

If every research student needs (a minimum of) two examiners, that implies for every student I supervise, I should examine at least two others (one as internal, one as external, to keep the numbers balanced). In fact, since many universities like to have “senior” external examiners, I should probably do a few more than that.

Each year, I usually take on one or two new students. So, each year, I should probably examine about three to five students. So far this year I have examined only two: both external continental defences, one in Trondheim in April, one in Lyon in July.

That implies I should have two or three more to do this year (if it is an average year). And sure enough, in November a pile of theses started raining down in a seemingly never-ending stream on my desk.

So the total number is about right, but all at once is a bit of a nightmare. Two internal students, two external students, all to be examined before the end of January. I know what*I’m* going to be reading over Christmas.

If every research student needs (a minimum of) two examiners, that implies for every student I supervise, I should examine at least two others (one as internal, one as external, to keep the numbers balanced). In fact, since many universities like to have “senior” external examiners, I should probably do a few more than that.

Each year, I usually take on one or two new students. So, each year, I should probably examine about three to five students. So far this year I have examined only two: both external continental defences, one in Trondheim in April, one in Lyon in July.

That implies I should have two or three more to do this year (if it is an average year). And sure enough, in November a pile of theses started raining down in a seemingly never-ending stream on my desk.

So the total number is about right, but all at once is a bit of a nightmare. Two internal students, two external students, all to be examined before the end of January. I know what

Labels:
python

When I was first looking at Python generators, I discovered David Beazley’s wonderful presentation on
*Generator Tricks for Systems Programmers*,
where he gives a lovely example of using generators in a Unix-pipeline fashion,
each generator piping its output into the next.

Before I start using this approach, I’m going to need some debugging help. I like to write-a-line-test-a-line, and I use

We can use

*literate programming* solution in Pascal, including a novel data structure, the *trie*. McIlroy then critiques Knuth’s solution, and suggests a 6 stage Unix pipeline to do the same job:

Let’s use some online text files for our test input.

*lorem ipsum* file has the advantage of containing only words without hyphens (no words split over lines) or apostrophes (all words are simply strings of letters) or numbers or other weirdnesses, but there are commas and semicolons and full stops (periods).
The Gutenberg *Great Expectations* has more clutter, but since this isn’t an exercise in parsing text, I will take the simpler properties for granted.

Let’s open the file, and print out the first few lines, to see what we have. (The raw input is binary, so we used

We can do the same by making use of the Python

Next in McIlroy’s pipeline is the

We can use a*default dictionary* to save us needing special code when a new word is discovered.

*pellentesque* gives 17 hits, so that’s looking good.

The dictionary (175 items) is*much* smaller than the entire file (1368 words), drastically saving on memory.
We can now sort on the dictionary values to get a list of words in frequency order, then slice off the top *k* of them.

*Great Expectations*:

*stop words* – from the dictionary first:

*pip* and *havisham*, which are special for this particular text.
Also, there is clearly a lot of conversation: *said* is the most common non-stop word.

We also see a couple of rather strange ‘words’:*s* and *t*. This is because the original assumptions of no apostrophes (and also of no hyphens)
doesn’t actually hold for the *Great Expectations* file, so a more sophisticated definition of a ‘word’ is needed. But that’s another story for another time.

Before I start using this approach, I’m going to need some debugging help. I like to write-a-line-test-a-line, and I use

`print()`

statements a lot
(I was weaned on Fortran IV).
But printing a generator’s content consumes that content,
messing up the following code (unless you comment out and rerun,
which is just nasty).We can use

`tee()`

to build a generator debugging printer.
Tee the generator, print from the first element, and return the second (untouched) element (see my previous post on generators for the definition of `print_for()`

):```
from itertools import *
def print_debug(gen, n=10, sep=', ', end=", ...\n", enum=False):
gen_tee = tee(gen)
if enum: # add a count to the items
print(*list(islice(enumerate(gen_tee[0],1),n)), sep=sep, end=end)
else:
print(*list(islice(gen_tee[0],n)), sep=sep, end=end)
return gen_tee[1]
counter = count(1)
print_for(counter)
print_for(counter)
counter = count(1)
counter = print_debug(counter)
print_for(counter)
```

```
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...
```

Okay, back to Unix pipelines.
There’s a lovely little paper from three decades ago.Bentley, J., Knuth, D. & McIlroy, D., 1986. Programming Pearls - A Literate Program. CACM, 29(6), pp.471–483.In that paper, the following problem is posed:

Available at: https://www.cs.princeton.edu/courses/archive/spring17/cos333/knuth-mcilroy.pdf.

Given a text file and an integerThis is a deliberately vague problem specification from Bentley. Knuth tightens up the definition, and then gives a several page (42 numbered paragraphs!)k, print thekmost common words in the file (and the number of their occurrences) in decreasing frequency.

accompanied by the following helpful gloss for those unfamiliar with Unix shell arcanae:`(1) tr -cs A-Za-z' ' | (2) tr A-Z a-z | (3) sort | (4) uniq -c | (5) sort -rn | (6) sed ${1}q`

Using Beazley’s ideas, Python generators can give us a midway between extreme Pascal verbosity and extreme Unix terseness, as follows.

- Make one-word lines by transliterating the complement (
`-c`

) of the alphabet into newlines (note the quoted newline), and squeezing out (`-s`

) multiple newlines.- Transliterate upper case to lower case.
- Sort to bring identical words together.
- Replace each run of duplicate words with a single representative and include a count (
`-c`

).- Sort in reverse (
`-r`

) numeric (`-n`

) order.- Pass through a stream editor; quit (
`q`

) after printing the number of lines designated by the script’s first parameter (`${1}`

).

Let’s use some online text files for our test input.

```
textlorum = "http://www.sample-videos.com/text/Sample-text-file-10kb.txt"
textgreatexpectations = "http://www.gutenberg.org/files/1400/1400-0.txt"
```

The Let’s open the file, and print out the first few lines, to see what we have. (The raw input is binary, so we used

`decode()`

to convert to strings;
again to do all this properly is a bit more work, but not the point of this discussion.)```
import urllib.request
infile = urllib.request.urlopen(textlorum)
lines = (line.decode() for line in infile)
# have a peek at the first 3 lines, numbered
lines = print_debug(lines,n=3, sep='\n', end='\n', enum=True)
```

```
(1, 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus condimentum sagittis
lacus, laoreet luctus ligula laoreet ut. Vestibulum ullamcorper accumsan velit vel
vehicula. Proin tempor lacus arcu. Nunc at elit condimentum, semper nisi et, condimentum
mi. In venenatis blandit nibh at sollicitudin. Vestibulum dapibus mauris at orci maximus
pellentesque. Nullam id elementum ipsum. Suspendisse cursus lobortis viverra. Proin et
erat at mauris tincidunt porttitor vitae ac dui.\r\n')
(2, '\r\n')
(3, 'Donec vulputate lorem tortor, nec fermentum nibh bibendum vel. Lorem ipsum dolor sit
amet, consectetur adipiscing elit. Praesent dictum luctus massa, non euismod lacus.
Pellentesque condimentum dolor est, ut dapibus lectus luctus ac. Ut sagittis commodo arcu.
Integer nisi nulla, facilisis sit amet nulla quis, eleifend suscipit purus. Class aptent
taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos.
Aliquam euismod ultrices lorem, sit amet imperdiet est tincidunt vel. Phasellus dictum
justo sit amet ligula varius aliquet auctor et metus. Fusce vitae tortor et nisi pulvinar
vestibulum eget in risus. Donec ante ex, placerat a lorem eget, ultricies bibendum purus.
Nam sit amet neque non ante laoreet rutrum. Nullam aliquet commodo urna, sed ullamcorper
odio feugiat id. Mauris nisi sapien, porttitor in condimentum nec, venenatis eu urna.
Pellentesque feugiat diam est, at rhoncus orci porttitor non.\r\n')
```

So each paragraph in the text file is a single line terminated with `'\r\n'`

, and paragraphs are separated by blank lines. What the first stage of the McIlroy’s Unix pipeline does is output single word lines, by turning everything that is not a letter (here, only `'.,;'`

and newlines) into newlines, and squashing multiple newlines into single ones.
We can do the same by making use of the Python

`re`

(regular expression) module.
We build a generator that takes in a line of data, and yields the words (defined using a regular expression as any sequence of upper and lower case letter) one at a time, translated to lower case (using `lower()`

).
We then embed that in a second generator, to iterate over all the lines of data. ```
import re
words = (w.group(0).lower() for line in lines for w in re.finditer(r"[a-zA-Z]+", line) )
# have a peek at the first 100 words
words = print_debug(words,100)
```

```
lorem, ipsum, dolor, sit, amet, consectetur, adipiscing, elit, vivamus, condimentum,
sagittis, lacus, laoreet, luctus, ligula, laoreet, ut, vestibulum, ullamcorper, accumsan,
velit, vel, vehicula, proin, tempor, lacus, arcu, nunc, at, elit, condimentum, semper,
nisi, et, condimentum, mi, in, venenatis, blandit, nibh, at, sollicitudin, vestibulum,
dapibus, mauris, at, orci, maximus, pellentesque, nullam, id, elementum, ipsum,
suspendisse, cursus, lobortis, viverra, proin, et, erat, at, mauris, tincidunt,
porttitor, vitae, ac, dui, donec, vulputate, lorem, tortor, nec, fermentum, nibh,
bibendum, vel, lorem, ipsum, dolor, sit, amet, consectetur, adipiscing, elit, praesent,
dictum, luctus, massa, non, euismod, lacus, pellentesque, condimentum, dolor, est, ut,
dapibus, lectus, luctus, ac, ...
```

So this is generating the words, ignoring newlines and blank lines properly.
We have now done the first two lines of the Unix pipe, with three lines of (in the last case admittedly rather complex!) Python.
Next in McIlroy’s pipeline is the

`sort`

. This is not a generator friendly operation, as it needs the entire file of words in memory to be sorted. But we can do the job of the following line,
`uniq -c`

, without having to sort first, by building a dictionary of the occurrences. We still need to read the entire file to do this, but because we iterate over the `word`

generator, we can do this incrementally. So we don’t need to store the entire file in memory, which becomes important if the file is large.
We can use a

```
from collections import defaultdict
def build_word_dict(word_gen):
# construct and return a dictionary of (word,count) items
wdict = defaultdict(int)
for word in word_gen:
wdict[word] += 1
return wdict
word_dict = build_word_dict(words)
```

Let's have a look at some properties of the constructed dictionary:```
# print words with more than 15 occurrences
print({k:v for k,v in word_dict.items() if v>15})
print('total number of words =', sum([v for k,v in word_dict.items()]))
print('number of unique words =', len(word_dict))
```

```
{'sit': 19, 'sed': 25, 'ut': 18, 'tincidunt': 19, 'metus': 16, 'nulla': 21, 'et': 30,
'amet': 19, 'at': 19, 'ipsum': 21, 'non': 21, 'lorem': 21, 'in': 25, 'pellentesque': 17}
total number of words = 1368
number of unique words = 175
```

Opening up the input file in Chrome and searching for The dictionary (175 items) is

```
k = 10 # the "integer k" in the spec; k most common words
top_word_keys = sorted(word_dict, key=word_dict.get, reverse=True)
for key in islice(top_word_keys,k):
print(word_dict[key], key)
```

```
30 et
25 in
25 sed
21 non
21 lorem
21 nulla
21 ipsum
19 sit
19 tincidunt
19 at
```

And we’re done! Putting it all together, and running on ```
from itertools import *
import urllib.request
import re
from collections import defaultdict
textsource = "http://www.sample-videos.com/text/Sample-text-file-10kb.txt"
textgreatexpectations = "http://www.gutenberg.org/files/1400/1400-0.txt"
k = 20 # the "integer k" in the spec; k most common words
def build_word_dict(word_gen):
# construct and return a dictionary of (word,count) items
wdict = defaultdict(int)
for word in word_gen:
wdict[word] += 1
return wdict
infile = urllib.request.urlopen(textgreatexpectations)
lines = (line.decode() for line in infile)
words = (w.group(0).lower() for line in lines for w in re.finditer(r"[a-zA-Z]+", line) )
word_dict = build_word_dict(words)
top_word_keys = sorted(word_dict, key=word_dict.get, reverse=True)
for key in islice(top_word_keys,k):
print(word_dict[key], key)
```

```
8321 the
7166 and
6666 i
5235 to
4557 of
4112 a
3090 that
3083 in
2837 was
2811 it
2382 you
2263 he
2093 had
2070 my
1998 me
1860 his
1807 with
1786 as
1654 at
1432 on
```

So that’s the problem solved, using generators in many places.
It’s not as compact as the Unix shell script, but it’s still quite remarkably compact.
And it improves the memory usage by not requiring the whole file to be read in to memory at once in order to sort the individual words:
```
print('total number of words =', sum([v for k,v in word_dict.items()]))
print('number of unique words =', len(word_dict))
```

```
total number of words = 192016
number of unique words = 10989
```

Although that’s the problem as stated solved, it might not be what you are expecting.
All the most common words are … common words.
Any large text will give similar results.
We can remove the ‘uninteresting’ common words – called ```
# get full stop word list from https://tinyurl.com/yce6ttyc
stop_words = [ "a", "about", "above", "after", "again", "against", "all", "am", ... ];
word_dict_stopped = { k:v for k,v in word_dict.items() if k not in stop_words }
print('number of stop words =', len(stop_words))
print({k:v for k,v in word_dict_stopped.items() if v>600})
print('total number of words after stop =', sum([v for k,v in word_dict_stopped.items()]))
print('number of unique words after stop =', len(word_dict_stopped))
```

```
number of stop words = 153
{'said': 1349, 't': 682, 'no': 655, 'joe': 747, 'not': 1088, 's': 1194, 'mr': 711}
total number of words after stop = 89529
number of unique words after stop = 10870
```

We can then look at the most frequent words after the stop words have been removed:```
top_word_keys = sorted(word_dict_stopped, key=word_dict.get, reverse=True)
for key in islice(top_word_keys,k):
print(word_dict[key], key)
```

```
1349 said
1194 s
1088 not
747 joe
711 mr
682 t
655 no
514 one
453 now
392 know
383 miss
375 come
374 time
371 little
368 upon
341 pip
327 like
325 looked
321 man
318 havisham
```

That is a bit more interesting: we can see We also see a couple of rather strange ‘words’:

Labels:
Bristol,
complexity

I’m in Bristol, attending a “community consultation workshop” on the *Review of Complexity Science as an EPSRC Research Area*. I travelled down by train last night, and walked to the hotel. Google maps wanted me to cut across some weird footbridge over a river, and wander through back streets. Since it was 10 pm, pitch dark, and very windy, I thanked it kindly, and walked straight up the well-lit main road.

This is the lovely autumnal view that greeted me this morning:

Now off to talk about complexity science, complex systems, systems thinking, and cross-disciplinarity.

This is the lovely autumnal view that greeted me this morning:

Now off to talk about complexity science, complex systems, systems thinking, and cross-disciplinarity.

Labels:
mathematics,
python

In the past, I have rhapsodised over Python’s list comprehensions. I had also heard of its generators, but never looked at them seriously. Recently I have been thinking about a problem in *stream programming*, where I will need the generators’ *lazy evaluation*. So I have been looking at them in more detail. And discovered that (in Python 3 at least) the list comprehension is just syntactic sugar for a generator expression output turned into a list. That is:

With a list comprehension, I can construct, say, a list of the first 10 square numbers.

*unbounded*.

The module

##
Restarting generators

I have been quite careful above to keep redefining

##

Once it’s gone, it’s gone. But what if you want it back? You could always iterate again. But what if the items are expensive to compute? For example, you may be reading a large file, and don’t want to read it all again. Here

##
Sums and products

The

##
Arithmetic series and products

We can use

##
sum of reciprocal squares

$$
\sum_{n=1}^\infty \frac{1}{n^2} = \frac{\pi^2}{6}
$$

###
sum of reciprocal powers of 2

$$
\sum_{n=1}^\infty \frac{1}{2^n} = 1
$$

###
factorial

$$
\prod_{i=1}^n i = n!
$$

###
a product for +++\pi+++

$$
\prod_{n=1}^\infty \left( \frac{4n^2}{4n^2-1} \right) = \frac{\pi}{2}
$$

`[ <expr> for i in <iter> if <cond> ]`

is just syntactic sugar for `list( <expr> for i in <iter> if <cond> )`

. Let’s look at this in more detail.With a list comprehension, I can construct, say, a list of the first 10 square numbers.

```
squares = [n*n for n in range(1,11)]
print(squares)
```

```
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```

Since this is just syntactic sugar, it is the same as writing:```
squares = list(n*n for n in range(1,11))
print(squares)
```

```
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```

So far, so trivial. But the advantage of using the generator form is that it lazily iterates (it "generates") its items one at a time, as they are asked for. If you do not ask, it does not produce. This is useful if you don’t know how many items you want, so cannot specify the stop value of the `range()`

. A generator can be ```
from itertools import *
squares = (n*n for n in count(1))
for s in squares:
print(s, end=", ")
if s >= 100:
print('...')
break
```

```
1, 4, 9, 16, 25, 36, 49, 64, 81, 100, ...
```

Here `count()`

imported from `itertools`

is a iterable like `range()`

except that it does not have a stop value. It is a generator that just keeps incrementing until you stop asking it for more values.The module

`itertools`

has many such useful functions you can use with potentially infinite generators. For example, we can `islice()`

off the first few items of a potentially unbounded generator, to give a bounded generator that returns only those items. Here we slice off the first 10 values:```
squares = (n*n for n in count(1))
firstfew = islice(squares,10)
print(list(firstfew))
```

```
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```

This is fine if we know how many items we want. Sometimes instead we want all the items up to a particular value. We can use `takewhile()`

for this, here to get the squares less than 150.```
squares = (n*n for n in count(1))
firstfew = takewhile(lambda n: n<=150, squares)
print(list(firstfew))
```

```
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144]
```

Let’s use `islice()`

and `takewhile`

to define a couple of "pretty print" functions:```
# print the first n items of the generator, comma separated
def print_for(gen,n=10):
print(*list(islice(gen,n)), sep=', ', end=", ...\n")
# print the generator up to value nmax, comma separated
def print_while(gen,nmax=250):
print(*list(takewhile(lambda n: n<=nmax, gen)), sep=', ', end=", ...\n")
print_for(n*n for n in count(1))
print_while(n*n for n in count(1))
```

```
1, 4, 9, 16, 25, 36, 49, 64, 81, 100, ...
1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, ...
```

```
squares = (n*n for n in count(1))
print_for(squares)
squares = (n*n for n in count(1))
print_while(squares)
```

```
1, 4, 9, 16, 25, 36, 49, 64, 81, 100, ...
1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, ...
```

There are analogous functions `dropwhile()`

, which drops items until its predicate is `True`

, and `filterfalse()`

, which drops all items for which its predicate is false.```
print_for(dropwhile(lambda n: n<=5, count(1))) # drop items until they get to 5
print_for(filterfalse(lambda n: n%3, count(1))) # filter out items not divisible by 3
```

```
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
3, 6, 9, 12, 15, 18, 21, 24, 27, 30, ...
```

`squares`

. This is because once an item is consumed, it is gone. If I take a generator instance, like `squares`

, and slice off the first ten item, then slice again, I get the next ten items. For example:```
squares = (n*n for n in count(1))
print_for(squares)
print_for(squares) # continues from next value of same generator
print_for(n*n for n in count(1))
print_for(n*n for n in count(1)) # restarts, with new generator
```

```
1, 4, 9, 16, 25, 36, 49, 64, 81, 100, ...
121, 144, 169, 196, 225, 256, 289, 324, 361, 400, ...
1, 4, 9, 16, 25, 36, 49, 64, 81, 100, ...
1, 4, 9, 16, 25, 36, 49, 64, 81, 100, ...
```

You do need to ensure you consume the generated items for this to occur. Consider:```
squares = (n*n for n in count(1))
# slice off the first 10?
islice(squares, 10)
print_for(squares) # maybe not what is expected
```

```
1, 4, 9, 16, 25, 36, 49, 64, 81, 100, ...
```

The `islice()`

produced a generator. But since nothing accessed that generator, nothing actually got sliced off `squares`

. Compare:```
squares = (n*n for n in count(1))
# slice off and consume the first 10
list(islice(squares, 10))
print_for(squares)
```

```
121, 144, 169, 196, 225, 256, 289, 324, 361, 400, ...
```

`tee()`

for two`tee()`

is useful for "remembering" earlier items. (Note: this doesn’t make +++n+++ copies of an iterator, rather it gives +++n+++ pointers into a single iterator.)```
sq_ptr = tee((n*n for n in count(1)))
print_for(sq_ptr[0])
print_for(sq_ptr[0],5)
print_for(sq_ptr[1])
print_for(sq_ptr[1])
print_for(sq_ptr[1],5)
print_for(sq_ptr[0])
```

```
1, 4, 9, 16, 25, 36, 49, 64, 81, 100, ...
121, 144, 169, 196, 225, ...
1, 4, 9, 16, 25, 36, 49, 64, 81, 100, ...
121, 144, 169, 196, 225, 256, 289, 324, 361, 400, ...
441, 484, 529, 576, 625, ...
256, 289, 324, 361, 400, 441, 484, 529, 576, 625, ...
```

One application is building a "sliding window" over the iterated data. Here is an example for a window of size 3:```
def triples(iterable):
a, b, c = tee(iterable,3)
next(b)
list(islice(c,2))
return zip(a, b, c)
print_for(triples(count(1)))
print_for(triples('abcdefghijklmnopqrstuvwxyz'))
```

```
(1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6), (5, 6, 7), (6, 7, 8), (7, 8, 9), (8, 9, 10),
(9, 10, 11), (10, 11, 12), ...
('a', 'b', 'c'), ('b', 'c', 'd'), ('c', 'd', 'e'), ('d', 'e', 'f'), ('e', 'f', 'g'),
('f', 'g', 'h'), ('g', 'h', 'i'), ('h', 'i', 'j'), ('i', 'j', 'k'), ('j', 'k', 'l'), ...
```

and here it is for a user-defined window size:```
def sliding_window(iterable,n=2):
windows = tee(iterable,n)
for i in range(n):
list(islice(windows[i],i))
return zip(*windows)
print_for(sliding_window(count(1),4))
print_for(sliding_window('abcdefghijklmnopqrstuvwxyz',3))
```

```
(1, 2, 3, 4), (2, 3, 4, 5), (3, 4, 5, 6), (4, 5, 6, 7), (5, 6, 7, 8), (6, 7, 8, 9),
(7, 8, 9, 10), (8, 9, 10, 11), (9, 10, 11, 12), (10, 11, 12, 13), ...
('a', 'b', 'c'), ('b', 'c', 'd'), ('c', 'd', 'e'), ('d', 'e', 'f'), ('e', 'f', 'g'),
('f', 'g', 'h'), ('g', 'h', 'i'), ('h', 'i', 'j'), ('i', 'j', 'k'), ('j', 'k', 'l'), ...
```

`itertools`

module has a function `accumulate()`

that takes an iterator, and returns a generator that comprises the sum of the values up to that point. We could use this to implement a simple version of `count()`

"the hard way", by using `repeat()`

, which simply repreats its argument endlessly:```
print_for(repeat(3))
print_for(accumulate(repeat(1)))
```

```
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...
```

The triangle numbers are `1`

, `1+2`

, `1+2+3`

, ... We can use `count()`

and `accumulate()`

to generate these:```
print_for(accumulate(count(1)))
```

```
1, 3, 6, 10, 15, 21, 28, 36, 45, 55, ...
```

`accumulate()`

defaults to summing the items, but other functions can be used.```
import operator
factorial = accumulate(count(1), operator.mul)
print_for(factorial)
```

```
1, 2, 6, 24, 120, 720, 5040, 40320, 362880, 3628800, ...
```

`accumulate()`

to calculate infinite sums and products (or at least, give us the first howevermany terms).```
import math
recip_squares = (1/(n*n) for n in count(1))
print_for(accumulate(recip_squares))
# or all in one go, now with 30 terms:
print_for(accumulate(1/(n*n) for n in count(1)),30)
# it converges rather slowly
print('limit =', math.pi*math.pi/6)
```

```
1.0, 1.25, 1.3611111111111112, 1.4236111111111112, 1.4636111111111112, 1.4913888888888889,
1.511797052154195, 1.527422052154195, 1.5397677311665408, 1.5497677311665408, ...
1.0, 1.25, 1.3611111111111112, 1.4236111111111112, 1.4636111111111112, 1.4913888888888889,
1.511797052154195, 1.527422052154195, 1.5397677311665408, 1.5497677311665408,
1.558032193976458, 1.5649766384209025, 1.5708937981842162, 1.5759958390005426,
1.580440283444987, 1.584346533444987, 1.587806741057444, 1.5908931608105303,
1.5936632439130234, 1.5961632439130233, 1.5984308176091684, 1.6004969333116477,
1.6023872924798896, 1.6041234035910008, 1.6057234035910009, 1.6072026935318293,
1.6085744356443121, 1.6098499458483937, 1.6110390064904865, 1.6121501176015975, ...
limit = 1.6449340668482264
```

```
print_for(accumulate(1/(2**n) for n in count(1)),20)
```

```
0.5, 0.75, 0.875, 0.9375, 0.96875, 0.984375, 0.9921875, 0.99609375, 0.998046875,
0.9990234375, 0.99951171875, 0.999755859375, 0.9998779296875, 0.99993896484375,
0.999969482421875, 0.9999847412109375, 0.9999923706054688, 0.9999961853027344,
0.9999980926513672, 0.9999990463256836, ...
```

```
print_for(accumulate((n for n in count(1)), operator.mul))
```

```
1, 2, 6, 24, 120, 720, 5040, 40320, 362880, 3628800, ...
```

```
print_for(accumulate((1/(1-1/(4*n**2)) for n in count(1)), operator.mul),20)
# it also converges rather slowly
print('limit =', math.pi/2)
```

```
1.3333333333333333, 1.422222222222222, 1.4628571428571429, 1.4860770975056687,
1.5010879772784533, 1.51158509600068, 1.5193368144417092, 1.525294998027755,
1.5300172735634447, 1.533851903321749, 1.5370275801402207, 1.539700671583943,
1.5419817096159192, 1.5439510349155563, 1.5456684442981097, 1.5471793616434646,
1.5485189108743247, 1.549714678373069, 1.5507886317191353, 1.5517584807696163, ...
limit = 1.5707963267948966
```

This is all very neat and nifty, and we have barely scratched the surface of what can be done.
But that's probably (more than) enough for now.
Subscribe to:
Posts (Atom)