Before I start using this approach, I’m going to need some debugging help. I like to write-a-line-test-a-line, and I use
print()
statements a lot
(I was weaned on Fortran IV).
But printing a generator’s content consumes that content,
messing up the following code (unless you comment out and rerun,
which is just nasty).We can use
tee()
to build a generator debugging printer.
Tee the generator, print from the first element, and return the second (untouched) element (see my previous post on generators for the definition of print_for()
):from itertools import *
def print_debug(gen, n=10, sep=', ', end=", ...\n", enum=False):
gen_tee = tee(gen)
if enum: # add a count to the items
print(*list(islice(enumerate(gen_tee[0],1),n)), sep=sep, end=end)
else:
print(*list(islice(gen_tee[0],n)), sep=sep, end=end)
return gen_tee[1]
counter = count(1)
print_for(counter)
print_for(counter)
counter = count(1)
counter = print_debug(counter)
print_for(counter)
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...
Okay, back to Unix pipelines.
There’s a lovely little paper from three decades ago.Bentley, J., Knuth, D. & McIlroy, D., 1986. Programming Pearls - A Literate Program. CACM, 29(6), pp.471–483.In that paper, the following problem is posed:
Available at: https://www.cs.princeton.edu/courses/archive/spring17/cos333/knuth-mcilroy.pdf.
Given a text file and an integer k, print the k most common words in the file (and the number of their occurrences) in decreasing frequency.This is a deliberately vague problem specification from Bentley. Knuth tightens up the definition, and then gives a several page (42 numbered paragraphs!) literate programming solution in Pascal, including a novel data structure, the trie. McIlroy then critiques Knuth’s solution, and suggests a 6 stage Unix pipeline to do the same job:
accompanied by the following helpful gloss for those unfamiliar with Unix shell arcanae:(1) tr -cs A-Za-z' ' | (2) tr A-Z a-z | (3) sort | (4) uniq -c | (5) sort -rn | (6) sed ${1}q
Using Beazley’s ideas, Python generators can give us a midway between extreme Pascal verbosity and extreme Unix terseness, as follows.
- Make one-word lines by transliterating the complement (
-c
) of the alphabet into newlines (note the quoted newline), and squeezing out (-s
) multiple newlines.- Transliterate upper case to lower case.
- Sort to bring identical words together.
- Replace each run of duplicate words with a single representative and include a count (
-c
).- Sort in reverse (
-r
) numeric (-n
) order.- Pass through a stream editor; quit (
q
) after printing the number of lines designated by the script’s first parameter (${1}
).
Let’s use some online text files for our test input.
textlorum = "http://www.sample-videos.com/text/Sample-text-file-10kb.txt"
textgreatexpectations = "http://www.gutenberg.org/files/1400/1400-0.txt"
The lorem ipsum file has the advantage of containing only words without hyphens (no words split over lines) or apostrophes (all words are simply strings of letters) or numbers or other weirdnesses, but there are commas and semicolons and full stops (periods).
The Gutenberg Great Expectations has more clutter, but since this isn’t an exercise in parsing text, I will take the simpler properties for granted.Let’s open the file, and print out the first few lines, to see what we have. (The raw input is binary, so we used
decode()
to convert to strings;
again to do all this properly is a bit more work, but not the point of this discussion.)import urllib.request
infile = urllib.request.urlopen(textlorum)
lines = (line.decode() for line in infile)
# have a peek at the first 3 lines, numbered
lines = print_debug(lines,n=3, sep='\n', end='\n', enum=True)
(1, 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus condimentum sagittis
lacus, laoreet luctus ligula laoreet ut. Vestibulum ullamcorper accumsan velit vel
vehicula. Proin tempor lacus arcu. Nunc at elit condimentum, semper nisi et, condimentum
mi. In venenatis blandit nibh at sollicitudin. Vestibulum dapibus mauris at orci maximus
pellentesque. Nullam id elementum ipsum. Suspendisse cursus lobortis viverra. Proin et
erat at mauris tincidunt porttitor vitae ac dui.\r\n')
(2, '\r\n')
(3, 'Donec vulputate lorem tortor, nec fermentum nibh bibendum vel. Lorem ipsum dolor sit
amet, consectetur adipiscing elit. Praesent dictum luctus massa, non euismod lacus.
Pellentesque condimentum dolor est, ut dapibus lectus luctus ac. Ut sagittis commodo arcu.
Integer nisi nulla, facilisis sit amet nulla quis, eleifend suscipit purus. Class aptent
taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos.
Aliquam euismod ultrices lorem, sit amet imperdiet est tincidunt vel. Phasellus dictum
justo sit amet ligula varius aliquet auctor et metus. Fusce vitae tortor et nisi pulvinar
vestibulum eget in risus. Donec ante ex, placerat a lorem eget, ultricies bibendum purus.
Nam sit amet neque non ante laoreet rutrum. Nullam aliquet commodo urna, sed ullamcorper
odio feugiat id. Mauris nisi sapien, porttitor in condimentum nec, venenatis eu urna.
Pellentesque feugiat diam est, at rhoncus orci porttitor non.\r\n')
So each paragraph in the text file is a single line terminated with '\r\n'
, and paragraphs are separated by blank lines. What the first stage of the McIlroy’s Unix pipeline does is output single word lines, by turning everything that is not a letter (here, only '.,;'
and newlines) into newlines, and squashing multiple newlines into single ones.
We can do the same by making use of the Python
re
(regular expression) module.
We build a generator that takes in a line of data, and yields the words (defined using a regular expression as any sequence of upper and lower case letter) one at a time, translated to lower case (using lower()
).
We then embed that in a second generator, to iterate over all the lines of data. import re
words = (w.group(0).lower() for line in lines for w in re.finditer(r"[a-zA-Z]+", line) )
# have a peek at the first 100 words
words = print_debug(words,100)
lorem, ipsum, dolor, sit, amet, consectetur, adipiscing, elit, vivamus, condimentum,
sagittis, lacus, laoreet, luctus, ligula, laoreet, ut, vestibulum, ullamcorper, accumsan,
velit, vel, vehicula, proin, tempor, lacus, arcu, nunc, at, elit, condimentum, semper,
nisi, et, condimentum, mi, in, venenatis, blandit, nibh, at, sollicitudin, vestibulum,
dapibus, mauris, at, orci, maximus, pellentesque, nullam, id, elementum, ipsum,
suspendisse, cursus, lobortis, viverra, proin, et, erat, at, mauris, tincidunt,
porttitor, vitae, ac, dui, donec, vulputate, lorem, tortor, nec, fermentum, nibh,
bibendum, vel, lorem, ipsum, dolor, sit, amet, consectetur, adipiscing, elit, praesent,
dictum, luctus, massa, non, euismod, lacus, pellentesque, condimentum, dolor, est, ut,
dapibus, lectus, luctus, ac, ...
So this is generating the words, ignoring newlines and blank lines properly.
We have now done the first two lines of the Unix pipe, with three lines of (in the last case admittedly rather complex!) Python.
Next in McIlroy’s pipeline is the
sort
. This is not a generator friendly operation, as it needs the entire file of words in memory to be sorted. But we can do the job of the following line,
uniq -c
, without having to sort first, by building a dictionary of the occurrences. We still need to read the entire file to do this, but because we iterate over the word
generator, we can do this incrementally. So we don’t need to store the entire file in memory, which becomes important if the file is large.
We can use a default dictionary to save us needing special code when a new word is discovered.
from collections import defaultdict
def build_word_dict(word_gen):
# construct and return a dictionary of (word,count) items
wdict = defaultdict(int)
for word in word_gen:
wdict[word] += 1
return wdict
word_dict = build_word_dict(words)
Let's have a look at some properties of the constructed dictionary:# print words with more than 15 occurrences
print({k:v for k,v in word_dict.items() if v>15})
print('total number of words =', sum([v for k,v in word_dict.items()]))
print('number of unique words =', len(word_dict))
{'sit': 19, 'sed': 25, 'ut': 18, 'tincidunt': 19, 'metus': 16, 'nulla': 21, 'et': 30,
'amet': 19, 'at': 19, 'ipsum': 21, 'non': 21, 'lorem': 21, 'in': 25, 'pellentesque': 17}
total number of words = 1368
number of unique words = 175
Opening up the input file in Chrome and searching for pellentesque gives 17 hits, so that’s looking good.The dictionary (175 items) is much smaller than the entire file (1368 words), drastically saving on memory. We can now sort on the dictionary values to get a list of words in frequency order, then slice off the top k of them.
k = 10 # the "integer k" in the spec; k most common words
top_word_keys = sorted(word_dict, key=word_dict.get, reverse=True)
for key in islice(top_word_keys,k):
print(word_dict[key], key)
30 et
25 in
25 sed
21 non
21 lorem
21 nulla
21 ipsum
19 sit
19 tincidunt
19 at
And we’re done! Putting it all together, and running on Great Expectations:from itertools import *
import urllib.request
import re
from collections import defaultdict
textsource = "http://www.sample-videos.com/text/Sample-text-file-10kb.txt"
textgreatexpectations = "http://www.gutenberg.org/files/1400/1400-0.txt"
k = 20 # the "integer k" in the spec; k most common words
def build_word_dict(word_gen):
# construct and return a dictionary of (word,count) items
wdict = defaultdict(int)
for word in word_gen:
wdict[word] += 1
return wdict
infile = urllib.request.urlopen(textgreatexpectations)
lines = (line.decode() for line in infile)
words = (w.group(0).lower() for line in lines for w in re.finditer(r"[a-zA-Z]+", line) )
word_dict = build_word_dict(words)
top_word_keys = sorted(word_dict, key=word_dict.get, reverse=True)
for key in islice(top_word_keys,k):
print(word_dict[key], key)
8321 the
7166 and
6666 i
5235 to
4557 of
4112 a
3090 that
3083 in
2837 was
2811 it
2382 you
2263 he
2093 had
2070 my
1998 me
1860 his
1807 with
1786 as
1654 at
1432 on
So that’s the problem solved, using generators in many places.
It’s not as compact as the Unix shell script, but it’s still quite remarkably compact.
And it improves the memory usage by not requiring the whole file to be read in to memory at once in order to sort the individual words:
print('total number of words =', sum([v for k,v in word_dict.items()]))
print('number of unique words =', len(word_dict))
total number of words = 192016
number of unique words = 10989
Although that’s the problem as stated solved, it might not be what you are expecting.
All the most common words are … common words.
Any large text will give similar results.
We can remove the ‘uninteresting’ common words – called stop words – from the dictionary first:# get full stop word list from https://tinyurl.com/yce6ttyc
stop_words = [ "a", "about", "above", "after", "again", "against", "all", "am", ... ];
word_dict_stopped = { k:v for k,v in word_dict.items() if k not in stop_words }
print('number of stop words =', len(stop_words))
print({k:v for k,v in word_dict_stopped.items() if v>600})
print('total number of words after stop =', sum([v for k,v in word_dict_stopped.items()]))
print('number of unique words after stop =', len(word_dict_stopped))
number of stop words = 153
{'said': 1349, 't': 682, 'no': 655, 'joe': 747, 'not': 1088, 's': 1194, 'mr': 711}
total number of words after stop = 89529
number of unique words after stop = 10870
We can then look at the most frequent words after the stop words have been removed:top_word_keys = sorted(word_dict_stopped, key=word_dict.get, reverse=True)
for key in islice(top_word_keys,k):
print(word_dict[key], key)
1349 said
1194 s
1088 not
747 joe
711 mr
682 t
655 no
514 one
453 now
392 know
383 miss
375 come
374 time
371 little
368 upon
341 pip
327 like
325 looked
321 man
318 havisham
That is a bit more interesting: we can see pip and havisham, which are special for this particular text.
Also, there is clearly a lot of conversation: said is the most common non-stop word.We also see a couple of rather strange ‘words’: s and t. This is because the original assumptions of no apostrophes (and also of no hyphens) doesn’t actually hold for the Great Expectations file, so a more sophisticated definition of a ‘word’ is needed. But that’s another story for another time.
No comments:
Post a Comment