In my wanderings, I stumbled across Beautiful Soup, a Python library designed to do just that. I had a little spare time today, so I thought I’d investigate it. A little while reading the documentation (yes, really!), and I’d written and tested the <HR> remover:
from bs4 import BeautifulSoup
def delete_final_hr(soup):
    changed = False
    all_tags = soup.find_all(True)
    if not all_tags:
        # should never be a tag-less file, but...
        return changed
    final_tag = all_tags[-1]
    if final_tag.name == u'hr':
        final_tag.decompose()
        changed = True
    return changed

Find all the tags. If the final one is an <HR>, remove it. And that’s it!
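It may help to see what "the final tag" means here. A minimal sketch, assuming a made-up page and the built-in "html.parser" backend (neither is from the original script): find_all(True) lists tags in the order they open, so a trailing <HR> counts as last even though <BODY> and <HTML> close after it.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the <hr> is the last tag to open.
html = "<html><body><p>Some text</p><hr></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Tags come back in document order, by where each one opens.
tag_names = [tag.name for tag in soup.find_all(True)]

# Same check as delete_final_hr: drop the last tag if it is an <hr>.
last = soup.find_all(True)[-1]
if last.name == "hr":
    last.decompose()
result = str(soup)
```

An <HR> followed by any other tag, even an empty paragraph, would not be the last entry and so would be left alone.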
Emboldened by this, I thought of the other change I wanted to make: fix up the ALT/TITLE mess in my IMG tags. In the early days of my website, I was using the ALT attribute not only as an actual ALT attribute, but also as a stand-in TITLE attribute. This was because I’d noticed that the ALT text appeared as hover text in the browser I was then using. I’d thought that was how it was supposed to work, because I didn’t read the documentation (yes, really). Later I discovered the difference, and have recently been using ALT and TITLE properly. Also, I hadn’t bothered to give ALT attributes to images that are merely decorative, but I’ve recently learned that they should have ALT="" to be properly skipped by screen readers. So I wanted to fix that, too. So, back to Beautiful Soup, to write an ALT modifier.
def update_alt(soup):
    changed = False
    all_img = soup.find_all('img')
    for img in all_img:
        if 'alt' in img.attrs:
            if 'title' in img.attrs:
                # both present, do nothing
                pass
            elif img['alt'] != '':
                # non-empty alt only, copy to title
                img['title'] = img['alt']
                changed = True
        else:
            if 'title' in img.attrs:
                # title only, copy to alt
                img['alt'] = img['title']
                changed = True
            else:
                # neither; make an empty alt
                img['alt'] = ''
                changed = True
    return changed

Find all the image tags. For each tag, if there’s both an ALT and a TITLE, do nothing; if there’s only a (non-empty) ALT, copy its text into the TITLE; if there’s only a TITLE, copy its text into the ALT; if there’s neither, make an empty ALT. And that’s it!
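The rules are easier to check as a table of cases. Here is a small sketch that applies the same decisions to a plain dict of attributes instead of a Beautiful Soup tag; the helper name fix_alt_title and the sample images are hypothetical, made up for illustration.

```python
def fix_alt_title(attrs):
    """Apply the ALT/TITLE rules to a dict of attributes; return True if it changed."""
    has_alt = 'alt' in attrs
    has_title = 'title' in attrs
    if has_alt and has_title:
        return False                      # both present, do nothing
    if has_alt:
        if attrs['alt'] != '':
            attrs['title'] = attrs['alt']  # non-empty alt only, copy to title
            return True
        return False                      # empty alt marks a decorative image
    if has_title:
        attrs['alt'] = attrs['title']      # title only, copy to alt
        return True
    attrs['alt'] = ''                      # neither; make an empty alt
    return True

# Hypothetical images, one per case.
cases = [
    {'src': 'a.png', 'alt': 'A cat', 'title': 'My cat'},
    {'src': 'b.png', 'alt': 'A dog'},
    {'src': 'c.png', 'alt': ''},
    {'src': 'd.png', 'title': 'A fish'},
    {'src': 'e.png'},
]
results = [fix_alt_title(case) for case in cases]
```

The third case is the one worth noticing: an image that already has ALT="" is left alone, so running the script a second time changes nothing.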
Easy peasy. Now all I had to do was loop over all the files in the directory structure. Naturally, there’s a library for that.
import os

path = '~susan'
for root, dirs, files in os.walk(path):
    for name in files:
        if name[-4:] == '.htm':
            changed = False
            path_name = os.path.join(root, name)
            with open(path_name) as file:
                soup = BeautifulSoup(file)
            changed = delete_final_hr(soup) or changed
            changed = update_alt(soup) or changed
            # write out the modified file, in ascii
            if changed:
                with open(path_name, 'wb') as file:
                    file.write(convert_to_entity_names(soup))

Walk the directory structure. For each file, if it is an HTML file, read it into soup, delete the final <HR>, update the ALT/TITLE fields; if this has changed the file, write it back out. And that’s it!
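The file-finding part of that loop can be tried on its own, without touching any real pages. A minimal sketch, assuming a hypothetical helper find_html_files and a throwaway directory; unlike the loop above, it also accepts ".html" and upper-case extensions, which is an assumption rather than what the original script does.

```python
import os
import tempfile

def find_html_files(root, suffixes=('.htm', '.html')):
    """Return the paths under root whose names end in one of suffixes."""
    matches = []
    for folder, dirs, files in os.walk(root):
        for name in files:
            # str.endswith accepts a tuple, so one call covers both spellings.
            if name.lower().endswith(suffixes):
                matches.append(os.path.join(folder, name))
    return sorted(matches)

# Build a tiny throwaway site so the walk has something to find.
with tempfile.TemporaryDirectory() as site:
    os.mkdir(os.path.join(site, 'sub'))
    for rel in ('index.htm', 'notes.html', 'logo.png',
                os.path.join('sub', 'page.HTM')):
        with open(os.path.join(site, rel), 'w') as f:
            f.write('<p>placeholder</p>')
    found = [os.path.relpath(p, site) for p in find_html_files(site)]
```

One thing to keep in mind: os.walk does not expand a leading "~", so a path like '~susan' is treated as a literal folder name unless it is passed through os.path.expanduser first.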
Well, actually, that’s not quite it. The thing that gave me the most grief was the Unicode handling. My first attempt to write the file back out resulted in mangled non-breaking spaces and bullet characters, which were being written out as raw Unicode characters, not as HTML entities. A bit of hunting around on the web led as ever to the invaluable Stack Overflow site, which pointed me in the right direction. After a bit of fiddling, I wrote the adequate, but not totally perfect, solution of:
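def convert_to_entity_names(soup):
    # errors is passed by keyword: the second positional argument of encode() is indent_level
    ascii = soup.encode('ascii', errors='xmlcharrefreplace')
    # encode() returns bytes; swap the numeric non-breaking space for its entity name
    ascii = ascii.replace(b'&#160;', b'&nbsp;')
    return ascii

And that really is it!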
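The same idea can be seen on an ordinary string, with no soup involved. The sketch below assumes Python 3 and a hypothetical helper, to_named_entities, that goes one step further than the function above: after 'xmlcharrefreplace' has turned every non-ASCII character into a numeric reference, it swaps in the entity name wherever the standard library's HTML 4 table (html.entities.codepoint2name) has one.

```python
import re
from html.entities import codepoint2name

def to_named_entities(text):
    """Encode text as ASCII, preferring named entities over numeric ones."""
    # Step 1: non-ASCII characters become numeric references such as &#160;
    encoded = text.encode('ascii', 'xmlcharrefreplace').decode('ascii')

    # Step 2: replace each numeric reference that has a known name.
    def swap(match):
        name = codepoint2name.get(int(match.group(1)))
        return '&%s;' % name if name else match.group(0)

    return re.sub(r'&#(\d+);', swap, encoded)

# Sample text: non-breaking spaces, a bullet, and a snowman (which has no name).
sample = 'Tea\u00a0\u2022\u00a0Coffee \u2603'
numeric = sample.encode('ascii', 'xmlcharrefreplace')
named = to_named_entities(sample)
```

Characters without an entry in the table, like the snowman here, simply stay as numeric references, which browsers display just as well.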
I ran the code over (a copy of) my website. It looked at 3690 files, opened the 1076 of them that were HTML, and updated 853 of those, all in about 5 seconds.
I love Python (and all its libraries).
I’m not so keen on Unicode.