Wednesday, 11 December 2013

Beautiful Soup

A while back I changed some of the design of my website to use more CSS.  This left quite a few pages with an <HR> (horizontal rule) at the bottom, which was how I previously gave a visual indication of the end of the page.  With the new CSS styling, this isn’t needed.  But how to remove them all?  My website has over 21,000 files.  Okay, it’s not that bad: there are only just over 1000 html files that are not auto-generated.  Nevertheless, there’s no way I was going to search and edit 1000 files by hand.  I’d need to write a program.  A program that could parse and edit html.

In my wanderings, I stumbled across Beautiful Soup, a Python library designed to do just that. I had a little spare time today, so I thought I’d investigate it. A little while reading the documentation (yes, really!), and I’d written and tested the <HR> remover:
from bs4 import BeautifulSoup
def delete_final_hr(soup):
    changed = False  
    all_tags = soup.find_all(True)
    if not all_tags:    
        # should never be a tag-less file, but...
        return changed
        
    final_tag = all_tags[-1]
    if final_tag.name == u'hr':
        final_tag.decompose()
        changed = True
    return changed
Find all the tags. If the final one is an <HR>, remove it. And that’s it!
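
To see the behaviour on something smaller than a whole website, here is a minimal sketch that runs the same logic on two made-up HTML strings. The sample markup, the variable names, and the explicit 'html.parser' argument are assumptions made for illustration; the function is a condensed copy of the one above.

```python
from bs4 import BeautifulSoup

def delete_final_hr(soup):
    """Remove the last tag in the document if it is an <hr>; return True if removed."""
    all_tags = soup.find_all(True)   # every tag, in document order
    if all_tags and all_tags[-1].name == 'hr':
        all_tags[-1].decompose()     # delete the tag from the tree
        return True
    return False

# hypothetical sample pages: one ends with an <hr>, the other has it mid-page
with_hr = BeautifulSoup('<html><body><p>Text</p><hr></body></html>', 'html.parser')
mid_hr = BeautifulSoup('<html><body><hr><p>Text</p></body></html>', 'html.parser')

removed_final = delete_final_hr(with_hr)   # the trailing <hr> is removed
removed_mid = delete_final_hr(mid_hr)      # the <hr> is not last, so it stays
```

Only an <HR> that is the very last tag in the file is touched; one in the middle of a page is left alone.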

Emboldened by this, I thought of the other change I wanted to make: fix up the ALT/TITLE mess in my IMG tags.  In the early days of my website, I was using the ALT attribute not only as an actual ALT attribute, but also as a stand-in TITLE attribute.  This was because I’d noticed that the ALT text appeared as hover text in the browser I was then using.  I’d thought that was how it was supposed to work, because I didn’t read the documentation (yes, really).  Later I discovered the difference, and have recently been using ALT and TITLE properly.  Also, I hadn’t bothered to give ALT attributes to images that are merely decorative, but I’ve recently learned that they should have ALT="" to be properly skipped by screen readers.  So I wanted to fix that, too.  So, back to Beautiful Soup, to write an ALT modifier.
def update_alt(soup):
    changed = False  
    all_img = soup.find_all('img')
    for img in all_img:
        if 'alt' in img.attrs:
            if 'title' in img.attrs:
                # both present, do nothing
                pass
            elif img['alt'] != '':
                # non-empty alt only, copy to title
                img['title'] = img['alt']
                changed = True
        else:
            if 'title' in img.attrs:
                # title only, copy to alt
                img['alt'] = img['title']
                changed = True
            else:
                # neither; make an empty alt
                img['alt'] = ''
                changed = True
    return changed
Find all the image tags.  For each tag, if there’s both an ALT and a TITLE, do nothing; if there’s only a (non-empty) ALT, copy its text into the TITLE; if there’s only a TITLE, copy its text into the ALT; if there’s neither, make an empty ALT. And that’s it!
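
The four cases can be checked without parsing any HTML at all. The following is a sketch only: it applies the same rule to plain dicts standing in for the attributes of an image tag. The function name update_alt_attrs and the sample attribute values are hypothetical, introduced for this example.

```python
def update_alt_attrs(attrs):
    """Apply the ALT/TITLE rule to a dict of attributes; return True if it changed."""
    has_alt = 'alt' in attrs
    has_title = 'title' in attrs
    if has_alt and has_title:
        return False                    # both present, do nothing
    if has_alt:
        if attrs['alt'] == '':
            return False                # empty alt marks a decorative image
        attrs['title'] = attrs['alt']   # non-empty alt only, copy to title
        return True
    if has_title:
        attrs['alt'] = attrs['title']   # title only, copy to alt
        return True
    attrs['alt'] = ''                   # neither; make an empty alt
    return True

# hypothetical attribute sets, one per case (plus the empty-alt case)
cases = [
    {'src': 'a.png', 'alt': 'A cat', 'title': 'My cat'},
    {'src': 'b.png', 'alt': 'A dog'},
    {'src': 'c.png', 'alt': ''},
    {'src': 'd.png', 'title': 'A bird'},
    {'src': 'e.png'},
]
results = [update_alt_attrs(attrs) for attrs in cases]
```

Note the fifth case that the prose only hints at: an image that already has an empty ALT and no TITLE is left as it is.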

Easy peasy.  Now all I had to do was loop over all the files in the directory structure. Naturally, there’s a library for that.
import os
path = '~susan'
for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith('.htm'):
            path_name = os.path.join(root, name)
            # read as bytes so Beautiful Soup can work out the encoding;
            # 'html.parser' (the standard library parser) is named explicitly
            with open(path_name, 'rb') as infile:
                soup = BeautifulSoup(infile, 'html.parser')

            changed = delete_final_hr(soup)
            changed = update_alt(soup) or changed

            # write out the modified file, in ascii
            if changed:
                with open(path_name, 'wb') as outfile:
                    outfile.write(convert_to_entity_names(soup))
Walk the directory structure.  For each file, if it is an html file, read it into soup, delete the final <HR>, update the ALT/TITLE fields; if this has changed the file, write it back out.  And that’s it!
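
Before letting a loop like this rewrite files, it can be reassuring to do a dry run that only counts what it would touch. Here is a minimal sketch of that idea, run against a small throwaway directory tree; the function name find_htm_files and the file names are hypothetical, chosen for the example.

```python
import os
import tempfile

def find_htm_files(top):
    """Return (number of files seen, list of .htm paths) for the tree under top."""
    seen = 0
    htm_paths = []
    for root, dirs, files in os.walk(top):
        for name in files:
            seen += 1
            if name.endswith('.htm'):
                htm_paths.append(os.path.join(root, name))
    return seen, htm_paths

# build a tiny temporary tree: two .htm files and one stylesheet
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, 'sub'))
    for rel in ('index.htm', 'style.css', os.path.join('sub', 'page.htm')):
        with open(os.path.join(top, rel), 'w') as f:
            f.write('')
    seen, htm_paths = find_htm_files(top)
    names = sorted(os.path.basename(p) for p in htm_paths)
```

Nothing is opened for editing here, so it is safe to point at the real site to check the counts first.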

Well, actually, that’s not quite it.  The thing that gave me the most grief was the Unicode aspects.  My first attempt to write the file back out resulted in mangled non-breaking spaces and bullet characters, which were being written out in Unicode, not as html entities.  A bit of hunting around on the web led, as ever, to the invaluable Stack Overflow site, which pointed me in the right direction.  After a bit of fiddling, I wrote the adequate, but not totally perfect, solution of:
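def convert_to_entity_names(soup):
    # non-ASCII characters become numeric references such as &#160;
    # (errors is passed by keyword: the second positional parameter
    # of encode() is indent_level, not errors)
    encoded = soup.encode('ascii', errors='xmlcharrefreplace')
    # encode() returns bytes, so the replacement uses bytes literals
    encoded = encoded.replace(b'&#160;', b'&nbsp;')
    return encoded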
And that really is it!
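
The solution above gives a name only to the non-breaking space; every other non-ASCII character stays as a numeric reference. One possible generalisation, sketched here on a plain string rather than a soup, looks each numeric reference up in the standard library's html.entities.codepoint2name table and substitutes the name where one exists. The function name to_entity_names and the sample string are assumptions for illustration, not part of the original script.

```python
import re
from html.entities import codepoint2name

# matches decimal numeric character references such as &#160;
NUMERIC_REF = re.compile(rb'&#(\d+);')

def _named(match):
    """Return the named entity for a numeric reference, if HTML defines one."""
    name = codepoint2name.get(int(match.group(1)))
    if name is None:
        return match.group(0)          # no name known: keep the numeric form
    return b'&' + name.encode('ascii') + b';'

def to_entity_names(text):
    """Encode text as ASCII bytes, preferring named entities over numeric ones."""
    encoded = text.encode('ascii', 'xmlcharrefreplace')
    return NUMERIC_REF.sub(_named, encoded)

# hypothetical sample: a non-breaking space, a bullet, and a snowman
sample = 'one\u00a0two \u2022 three \u2603'
converted = to_entity_names(sample)
# converted == b'one&nbsp;two &bull; three &#9731;'
```

The snowman has no named entity in that table, so it is left in numeric form, which browsers read just as well.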

I ran the code over (a copy of) my website.  It looked at 3690 files, opened the 1076 of them that were html, and updated 853 of those, all in about 5 seconds.

I love Python (and all its libraries).

I’m not so keen on Unicode.
