Counting Words with Python

Posted on Thu 19 March 2020 in Python • 5 min read

Recursively counting words in markdown within a folder

In the last 55 posts, I've written 34846 words in this blog. In comparison, my engineering thesis for graduating university was 9916 words across 69 pages.

A question popped into my mind this week: "How many words have I written in this blog?". As usual, we'll go into how to use Python to solve this problem. It's also an opportunity to get more familiar with pathlib, a pleasant way to handle file paths within Python programs.

In [1]:
import os
import re
import pathlib
import io
from typing import List

A great way to start a project is to see if it's already been done before. @gandreadis over at GitHub has already tackled this problem on a single-file basis: https://github.com/gandreadis/markdown-word-count.

This means all we need to do is:

  • Recursively loop through a folder and find all the markdown files
  • Call @gandreadis's function to get the count of words for each file
  • Sum total words

First off, let's bring @gandreadis's function into our program. It's been modified slightly to take in a file path, read the file contents into a variable, clean up the contents, and return the length of an array split by spaces.

For example, if the sentence (string) is 'How many words have I written in this blog?', calling .split returns an array of the words split by spaces, allowing us to calculate how many words there are by counting the number of elements in the array.

In [2]:
testSentence = 'How many words have I written in this blog?'
print(testSentence.split())
['How', 'many', 'words', 'have', 'I', 'written', 'in', 'this', 'blog?']
In [3]:
# Source: https://github.com/gandreadis/markdown-word-count
def count_words_in_markdown(filePath: str):
    with open(filePath, 'r', encoding='utf8') as f:
        text = f.read()

    # Comments
    text = re.sub(r'<!--(.*?)-->', '', text, flags=re.MULTILINE)
    # Tabs to spaces
    text = text.replace('\t', '    ')
    # More than 1 space to 4 spaces
    text = re.sub(r'[ ]{2,}', '    ', text)
    # Footnotes
    text = re.sub(r'^\[[^]]*\][^(].*', '', text, flags=re.MULTILINE)
    # Indented blocks of code
    text = re.sub(r'^( {4,}[^-*]).*', '', text, flags=re.MULTILINE)
    # Replace newlines with spaces for uniform handling
    text = text.replace('\n', ' ')
    # Custom header IDs
    text = re.sub(r'{#.*}', '', text)
    # Remove images
    text = re.sub(r'!\[[^\]]*\]\([^)]*\)', '', text)
    # Remove HTML tags
    text = re.sub(r'</?[^>]*>', '', text)
    # Remove special characters
    text = re.sub(r'[#*`~\-–^=<>+|/:]', '', text)
    # Remove footnote references
    text = re.sub(r'\[[0-9]*\]', '', text)
    # Remove enumerations
    text = re.sub(r'[0-9#]*\.', '', text)

    return len(text.split())
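
To see a couple of those substitutions in action, here's a quick illustration on a made-up sentence (the sample text is invented for this post):

```python
import re

# Invented sample text demonstrating two of the cleanup steps above
text = "Some **bold** text <!-- hidden --> and a [link](https://example.com)."

# Strip HTML comments
text = re.sub(r'<!--(.*?)-->', '', text, flags=re.MULTILINE)
# Strip markdown special characters
text = re.sub(r'[#*`~\-–^=<>+|/:]', '', text)

print(len(text.split()))  # 6
```

Note that the link's bracket syntax survives as a single "word"; the cleanup is approximate, which is fine for a rough total.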

We start by defining the 'top' directory to search within. We call this the 'top' folder as we go 'down' into the folders nested inside it. Within pathlib, we can access elements using glob. Glob patterns aren't very intuitive for me personally, so a great resource for testing them is https://globster.xyz/.

The glob pattern elements we will use are:

  • ** - Feature known as globstar. Matches all files and zero or more directories and subdirectories.
  • * - Matches any string.
  • ? - Matches single character.
  • .md - Matches files with a .md suffix.

For this particular application, the posts are structured within folders representing the year they were written, so we use the 20?? glob pattern to exclude other folders.
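
Before pointing this at the real blog folder, we can exercise the pattern against a tiny throwaway tree (the folder and file names below are invented for illustration):

```python
import pathlib
import tempfile

# Build a small throwaway folder tree to exercise the pattern
root = pathlib.Path(tempfile.mkdtemp())
(root / '2019').mkdir()
(root / '2020' / 'counting-words').mkdir(parents=True)
(root / 'drafts').mkdir()
(root / '2019' / 'old-post.md').touch()
(root / '2020' / 'counting-words' / 'post.md').touch()
(root / 'drafts' / 'not-counted.md').touch()

# Only markdown files under a 20?? year folder are matched
matches = sorted(p.relative_to(root).as_posix() for p in root.glob('**/20??/**/*.md'))
print(matches)  # ['2019/old-post.md', '2020/counting-words/post.md']
```

Note that ** matches zero or more directories, so both a file directly inside 2019 and one nested deeper inside 2020 are found, while the drafts folder is skipped.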

In [4]:
# Top directory to search through
topFolder: pathlib.Path = pathlib.Path('C:/Users/jackm/Documents/GitHub/jackmckew.dev/content')

allMarkdown: List[pathlib.Path] = []

# Iterate through all files using pathlib
for singleFile in topFolder.glob('**/20??/**/*.md'):
    allMarkdown.append(singleFile)

# If you don't want to use glob patterns, can easily use the 'suffix' within the path variable.
# for singleFile in topFolder.glob('**/*'):
#     if singleFile.suffix == '.md':
#         allMarkdown.append(singleFile)

print(len(allMarkdown))
55

We can see that there are 55 markdown files within the folder, and all their paths are stored within the list.

All we need to do now is loop over all of them, passing the path into a function defined previously to get the total word count.

In [5]:
totalWordCountMarkdown: int = 0
for singleFile in allMarkdown:
    totalWordCountMarkdown += count_words_in_markdown(singleFile)

print(totalWordCountMarkdown)
31380

There you have it! 31380 words across all the markdown files. In comparison, my engineering thesis for graduating university was 9916 words across 69 pages.

What about Notebooks?

This post has been written in a Jupyter notebook; these files (.ipynb) are, at the base level, formatted as JSON. This means the tool we've just created won't capture any of the Jupyter notebooks within the folder, and this will not stand!

When working with Jupyter notebooks, everything is broken into 'cells'. Cells can be markdown, heading, code, or output cells.

  • Markdown cells contain the written explanation or notes around some code
  • Heading cells (denoted by #) allow for navigable headings
  • Code cells contain runnable code through a runtime
  • Output cells contain the output from the code cell that precedes it
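
Because a notebook is just JSON underneath, we can peek at the structure directly. This is a minimal, hand-written fragment for illustration only; real .ipynb files carry extra metadata (and older format versions nest cells under 'worksheets'):

```python
import json

# A minimal, hand-written notebook fragment for illustration
notebook_json = '''
{
  "cells": [
    {"cell_type": "markdown", "source": "Some **notes** about the code"},
    {"cell_type": "code", "source": "print('hello')", "outputs": []}
  ]
}
'''

nb = json.loads(notebook_json)
for cell in nb['cells']:
    print(cell['cell_type'], '-', len(cell['source'].split()), 'words')
# markdown - 5 words
# code - 1 words
```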

So let's create a function which will count all the words within a Jupyter notebook's markdown cells.

We follow these steps:

  1. Read the notebook as a JSON file
  2. Loop through all the cells within the notebook
  3. If the cell type matches the type we want, count the words using a similar method
  4. Sum all the counts to get the total for the notebook
In [6]:
import nbformat

def count_words_in_jupyter(filePath: str, returnType: str = 'markdown'):
    with io.open(filePath, 'r', encoding='utf-8') as f:
        # Read the notebook, converting to format version 4 if needed
        nb = nbformat.read(f, as_version=4)

    word_count_markdown: int = 0
    word_count_heading: int = 0
    word_count_code: int = 0
    for cell in nb.cells:
        if cell.cell_type == "markdown":
            word_count_markdown += len(cell['source'].replace('#', '').lstrip().split(' '))
        elif cell.cell_type == "heading":
            word_count_heading += len(cell['source'].replace('#', '').lstrip().split(' '))
        elif cell.cell_type == "code":
            word_count_code += len(cell['source'].replace('#', '').lstrip().split(' '))

    if returnType == 'markdown':
        return word_count_markdown
    elif returnType == 'heading':
        return word_count_heading
    elif returnType == 'code':
        return word_count_code
    else:
        raise ValueError(f"Unexpected returnType: {returnType}")

Now, similar to our markdown loop, let's reproduce it to find and count words inside Jupyter notebooks.

Jupyter creates checkpoint.ipynb files which contain a snapshot of the notebook each time it is manually saved, implementing a basic built-in version control system. We don't want to count the words in these, as we'd be counting the same content over and over again.

In [7]:
# Top directory to search through
topFolder: pathlib.Path = pathlib.Path('C:/Users/jackm/Documents/GitHub/jackmckew.dev/content')

allJupyter: List[pathlib.Path] = []

# Iterate through all files using pathlib
for singleFile in topFolder.glob('**/20??/**/*.ipynb'):
    if 'checkpoint' not in singleFile.name:
        allJupyter.append(singleFile)

print(len(allJupyter))
5

We found 5 notebooks! Now let's loop over them and count all the words in the markdown cells.

In [8]:
totalWordCountJupyter: int = 0
for singleFile in allJupyter:
    totalWordCountJupyter += count_words_in_jupyter(singleFile,'markdown')

print(totalWordCountJupyter)
3466

There you have it! 3466 words across all the Jupyter notebooks.

This brings our total to 34846!

In [9]:
print(totalWordCountJupyter + totalWordCountMarkdown)
34846

Out of interest, let's see how many words live in code within the Jupyter notebooks.

In [10]:
totalWordCountJupyter: int = 0
for singleFile in allJupyter:
    totalWordCountJupyter += count_words_in_jupyter(singleFile,'code')

print(totalWordCountJupyter)
3292