Using Linux Tools to get a Word Count of a Subtitle File
Using the GNU-Linux tools wc and grep to get a word-count and word-frequency table of a subtitle (.srt) file.
by Brett Coulstock.
I use subtitling software to write audio description cues for film and television. In essence, this means using the software to subtitle the gaps between the dialogue rather than the dialogue itself.
At the end of this process I have a .srt file that contains lines like this:
1
00:00:02,875 --> 00:00:03,916
The sun rises over a bleak, sandy desert.
2
00:00:07,833 --> 00:00:08,958
In the distance, kicking up dust and sand, a rider.
3
00:00:13,250 --> 00:00:14,291
It's a circus clown on a giraffe.
So each block consists of:
- The cue number on one line
- The in-point in hours, minutes, seconds, milliseconds, followed by an arrow --> followed by the out-point
- The text of the subtitle
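If you want to follow along, you could recreate the dummy file with a here-document (the filename subtitles.srt is just what I'll use in the examples):

```shell
# Recreate the three-cue example file using a here-document.
cat > subtitles.srt << 'EOF'
1
00:00:02,875 --> 00:00:03,916
The sun rises over a bleak, sandy desert.

2
00:00:07,833 --> 00:00:08,958
In the distance, kicking up dust and sand, a rider.

3
00:00:13,250 --> 00:00:14,291
It's a circus clown on a giraffe.
EOF
```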
At the end of a project with a large number of cues I was curious about the actual word count.
Now, the program I was using to do the describing doesn't have a facility for showing a word-count. However, the .srt file is just plain text. Linux (and other Unix-based operating systems) have a vast array of powerful tools for manipulating and reporting on plain text files.
First Try: Counting the Words
Linux has a utility called wc. The manual for wc describes the function of the program as: "print newline, word, and byte counts for each file". Running it on our subtitle file shows:
$ wc subtitles.srt
 11  37 226 subtitles.srt
So: 11 lines, 37 words, and 226 bytes. However, if we count the actual words in the file it comes to 25. wc is interpreting other text in the file as words.
We don't want that. We want just the text of the subtitle, not anything else.
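Incidentally, wc can also report each figure on its own: -l for lines, -w for words, -c for bytes. A quick sketch on a throwaway file (the filename is just for demonstration):

```shell
# Ask wc for a single figure instead of all three at once.
printf 'one two three\nfour five\n' > demo.txt
wc -l demo.txt   # number of lines:  2
wc -w demo.txt   # number of words:  5
wc -c demo.txt   # number of bytes: 24
```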
Finding the Words with Grep
We need to find the lines that contain the words we want to count. This is easy. They are:
- Lines that do not consist solely of one or more numbers.
- Lines that do not contain the arrow symbol -->
Grep is a tool that searches through text files for a pattern and prints those matching lines. For example:
$ grep "sand" subtitles.srt
The sun rises over a bleak, sandy desert.
In the distance, kicking up dust and sand, a rider.
$ grep "giraffe" subtitles.srt
It's a circus clown on a giraffe.
$ grep "2" subtitles.srt
00:00:02,875 --> 00:00:03,916
2
00:00:13,250 --> 00:00:14,291
Searching for "sand" returns two lines; searching for "giraffe" returns one; searching for "2" returns three lines.
When we defined our problem, we defined it in terms of the text we didn't want. So let's find all the things we don't want.
Finding the Cue Numbers
We're not interested in the cue-numbers. This command will return just the lines containing a cue-number. (It matches any line that consists solely of one or more numbers).
$ grep -E "^[1234567890]*$" subtitles.srt
1

2

3
So it returned all the cue-numbers. It also returned all the empty lines, because the pattern actually specifies a line consisting of zero or more numbers; an empty line contains zero numbers, so it matches. That's fine, because we don't want those lines either.
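As an aside, the digit list can be written more compactly as the range [0-9], and the empty-line quirk avoided by using + (one or more) instead of * (zero or more). A small demonstration on an inline sample (equivalent patterns, not the exact command used in the rest of this article):

```shell
# [0-9]+ matches lines of one or more digits; empty lines no longer match.
printf '1\n\nThe sun rises.\n42\n' | grep -E '^[0-9]+$'
# prints:
# 1
# 42
```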
Finding the Timing Lines
The other lines we're not interested in are the timings.
Unique to those lines is the arrow, the "-->" characters.
$ grep -E "\-\->" subtitles.srt
00:00:02,875 --> 00:00:03,916
00:00:07,833 --> 00:00:08,958
00:00:13,250 --> 00:00:14,291
Grep pulls the three lines out. If I'm ever interested in doing things with those lines, perhaps calculating the duration, I know grep can isolate them.
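For example, here's a sketch of that idea (an aside of my own, not part of the word-count task): piping a timing line through awk, splitting on colons, commas and spaces, turns it into a duration in milliseconds:

```shell
# Split "HH:MM:SS,mmm --> HH:MM:SS,mmm" on ":", "," and " ",
# then convert both timestamps to milliseconds and subtract.
echo "00:00:02,875 --> 00:00:03,916" |
awk -F '[:, ]' '{
    in_ms  = (($1 * 60 + $2) * 60 + $3) * 1000 + $4
    out_ms = (($6 * 60 + $7) * 60 + $8) * 1000 + $9
    print out_ms - in_ms " ms"
}'
# prints: 1041 ms
```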
Combining the Patterns
Now, to get all those lines we combine the two patterns.
$ grep -E "(\-\->)|(^[0123456789]*$)" subtitles.srt
1
00:00:02,875 --> 00:00:03,916

2
00:00:07,833 --> 00:00:08,958

3
00:00:13,250 --> 00:00:14,291
Hooray, we now have everything in the file that's not what we want.
Grep has a way of reversing itself. You can tell it to return, not all the matching lines, but all the lines that don't match. We just pass it the "-v" option.
$ grep -E -v "(\-\->)|(^[0123456789]*$)" subtitles.srt
The sun rises over a bleak, sandy desert.
In the distance, kicking up dust and sand, a rider.
It's a circus clown on a giraffe.
We now have all the words we want to count!
2nd Try: Counting the Words
We now have the words, but they are not in a file on disk; they are ephemeral, just printed on the screen.
Linux commands can be chained: they can take the output of one and feed it to another. In other words, we can take the lines isolated by grep and feed them to wc to count the words.
$ grep -E -v "(\-\->)|(^[0123456789]*$)" subtitles.srt | wc -w
25
25 words is right!
Saving the Command
Once you work out a useful command like that, one you might use again, it's a good idea to make it general, so that any time you want a word count of an .srt file you can just say:
$ srtcount subtitles.srt
25
We can save this as a script. It looks like this:
#! /usr/bin/bash
# Find all the lines of text in a subtitle file and count the words
# (lines that aren't just numbers; lines that don't contain the arrow symbol)
grep -E -v "(\-\->)|(^[0123456789]*$)" "$1" | wc -w
It's good practice to write some kind of comment that describes what the script does and roughly how it works.
Then we need to make it executable, put it in a place where the system looks for programs it can execute, and run it on an actual real subtitle file rather than the dummy file from this example:
$ srtcount Doctor-Who-The-Rescue-Episode-01-Audio-Description-Script.srt
907
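The "make it executable and put it on the path" steps might look like this (a sketch assuming the script is saved as srtcount; the ~/.local/bin location and the env-style shebang are my choices here, not the only options):

```shell
# Save the script, make it executable, and move it somewhere on the PATH.
# (~/.local/bin is one common per-user location; adjust to taste.)
cat > srtcount << 'EOF'
#!/usr/bin/env bash
# Find all the lines of text in a subtitle file and count the words
grep -E -v "(\-\->)|(^[0123456789]*$)" "$1" | wc -w
EOF
chmod +x srtcount
mkdir -p ~/.local/bin
mv srtcount ~/.local/bin/
```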
This all looks like complicated time-intensive voodoo, but it's just one step at a time, and then packaged in a way that makes all the work useful for the future.
Supplemental : Word Frequency
We have the word count, but what about finding which words are used, and how often? Here's a more complex example chaining a number of Linux text-manipulation utilities together.
The dummy file above yields a very boring frequency table, so I'll run it on another real file.
$ grep -E -o '[[:alpha:]]{2,}' Doctor-Who-The-Rescue-Episode-01-Audio-Description-Script.srt | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr
    104 the
     33 and
     19 of
     19 in
     13 vicki
     13 barbara
     12 on
     12 cave
     11 ian
     10 with
     10 doctor
     10 creature
      9 to
      9 she
      8 tardis
      8 it
      8 into
      8 door
      7 her
      7 from
      7 at
      6 out
      6 looks
      6 is
      6 he
      5 open
      5 his
      5 control
I truncated the table, but one thing that leaps out at me is that there are more occurrences of she/her than he/him. Interesting!
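A related question the same pipeline can answer: how many distinct words are there? Swapping the counting stages for sort -u gives the vocabulary size (a variation of my own, demonstrated here on an inline sample rather than the real script):

```shell
# Count distinct words of two or more letters, ignoring case.
printf 'The cave. The TARDIS. The cave door.\n' |
grep -E -o '[[:alpha:]]{2,}' |
tr '[:upper:]' '[:lower:]' |
sort -u |
wc -l
# prints: 4   (cave, door, tardis, the)
```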
References
This one was cobbled together from some questions on Stack Exchange:
- command that will take a file and separate each word so its on its own line
- Listing all words in a text file and finding the most frequent word