Using Linux Tools to get a Word Count of a Subtitle File
Using the GNU-Linux tools wc and grep to get a word-count and word-frequency table of a subtitle (.srt) file.
by Brett Coulstock.
I use subtitling software to write audio description cues for film and television. In essence, this means using the software to subtitle the gaps between the dialogue rather than the dialogue itself.
At the end of this process I have a .srt file that contains lines like this:
1
00:00:02,875 --> 00:00:03,916
The sun rises over a bleak, sandy desert.
2
00:00:07,833 --> 00:00:08,958
In the distance, kicking up dust and sand, a rider.
3
00:00:13,250 --> 00:00:14,291
It's a circus clown on a giraffe.
So each block consists of:
- The cue number on one line
- The in-point in hours, minutes, seconds, milliseconds, followed by an arrow --> followed by the out-point
- The text of the subtitle
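If you want to follow along, you could recreate the dummy file with a here-document (the filename subtitles.srt is just what I'll use in the examples):

```shell
# Recreate the three-cue example file using a here-document.
cat > subtitles.srt << 'EOF'
1
00:00:02,875 --> 00:00:03,916
The sun rises over a bleak, sandy desert.

2
00:00:07,833 --> 00:00:08,958
In the distance, kicking up dust and sand, a rider.

3
00:00:13,250 --> 00:00:14,291
It's a circus clown on a giraffe.
EOF
```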
At the end of a project with a large number of cues I was curious about the actual word count.
Now, the program I was using to do the describing doesn't have a facility for showing a word-count. However, the .srt file is just plain text. Linux (and other Unix-based operating systems) have a vast array of powerful tools for manipulating and reporting on plain text files.
First Try: Counting the Words
Linux has a utility called wc. The manual for wc describes the function of the program as: "print newline, word, and byte counts for each file". Running it on our subtitle file shows:
$ wc subtitles.srt
 11  37 226 subtitles.srt
So: 11 lines, 37 words, and 226 bytes. However, if we count the actual words in the file it comes to 25. wc is interpreting other text in the file as words.
We don't want that. We want just the text of the subtitle, not anything else.
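Incidentally, wc can also report each figure on its own: -l for lines, -w for words, -c for bytes. A quick sketch on a throwaway file (the filename is just for demonstration):

```shell
# Ask wc for a single figure instead of all three at once.
printf 'one two three\nfour five\n' > demo.txt
wc -l demo.txt   # number of lines:  2
wc -w demo.txt   # number of words:  5
wc -c demo.txt   # number of bytes: 24
```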
Finding the Words with Grep
We need to find the lines that contain the words we want to count. This is easy. They are:
- Lines that do not consist solely of one or more numbers.
- Lines that do not contain the arrow symbol -->
Grep is a tool that searches through text files for a pattern and prints those matching lines. For example:
$ grep "sand" subtitles.srt
The sun rises over a bleak, sandy desert.
In the distance, kicking up dust and sand, a rider.
$ grep "giraffe" subtitles.srt
It's a circus clown on a giraffe.
$ grep "2" subtitles.srt
00:00:02,875 --> 00:00:03,916
2
00:00:13,250 --> 00:00:14,291
Searching for "sand" returns two lines; searching for "giraffe" returns one; searching for "2" returns three lines.
When we defined our problem, we defined it in terms of the text we didn't want. So let's find all the things we don't want.
Finding the Cue Numbers
We're not interested in the cue-numbers. This command will return just the lines containing a cue-number. (It matches any line that consists solely of one or more numbers).
$ grep -E "^[1234567890]*$" subtitles.srt
1

2

3
So it returned all the cue-numbers. It also returned all the empty lines, because the pattern actually specifies a line consisting of zero or more numbers; an empty line contains zero numbers, so it matches. That's fine, because we don't want those lines either.
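As an aside, the digit list can be written more compactly as the range [0-9], and the empty-line quirk avoided by using + (one or more) instead of * (zero or more). A small demonstration on an inline sample (equivalent patterns, not the exact command used in the rest of this article):

```shell
# [0-9]+ matches lines of one or more digits; empty lines no longer match.
printf '1\n\nThe sun rises.\n42\n' | grep -E '^[0-9]+$'
# prints:
# 1
# 42
```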
Finding the Timing Lines
The other lines we're not interested in are the timings.
Unique to those lines is the arrow, the "-->" characters.
$ grep -E "\-\->" subtitles.srt
00:00:02,875 --> 00:00:03,916
00:00:07,833 --> 00:00:08,958
00:00:13,250 --> 00:00:14,291
Grep pulls the three lines out. If I'm ever interested in doing things with those lines, perhaps calculating the duration, I know grep can isolate them.
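For example, here's a sketch of that idea (an aside of my own, not part of the word-count task): piping a timing line through awk, splitting on colons, commas and spaces, turns it into a duration in milliseconds:

```shell
# Split "HH:MM:SS,mmm --> HH:MM:SS,mmm" on ":", "," and " ",
# then convert both timestamps to milliseconds and subtract.
echo "00:00:02,875 --> 00:00:03,916" |
awk -F '[:, ]' '{
    in_ms  = (($1 * 60 + $2) * 60 + $3) * 1000 + $4
    out_ms = (($6 * 60 + $7) * 60 + $8) * 1000 + $9
    print out_ms - in_ms " ms"
}'
# prints: 1041 ms
```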
Combining the Patterns
Now, to get all those lines we combine the two patterns.
$ grep -E "(\-\->)|(^[0123456789]*$)" subtitles.srt
1
00:00:02,875 --> 00:00:03,916

2
00:00:07,833 --> 00:00:08,958

3
00:00:13,250 --> 00:00:14,291
Hooray, we now have everything in the file that's not what we want.
Grep has a way of reversing itself. You can tell it to return, not all the matching lines, but all the lines that don't match. We just pass it the "-v" option.
$ grep -E -v "(\-\->)|(^[0123456789]*$)" subtitles.srt
The sun rises over a bleak, sandy desert.
In the distance, kicking up dust and sand, a rider.
It's a circus clown on a giraffe.
We now have all the words we want to count!
2nd Try: Counting the Words
We now have the words, but they are not in a file on disk; they are ephemeral, just printed on the screen.
Linux commands can be chained: they can take the output of one and feed it to another. In other words, we can take the lines isolated by grep and feed them to wc to count the words.
$ grep -E -v "(\-\->)|(^[0123456789]*$)" subtitles.srt | wc -w
25
25 words is right!
Saving the Command
Once you work out a useful command like that, one you might use again, it's a good idea to make it general, so that any time you want a word count of an .srt file you can just say:
$ srtcount subtitles.srt
25
We can save this as a script. It looks like this:
#! /usr/bin/bash
# Find all the lines of text in a subtitle file and count the words
# (lines that aren't just numbers; lines that don't contain the arrow symbol)
grep -E -v "(\-\->)|(^[0123456789]*$)" "$1" | wc -w
It's good practice to write some kind of comment that describes what the script does and roughly how it works.
Then we need to make it executable, put it in a place where the system looks for programs it can execute, and run it on an actual real subtitle file rather than the dummy file from this example:
$ srtcount Doctor-Who-The-Rescue-Episode-01-Audio-Description-Script.srt
907
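The "make it executable and put it on the path" steps might look like this (a sketch assuming the script is saved as srtcount; the ~/.local/bin location and the env-style shebang are my choices here, not the only options):

```shell
# Save the script, make it executable, and move it somewhere on the PATH.
# (~/.local/bin is one common per-user location; adjust to taste.)
cat > srtcount << 'EOF'
#!/usr/bin/env bash
# Find all the lines of text in a subtitle file and count the words
grep -E -v "(\-\->)|(^[0123456789]*$)" "$1" | wc -w
EOF
chmod +x srtcount
mkdir -p ~/.local/bin
mv srtcount ~/.local/bin/
```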
This all looks like complicated time-intensive voodoo, but it's just one step at a time, and then packaged in a way that makes all the work useful for the future.
Supplemental : Word Frequency
We have the word count, but what about finding which words are used, and how often? Here's a more complex example chaining a number of Linux text-manipulation utilities together.
The dummy file above yields a very boring frequency table, so I'll run it on another real file.
$ grep -E -o '[[:alpha:]]{2,}' Doctor-Who-The-Rescue-Episode-01-Audio-Description-Script.srt | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr
    104 the
     33 and
     19 of
     19 in
     13 vicki
     13 barbara
     12 on
     12 cave
     11 ian
     10 with
     10 doctor
     10 creature
      9 to
      9 she
      8 tardis
      8 it
      8 into
      8 door
      7 her
      7 from
      7 at
      6 out
      6 looks
      6 is
      6 he
      5 open
      5 his
      5 control
I truncated the table, but one thing that leaps out at me is that there are more occurrences of she/her than he/him. Interesting!
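A related question the same pipeline can answer: how many distinct words are there? Swapping the counting stages for sort -u gives the vocabulary size (a variation of my own, demonstrated here on an inline sample rather than the real script):

```shell
# Count distinct words of two or more letters, ignoring case.
printf 'The cave. The TARDIS. The cave door.\n' |
grep -E -o '[[:alpha:]]{2,}' |
tr '[:upper:]' '[:lower:]' |
sort -u |
wc -l
# prints: 4   (cave, door, tardis, the)
```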
References
This one was cobbled together from some questions on Stack Exchange:
- command that will take a file and separate each word so its on its own line
- Listing all words in a text file and finding the most frequent word