Filtering lines

Franklin Bristow

Filter lines from large text files

Filter lines from large text files using patterns.

We’re going to switch away from find, but we’re going to stick with the general theme of using patterns. Another relatively common task is to find lines within a plain text file that match a certain pattern.

Depending on which courses you’re taking or have taken, this might be a task where you would consider using a loop, but we’re going to take advantage of a tool that’s available on the command line: grep.

What on earth does grep mean?!

Computer Scientists and programmers are bad at naming things (maybe because it’s a “hard thing”). Maybe it is hard, but we’re still bad at naming things, and we should feel bad.

I’m digressing. The name grep comes from a command in the ed editor that would “globally search for a regular expression and print matching lines”.

An unfortunate theme among useful programs in Unix and Unix-like operating systems (like Linux) is that their names aren’t discoverable.

Just finding the lines that match a pattern in a file can be useful, but we’re going to look at a few options for grep that can help give some additional information or context about what we’re looking for:

Basic use: printing out matching lines.
An option to count the number of matching lines.
An option to show lines around the line we’re looking for.
An option to search files recursively.
An option to print the line number that matches the pattern.

Getting some data

Let’s start with a large text file.

The example we’re going to be using here is genetic sequence data. Similar to how Computer Scientists and programmers are bad at naming things, biologists and microbiologists are bad at storing things and they use plain text formats to store genetic sequence data. That’s actually really convenient for us because it gives a realistic data set to work with.

Don’t worry: you don’t need to be a biologist or microbiologist to follow along.

Download this file from NCBI to your user directory on Aviary (use wget or curl):

https://ftp.ncbi.nlm.nih.gov/genomes/Viruses/MonkeyPox.fn

Once you’ve downloaded the file, feel free to take a look at it using your preferred text editor on Aviary (e.g., vim, nano, emacs). This is a FASTA-formatted file. A FASTA-formatted file contains 1 or more “records”, where a record will have a unique identifier that’s meaningful to a biologist or microbiologist, and then the sequence data that corresponds to that identifier. Records look like this:

>unique identifier
CTCTTTCTCTCTTCGATGGGTCTCACAAAAATATTAAACCTCTTTCTGATGGAGTCGTAAAAAGTTTTTA
TCCTTTCTCTCTTCGA

The file you just downloaded is about 15MB in size (there are about 15 million characters in this file), so doing things like counting records is not something we want to do by hand.

Basic use

Let’s start using grep to filter and print out lines that match a certain pattern.

From our crash course on FASTA-formatted files, we know that records have unique identifiers, and the lines with unique identifiers contain or start with the > character. Let’s use that as our filter:

grep '>' MonkeyPox.fn # print all lines in MonkeyPox.fn that
                      # contain the > character

This will print out a bunch of lines (we’ll find out how many real soon!), and all the lines contain the pattern >.

We can actually be more precise with what we want by using an “anchor” for our pattern. Records start with lines where the first character on the line is >. In most FASTA-formatted files, the only place where the > character appears is on the unique identifier line, but it’s possible for it to appear in other places, too.

grep '^>' MonkeyPox.fn # print all lines in MonkeyPox.fn that
                       # **start with** the > character

The ^ (caret, though I prefer “hat”) is an “anchor”: it means “at the start of the line”, so the pattern only matches when > is the very first character on the line.
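If you'd like to see the difference between the two patterns without scrolling through the whole download, here's a minimal sketch on a tiny made-up file. The file name sample.fn, the record names, and the "note" line are all assumptions for illustration (a real FASTA file wouldn't normally have a line like that); the note line is only there so that > shows up somewhere other than the start of a line.

```shell
# Build a tiny, made-up FASTA-like file (sample.fn is a hypothetical name).
printf '%s\n' \
  '>record_one' \
  'CTCTTTCTCTCTTCGATGGG' \
  'note: A>G at position 3' \
  '>record_two' \
  'TCCTTTCTCTCTTCGA' > sample.fn

grep '>' sample.fn    # prints 3 lines: both identifiers AND the note line
grep '^>' sample.fn   # prints 2 lines: only the identifiers
```

The unanchored pattern is happy to match > anywhere on the line, while the anchored one only accepts it as the first character.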

Counting lines

Seeing the lines that match the pattern is useful, but we may also want to know other stuff, like how many lines matched the pattern. Thankfully, grep has an option to help us with that: -c.

grep -c '^>' MonkeyPox.fn # count the lines in MonkeyPox.fn that
                          # start with the > character

This prints out only a number, and the number represents how many lines matched the pattern.

For MonkeyPox.fn, this tells us how many records are in this file.
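One thing worth checking for yourself: -c counts matching lines, not the number of times the pattern appears. Here's a small sketch using a made-up file (letters.txt is a hypothetical name, not part of the course data):

```shell
# Three short lines; the letter A appears 5 times in total.
printf '%s\n' 'AAA' 'CCC' 'ACA' > letters.txt

grep -c 'A' letters.txt   # prints 2: only two LINES contain an A
```

That's exactly what we want for counting records, because each record has exactly one identifier line.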

Showing lines around the matching lines

We can ask grep to find lines that match patterns, and we can also ask grep to show us the lines that are around (before and after) the line that matches the pattern. grep calls this “context”.

We can print the lines that match the pattern, plus the lines immediately after those using the -A option (after):

grep -A 2 '^>' MonkeyPox.fn # print out the record identifier and
                            # 2 lines of sequence data after it.

Similarly, we can print the lines that match the pattern, plus the lines immediately before those using the -B option (before):

grep -B 2 '^>' MonkeyPox.fn # print out the record identifier and
                            # the 2 lines before it (the end of the
                            # previous record's sequence data).

We can do both at the same time with the -C option (this is upper-case C, for “context”):

grep -C 2 '^>' MonkeyPox.fn # print out the record identifier, plus
                            # the 2 lines before and the 2 lines after it.
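Context output is easier to read on a small file first. This is a minimal sketch that assumes GNU grep (which is what you'd normally find on a Linux machine); the file name sample.fn and its contents are made up for the example.

```shell
# Two made-up records with two lines of "sequence" each.
printf '%s\n' \
  '>record_one' \
  'AAAA' \
  'CCCC' \
  '>record_two' \
  'GGGG' \
  'TTTT' > sample.fn

grep -A 1 '^>' sample.fn
# prints:
# >record_one
# AAAA
# --
# >record_two
# GGGG
```

The -- line isn't in the file. grep adds it to separate groups of output that aren't next to each other in the file (here, CCCC was skipped between the two groups).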

Filtering recursively

Sometimes you want to search many files in the same directory for a pattern, either because you don’t know which file contains the lines you’re looking for, or because your data is spread across many files.

We’ve seen a “recursive” command before in week 3:

rm -r hello # recursively remove hello and everything within it

The grep command also has a recursive option, and it’s also -r!

Switch back to crazy-directories. We were able to use find to help us find files that have names matching a pattern. Now we want to use grep to find files that contain a specific pattern.

Emoji short codes all follow the same pattern: A colon, followed by some characters, followed by another colon. Here are some emoji and their short codes:

:banana: 🍌
:robot: 🤖
:sparkles: ✨

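We can use grep recursively to find all files that contain one or more lines matching the pattern :.*: (a colon, followed by any number of characters, followed by another colon). Unlike the patterns we used with find, grep patterns are regular expressions: . means “any single character” and * means “zero or more of the thing right before it”. On its own, :*: would mean “zero or more colons, followed by a colon”, which matches any line that contains a colon at all.

grep -r ":.*:" # note no filename!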

Depending on the state your crazy-directories is in, this is probably going to print out a few more files than you expected, including .docx files.

Let’s talk about what .docx files are: they’re secretly a .zip file. Remember how to unzip .zip files? That’s right! unzip!

Go ahead, change into a directory containing one of the .docx files you created in crazy-directories and unzip it:

unzip robot.md.docx

There are going to be a bunch of new files in the directory that are mostly .xml files. XML is a “markup” language, sort of like Markdown.

Try running grep recursively again and you’ll see that we’re not just matching emoji short codes anymore, but a bunch of weird-looking XML. Neat 📷.
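If all that extra output is getting in the way, one option is a tighter pattern. The sketch below is an assumption-heavy example: it pretends short codes are made only of lowercase letters and underscores (real ones can also contain digits and a few other characters), and the grep-demo directory and everything in it is made up so you can try it without touching crazy-directories.

```shell
# Throwaway directory tree; every name below is made up for this sketch.
mkdir -p grep-demo/notes/deep
printf '%s\n' 'I like :banana: a lot' > grep-demo/notes/fruit.md
printf '%s\n' 'time: 10:30'           > grep-demo/notes/deep/log.txt
printf '%s\n' 'nothing to see here'   > grep-demo/notes/deep/plain.txt

# A colon, one or more lowercase letters or underscores, then a colon.
grep -r ':[a-z_][a-z_]*:' grep-demo
# prints: grep-demo/notes/fruit.md:I like :banana: a lot
```

The line in log.txt has plenty of colons, but none of them are wrapped around a run of letters, so it isn't printed.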

Printing matching line numbers

Knowing the name of the file containing the line that matches your pattern is often enough, but grep can help you out a little more by telling you exactly the line number that matched the pattern. You can ask grep to tell you this using the -n option.

If you left the root of crazy-directories, change back to it.

Let’s find all files that contain the pattern :.*: (a colon, any number of characters, another colon) again, but this time we’ll ask grep to print out the line numbers:

grep -rn ":.*:"

# OR

grep -r -n ":.*:"

Some command line tools will allow you to combine options after a single - (so -r -n turns into -rn).
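Here's a small sketch of what the output looks like. The numbered-demo directory and robot.md file are made up for this example; with -r and -n together, each line of output is the file name, then the line number, then the matching line, separated by colons.

```shell
# A made-up directory with one three-line file in it.
mkdir -p numbered-demo
printf '%s\n' 'first line' 'second :robot: line' 'third line' \
  > numbered-demo/robot.md

grep -rn ':robot:' numbered-demo
# prints: numbered-demo/robot.md:2:second :robot: line

grep -r -n ':robot:' numbered-demo   # same output, options written separately
```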