--- title: "Filtering lines" author: Franklin Bristow --- Filter lines from large text files ================================== ::: outcomes * [X] Filter lines from large text files using patterns. ::: We're going to switch away from `find`, but we're going to stick with the general theme of using patterns. Another relatively common task is to find lines *within* a plain text file that match a certain pattern. Depending on which courses you're taking or have taken, this might be a task where you would consider using a loop, but we're going to take advantage of a tool that's available on the command line: `grep`. ::: aside What on earth does `grep` mean?! Computer Scientists and programmers are bad at naming things (maybe because it's a "[hard thing]"). Maybe it is hard, but we're still bad at naming things, and [we should feel bad]. I'm digressing. The name `grep` [comes from a command in the `ed` editor] that would "**g**lobally search for a **r**egular **e**pression and **p**rint matching lines". An unfortunate theme among useful programs in Unix and Unix-like operating systems (like Linux) is that their names aren't discoverable. [hard thing]: https://www.martinfowler.com/bliki/TwoHardThings.html [we should feel bad]: https://youtu.be/4mcD5jd-RAU?t=34 [comes from a command in the `ed` editor]: https://en.wikipedia.org/wiki/Grep ::: Just finding the lines that match a pattern in a file can be useful, but we're going to look at a few options for `grep` that can help give some additional information or context about what we're looking for: * Basic use: printing out matching lines. * An option to count the number of matching lines. * An option to show lines around the line we're looking for. * An option to search files recursively. * An option to print the line number that matches the pattern. Getting some data ----------------- Let's start with a large text file. ::: aside The example we're going to be using here is genetic sequence data. Similar to how Computer Scientists and programmers are bad at naming things, biologists and microbiologists are bad at storing things and they use plain text formats to store genetic sequence data. That's actually really convenient for us because it gives a realistic data set to work with. Don't worry: you don't need to be a biologist or microbiologist to follow along. ::: Download this file from [NCBI] to your user directory on Aviary (use `wget` or `curl`): https://ftp.ncbi.nlm.nih.gov/genomes/Viruses/MonkeyPox.fn Once you've downloaded the file, feel free to take a look at it using your preferred text editor on Aviary (e.g., `vim`, `nano`, `emacs`). This is a [FASTA-formatted file]. A FASTA-formatted file contains 1 or more "records", where a record will have a unique identifier that's meaningful to a biologist or microbiologist, and then the sequence data that corresponds to that identifier. Records look like this: ``` >unique identifier CTCTTTCTCTCTTCGATGGGTCTCACAAAAATATTAAACCTCTTTCTGATGGAGTCGTAAAAAGTTTTTA TCCTTTCTCTCTTCGA ``` The file you just downloaded is about 15MB in size (there are about 15 million characters in this file), so doing things like counting records is not something we want to do by hand. [NCBI]: https://www.ncbi.nlm.nih.gov/ [FASTA-formatted file]: https://en.wikipedia.org/wiki/FASTA_format Basic use --------- Let's start using `grep` to filter and print out lines that match a certain pattern. ::: example From our crash course on FASTA-formatted files, we know that records have unique identifiers, and the lines with unique identifiers contain or start with the `>` character. Let's use that as our filter: ```bash grep '>' MonkeyPox.fn # print all lines in MonkeyPox.fn that # contain the > character ``` This will print out a bunch of lines (we'll find out how many real soon!), and all the lines contain the pattern `>`. We can actually be more precise with what we want by using an "anchor" for our pattern. Records start with lines where the **first character** on the line is `>`. In most FASTA-formatted files, the only place where the `>` character appears is on the unique identifier line, but it's possible for it to appear in other places, too. ```bash grep '^>' MonkeyPox.fn # print all lines in MonkeyPox.fn that # **start with** the > character ``` The `^` (caret, I prefer "hat") is an "anchor": "From the start of the line". ::: Counting lines -------------- Seeing the lines that match the pattern is useful, but we may also want to know other stuff, like how many lines matched the pattern. Thankfully, `grep` has an option to help us with that: `-c`. ::: example ```bash grep -c '^>' MonkeyPox.fn # count the lines in MonkeyPox.fn that # start with the > character ``` This prints out *only a number*, and the number represents how many lines matched the pattern. For `MonkeyPox.fn`, this tells us how many records are in this file. ::: Showing lines around the matching lines --------------------------------------- We can ask `grep` to find lines that match patterns, and we can also ask `grep` to show us the lines that are around (before and after) the line that matches the pattern. `grep` calls this "context". ::: example We can print the lines that match the pattern, plus the lines immediately after those using the `-A` option (after): ```bash grep -A 2 '^>' MonkeyPox.fn # print out the record identifier and # 2 lines of sequence data after it. ``` Similarly, we can print the lines that match the pattern, plus the lines immediately *before* those using the `-B` option (before): ```bash grep -B 2 '^>' MonkeyPox.fn # print out the record identifier and # 2 lines of sequence data before it. ``` We can do both at the same time with the `-C` option (this is upper-case C, for "context"): ```bash grep -C 2 '^>' MonkeyPox.fn # print out the record identifer and both # 2 lines before and after it. ``` ::: Filtering recursively --------------------- Sometimes you want to search many files in the same directory for a pattern, either because you don't know which file contains the lines you're looking for, or because your data is spread across many files. We've seen a "recursive" command before in week 3: ```bash rm -r hello # recursively remove hello and everything within it ``` The `grep` command also has a recursive option, and it's also `-r`! ::: example Switch back to `crazy-directories`. We were able to use `find` to help us find files that have names matching a pattern. Now we want to use `grep` to find files that contain a specific pattern. Emoji short codes all follow the same pattern: A colon, followed by some characters, followed by another colon. Here are some emoji and their short codes: * `:banana:` :banana: * `:robot:` :robot: * `:sparkles:` :sparkles: We can use `grep` recursively to find all files that contain one or more lines matching the pattern `:*:` (a colon, followed by any number of characters, followed by another colon). ```bash grep -r ":*:" # note no filename! ``` Depending on the state you're in, this is probably going to print out a few more files than you expected, including `.docx` files. ::: aside Let's talk about what `.docx` files are: they're secretly a `.zip` file. Remember how to `unzip` `.zip` files? That's right! `unzip`! Go ahead, change into a directory containing one of the `.docx` files you created in `crazy-directories` and `unzip` it: ```bash unzip robot.md.docx ``` There are going to be a bunch of new files in the directory that are mostly `.xml` files. XML is a "markup" language, sort of like Markdown. Try running `grep` again recursively on this again and you'll see that we're not just matching emoji short codes anymore, but a bunch of weird looking XML. Neat :camera:. ::: ::: Printing matching line numbers ------------------------------ Knowing the name of the file containing the line that matches your pattern is often enough, but `grep` can help you out a little more by telling you exactly the line number that matched the pattern. You can ask `grep` to tell you this using the `-n` option. ::: example If you left the root of `crazy-directories`, change back to it. Let's find all files that contain the pattern `:*:` again, but we'll ask for `grep` to print out the line numbers: ```bash grep -rn ":*:" # OR grep -r -n ":*:" ``` *Some* command line tools will allow you to combine options after a single `-` (so `-r -n` turns into `-rn`). :::