This lesson is my customized version of the Software Carpentry lesson on The Unix Shell. Please see the original lesson for a more detailed walkthrough.
For our purposes the shell is ‘that box where you type in commands’. It is a command-line interface that looks like DOS (if you’re old enough to remember DOS). It has many names and ‘flavours’; Windows users have the Command Prompt, while Mac and Unix/Linux users have the terminal. In this workshop we will learn the Bash shell because it is the most popular shell language and is the default on Unix, Linux, and macOS (and can easily be installed on Windows machines).
Computer scientists will sometimes give the argument that ‘everyone’ should learn to use the shell because that ‘came first’. That argument does not fly with me! I am not a computer scientist and if it does not help me with my own research, I do not care!
We are here learning about the shell, so it must be helpful, and here’s why:
Learning Objectives
- Create a directory hierarchy that matches a given diagram.
- Create files in that hierarchy using an editor or by copying and renaming existing files.
- Display the contents of a directory using the command line.
- Delete specified files and/or directories.
Let’s create a new directory called thesis
using the command mkdir thesis
(which has no output):
$ mkdir thesis
As you might (or might not) guess from its name, mkdir
means “make directory”. Since thesis
is a relative path (i.e., doesn’t have a leading slash), the new directory is created in the current working directory:
$ ls -F
creatures/ north-pacific-gyre/ thesis/
data/ notes.txt writing/
Desktop/ pizza.cfg
molecules/ solar.pdf
However, there’s nothing in it yet:
$ ls -F thesis
Let’s change our working directory to thesis
using cd
, then run a text editor called Nano to create a file called draft.txt
:
$ cd thesis
$ nano draft.txt
Let’s type in a few lines of text. Once we’re happy with our text, we can press Ctrl-O (press the Ctrl key and, while holding it down, press the O key) to write our data to disk.
Once our file is saved, we can use Ctrl-X to quit the editor and return to the shell.
nano
doesn’t leave any output on the screen after it exits, but ls
now shows that we have created a file called draft.txt
:
$ ls
draft.txt
Let’s tidy up by running rm draft.txt
:
$ rm draft.txt
This command removes files (rm
is short for “remove”). If we run ls
again, its output is empty once more, which tells us that our file is gone:
$ ls
WARNING:Deleting Is Forever
The Unix shell doesn’t have a trash bin that we can recover deleted files from (though most graphical interfaces to Unix do). Instead, when we delete files, they are unhooked from the file system so that their storage space on disk can be recycled. Tools for finding and recovering deleted files do exist, but there’s no guarantee they’ll work in any particular situation, since the computer may recycle the file’s disk space right away.
DO NOT EVER USE THIS COMMAND
$ rm -rf /
I was nervous just typing it in this document! It is important to take the time to understand why this will delete your entire hard drive. The rm command will delete everything in /, which is your root directory and therefore contains everything. The -rf part combines two separate flags: -r, which makes the deletion recursive through all subdirectories, and -f, which forces deletion (including of read-only files) without asking for confirmation.
Let’s re-create that file and then move up one directory to /Users/nelle
using cd ..
:
$ pwd
/Users/nelle/thesis
$ nano draft.txt
$ ls
draft.txt
$ cd ..
If we try to remove the entire thesis
directory using rm thesis
, we get an error message:
$ rm thesis
rm: cannot remove `thesis': Is a directory
This happens because rm
only works on files, not directories. The right command is rmdir
, which is short for “remove directory”. It doesn’t work yet either, though, because the directory we’re trying to remove isn’t empty:
$ rmdir thesis
rmdir: failed to remove `thesis': Directory not empty
This little safety feature can save you a lot of grief, particularly if you are a bad typist. To really get rid of thesis
we must first delete the file draft.txt
:
$ rm thesis/draft.txt
The directory is now empty, so rmdir
can delete it:
$ rmdir thesis
With Great Power Comes Great Responsibility
Removing the files in a directory just so that we can remove the directory quickly becomes tedious. Instead, we can use rm with the -r flag (which stands for “recursive”):

$ rm -r thesis
This removes everything in the directory, then the directory itself. If the directory contains sub-directories,
rm -r
does the same thing to them, and so on. It’s very handy, but can do a lot of damage if used without care.
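If you want the recursive removal but with a safety net, -r can be combined with -i so that rm asks before each deletion. A minimal sketch (the prompts shown are typical of GNU rm and may be worded differently on your system):

$ rm -r -i thesis
rm: descend into directory 'thesis'? y
rm: remove regular file 'thesis/draft.txt'? y
rm: remove directory 'thesis'? y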
Let’s create that directory and file one more time. (Note that this time we’re running nano
with the path thesis/draft.txt
, rather than going into the thesis
directory and running nano
on draft.txt
there.)
$ pwd
/Users/nelle
$ mkdir thesis
$ nano thesis/draft.txt
$ ls thesis
draft.txt
draft.txt
isn’t a particularly informative name, so let’s change the file’s name using mv
, which is short for “move”:
$ mv thesis/draft.txt thesis/quotes.txt
The first parameter tells mv
what we’re “moving”, while the second is where it’s to go. In this case, we’re moving thesis/draft.txt
to thesis/quotes.txt
, which has the same effect as renaming the file. Sure enough, ls
shows us that thesis
now contains one file called quotes.txt
:
$ ls thesis
quotes.txt
One has to be careful when specifying the target file name, since mv
will silently overwrite any existing file with the same name, which could lead to data loss. An additional flag, mv -i
(or mv --interactive
), can be used to make mv
ask you for confirmation before overwriting.
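For example, if thesis contained both a draft.txt and an existing quotes.txt, a minimal sketch of the interactive behaviour would be (the exact prompt wording varies between systems; answering n leaves both files untouched):

$ mv -i thesis/draft.txt thesis/quotes.txt
mv: overwrite 'thesis/quotes.txt'? n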
Just for the sake of inconsistency, mv
also works on directories — there is no separate mvdir
command.
Let’s move quotes.txt
into the current working directory. We use mv
once again, but this time we’ll just use the name of a directory as the second parameter to tell mv
that we want to keep the filename, but put the file somewhere new. (This is why the command is called “move”.) In this case, the directory name we use is the special directory name .
that we mentioned earlier.
$ mv thesis/quotes.txt .
The effect is to move the file from the directory it was in to the current working directory. ls
now shows us that thesis
is empty:
$ ls thesis
Further, ls
with a filename or directory name as a parameter only lists that file or directory. We can use this to see that quotes.txt
is still in our current directory:
$ ls quotes.txt
quotes.txt
The cp
command works very much like mv
, except it copies a file instead of moving it. We can check that it did the right thing using ls
with two paths as parameters — like most Unix commands, ls
can be given thousands of paths at once:
$ cp quotes.txt thesis/quotations.txt
$ ls quotes.txt thesis/quotations.txt
quotes.txt thesis/quotations.txt
To prove that we made a copy, let’s delete the quotes.txt
file in the current directory and then run that same ls
again.
$ rm quotes.txt
$ ls quotes.txt thesis/quotations.txt
ls: cannot access quotes.txt: No such file or directory
thesis/quotations.txt
This time it tells us that it can’t find quotes.txt
in the current directory, but it does find the copy in thesis
that we didn’t delete.
What’s In A Name?
You may have noticed that all of Nelle’s files’ names are “something dot something”, and in this part of the lesson, we always used the extension
.txt
. This is just a convention: we can call a file mythesis or almost anything else we want. However, most people use two-part names most of the time to help them (and their programs) tell different kinds of files apart. The second part of such a name is called the filename extension, and indicates what type of data the file holds: .txt signals a plain text file, .cfg is a configuration file full of parameters for some program or other, .png is a PNG image, and so on. This is just a convention, albeit an important one. Files contain bytes: it’s up to us and our programs to interpret those bytes according to the rules for plain text files, PDF documents, configuration files, images, and so on.
Naming a PNG image of a whale as
whale.mp3
doesn’t somehow magically turn it into a recording of whalesong, though it might cause the operating system to try to open it with a music player when someone double-clicks it.
Renaming files
Suppose that you created a
.txt
file in your current directory to contain a list of the statistical tests you will need to do to analyze your data, and named it: statstics.txt
After creating and saving this file you realize you misspelled the filename! You want to correct the mistake; which of the following commands could you use to do so?
cp statstics.txt statistics.txt
mv statstics.txt statistics.txt
mv statstics.txt .
cp statstics.txt .
Moving and Copying
What is the output of the closing
ls
command in the sequence shown below?

$ pwd
/Users/jamie/data
$ ls
proteins.dat
$ mkdir recombine
$ mv proteins.dat recombine
$ cp recombine/proteins.dat ../proteins-saved.dat
$ ls
proteins-saved.dat recombine
recombine
proteins.dat recombine
proteins-saved.dat
Organizing Directories and Files
Jamie is working on a project and she sees that her files aren’t very well organized:
$ ls -F
analyzed/ fructose.dat raw/ sucrose.dat
The
fructose.dat
and sucrose.dat
files contain output from her data analysis. What command(s) covered in this lesson does she need to run so that the commands below will produce the output shown?

$ ls -F
analyzed/ raw/
$ ls analyzed
fructose.dat sucrose.dat
Copy with Multiple Filenames
What does
cp
do when given several filenames and a directory name, as in:

$ mkdir backup
$ cp thesis/citations.txt thesis/quotations.txt backup
What does
cp
do when given three or more filenames, as in:

$ ls -F
intro.txt methods.txt survey.txt
$ cp intro.txt methods.txt survey.txt
Listing Recursively and By Time
The command
ls -R
lists the contents of directories recursively, i.e., lists their sub-directories, sub-sub-directories, and so on in alphabetical order at each level. The command ls -t
lists things by time of last change, with most recently changed files or directories first. In what order does ls -R -t
display things?
Learning Objectives
- Redirect a command’s output to a file.
- Process a file instead of keyboard input using redirection.
- Construct command pipelines with two or more stages.
- Explain what usually happens if a program or pipeline isn’t given any input to process.
- Explain Unix’s “small pieces, loosely joined” philosophy.
Now that we know a few basic commands, we can finally look at the shell’s most powerful feature: the ease with which it lets us combine existing programs in new ways. We’ll start with a directory called molecules
that contains six files describing some simple organic molecules. The .pdb
extension indicates that these files are in Protein Data Bank format, a simple text format that specifies the type and position of each atom in the molecule.
$ ls molecules
cubane.pdb ethane.pdb methane.pdb
octane.pdb pentane.pdb propane.pdb
Let’s go into that directory with cd
and run the command wc *.pdb
. wc
is the “word count” command: it counts the number of lines, words, and characters in files. The *
in *.pdb
matches zero or more characters, so the shell turns *.pdb
into a list of all .pdb
files in the current directory:
$ cd molecules
$ wc *.pdb
20 156 1158 cubane.pdb
12 84 622 ethane.pdb
9 57 422 methane.pdb
30 246 1828 octane.pdb
21 165 1226 pentane.pdb
15 111 825 propane.pdb
107 819 6081 total
Wildcards
* is a wildcard. It matches zero or more characters, so *.pdb matches ethane.pdb, propane.pdb, and every file that ends with ‘.pdb’. On the other hand, p*.pdb only matches pentane.pdb and propane.pdb, because the ‘p’ at the front only matches filenames that begin with the letter ‘p’.
? is also a wildcard, but it only matches a single character. This means that p?.pdb matches pi.pdb or p5.pdb, but not propane.pdb. We can use any number of wildcards at a time: for example, p*.p?* matches anything that starts with a ‘p’ and ends with ‘.’, ‘p’, and at least one more character (since the ? has to match one character, and the final * can match any number of characters). Thus, p*.p?* would match preferred.practice, and even p.pi (since the first * can match no characters at all), but not quality.practice (doesn’t start with ‘p’) or preferred.p (there isn’t at least one character after the ‘.p’).

When the shell sees a wildcard, it expands the wildcard to create a list of matching filenames before running the command that was asked for. As an exception, if a wildcard expression does not match any file, Bash will pass the expression as a parameter to the command as it is. For example, typing ls *.pdf in the molecules directory (which contains only files with names ending with .pdb) results in an error message that there is no file called *.pdf. Generally, commands such as wc and ls see the lists of file names matching these expressions, but not the wildcards themselves. It is the shell, not the other programs, that deals with expanding wildcards, and this is another example of orthogonal design.
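For example, in the molecules directory (the exact wording of the error message depends on your version of ls):

$ ls p*.pdb
pentane.pdb propane.pdb
$ ls *.pdf
ls: cannot access *.pdf: No such file or directory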
Challenge: Using wildcards
When run in the
molecules
directory, whichls
command will produce this output?
ethane.pdb methane.pdb
ls *t*ane.pdb
ls *t?ne.*
ls *t??ne.pdb
ls ethane.*
If we run wc -l
instead of just wc
, the output shows only the number of lines per file:
$ wc -l *.pdb
20 cubane.pdb
12 ethane.pdb
9 methane.pdb
30 octane.pdb
21 pentane.pdb
15 propane.pdb
107 total
We can also use -w
to get only the number of words, or -c
to get only the number of characters.
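For example, counting only characters gives the third column of the wc output we saw above (column spacing may look slightly different on your system):

$ wc -c *.pdb
1158 cubane.pdb
622 ethane.pdb
422 methane.pdb
1828 octane.pdb
1226 pentane.pdb
825 propane.pdb
6081 total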
Which of these files is shortest? It’s an easy question to answer when there are only six files, but what if there were 6000? Our first step toward a solution is to run the command:
$ wc -l *.pdb > lengths.txt
The greater than symbol, >
, tells the shell to redirect the command’s output to a file instead of printing it to the screen. (This is why there is no screen output: everything that wc
would have printed has gone into the file lengths.txt
instead.) The shell will create the file if it doesn’t exist. If the file exists, it will be silently overwritten, which may lead to data loss and thus requires some caution.
We can now send the content of lengths.txt
to the screen using cat lengths.txt
. cat
stands for “concatenate”: it prints the contents of files one after another. There’s only one file in this case, so cat
just shows us what it contains:
$ cat lengths.txt
20 cubane.pdb
12 ethane.pdb
9 methane.pdb
30 octane.pdb
21 pentane.pdb
15 propane.pdb
107 total
Now let’s use the sort
command to sort its contents. We will also use the -n
flag to specify that the sort is numerical instead of alphabetical. This does not change the file; instead, it sends the sorted result to the screen:
$ sort -n lengths.txt
9 methane.pdb
12 ethane.pdb
15 propane.pdb
20 cubane.pdb
21 pentane.pdb
30 octane.pdb
107 total
We can put the sorted list of lines in another temporary file called sorted-lengths.txt
by putting > sorted-lengths.txt
after the command, just as we used > lengths.txt
to put the output of wc
into lengths.txt
. Once we’ve done that, we can run another command called head
to get the first few lines in sorted-lengths.txt
:
$ sort -n lengths.txt > sorted-lengths.txt
$ head -n 1 sorted-lengths.txt
9 methane.pdb
Using the parameter -n 1
with head
tells it that we only want the first line of the file; -n 20
would get the first 20, and so on. Since sorted-lengths.txt
contains the lengths of our files ordered from least to greatest, the output of head
must be the file with the fewest lines.
If you think this is confusing, you’re in good company: even once you understand what wc
, sort
, and head
do, all those intermediate files make it hard to follow what’s going on.
We can make it easier to understand by running sort
and head
together:
$ sort -n lengths.txt | head -n 1
9 methane.pdb
The vertical bar, |
, between the two commands is called a pipe. It tells the shell that we want to use the output of the command on the left as the input to the command on the right. The computer might create a temporary file if it needs to, or copy data from one program to the other in memory, or something else entirely; we don’t have to know or care.
Nothing prevents us from chaining pipes consecutively. That is, we can for example send the output of wc
directly to sort
, and then the resulting output to head
. Thus we first use a pipe to send the output of wc
to sort
:
$ wc -l *.pdb | sort -n
9 methane.pdb
12 ethane.pdb
15 propane.pdb
20 cubane.pdb
21 pentane.pdb
30 octane.pdb
107 total
And now we send the output of this pipe, through another pipe, to head
, so that the full pipeline becomes:
$ wc -l *.pdb | sort -n | head -n 1
9 methane.pdb
Challenge: Piping commands together
In our current directory, we want to find the 3 files which have the least number of lines. Which command listed below would work?
wc -l * > sort -n > head -n 3
wc -l * | sort -n | head -n 1-3
wc -l * | head -n 3 | sort -n
wc -l * | sort -n | head -n 3
Challenge: Pipe reading comprehension
A file called
animals.txt
contains the following data:

2012-11-05,deer
2012-11-05,rabbit
2012-11-05,raccoon
2012-11-06,rabbit
2012-11-06,deer
2012-11-06,fox
2012-11-07,rabbit
2012-11-07,bear
What text passes through each of the pipes and the final redirect in the pipeline below?
cat animals.txt | head -n 5 | tail -n 3 | sort -r > final.txt
Loops are key to productivity improvements through automation as they allow us to execute commands repetitively. Similar to wildcards and tab completion, using loops also reduces the amount of typing (and typing mistakes). Suppose we have several hundred genome data files named basilisk.dat
, unicorn.dat
, and so on. In this example, we’ll use the creatures
directory which only has two example files, but the principles can be applied to many many more files at once. We would like to modify these files, but also save a version of the original files, naming the copies original-basilisk.dat
and original-unicorn.dat
. We can’t use:
$ cp *.dat original-*.dat
because that would expand to:
$ cp basilisk.dat unicorn.dat original-*.dat
This wouldn’t back up our files; instead, we get an error:
cp: target `original-*.dat' is not a directory
This problem arises when cp
receives more than two inputs. When this happens, it expects the last input to be a directory where it can copy all the files it was passed. Since there is no directory named original-*.dat
in the creatures
directory we get an error.
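In other words, when the last argument is an existing directory, cp copies each of the named files into it, keeping their names. A quick sketch of that rule (this is not the original-*.dat naming we actually want here):

$ mkdir backup
$ cp basilisk.dat unicorn.dat backup
$ ls backup
basilisk.dat unicorn.dat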
Instead, we can use a loop to do some operation once for each thing in a list. Here’s a simple example that displays the first three lines of each file in turn:
$ for filename in basilisk.dat unicorn.dat
> do
> head -n 3 $filename
> done
COMMON NAME: basilisk
CLASSIFICATION: basiliscus vulgaris
UPDATED: 1745-05-02
COMMON NAME: unicorn
CLASSIFICATION: equus monoceros
UPDATED: 1738-11-24
When the shell sees the keyword for
, it knows it is supposed to repeat a command (or group of commands) once for each thing in a list. In this case, the list is the two filenames. Each time through the loop, the name of the thing currently being operated on is assigned to the variable called filename
. Inside the loop, we get the variable’s value by putting $
in front of it: $filename
is basilisk.dat
the first time through the loop, unicorn.dat
the second, and so on.
By using the dollar sign we are telling the shell interpreter to treat filename
as a variable name and substitute its value on its place, but not as some text or external command. When using variables it is also possible to put the names into curly braces to clearly delimit the variable name: $filename
is equivalent to ${filename}
, but is different from ${file}name
. You may find this notation in other people’s programs.
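Here is a small sketch of why the braces matter (the _copy suffix is purely illustrative): without the braces, the shell would look for a variable named filename_copy, which doesn’t exist, instead of appending _copy to the value of filename.

$ for filename in basilisk.dat unicorn.dat
> do
> echo ${filename}_copy
> done
basilisk.dat_copy
unicorn.dat_copy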
Finally, the command that’s actually being run is our old friend head
, so this loop prints out the first three lines of each data file in turn.
Follow the Prompt
The shell prompt changed from $ to > and back again as we were typing in our loop. The second prompt, >, is different to remind us that we haven’t finished typing a complete command yet. A semicolon, ;, can be used to separate two commands written on a single line.
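For example, the loop we just ran could equally be typed on a single line (a sketch; both forms behave identically):

$ for filename in basilisk.dat unicorn.dat; do head -n 3 $filename; done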
Imagine we want to extract the second line from each *.dat
file. Here’s a slightly more complicated loop:
for filename in *.dat
do
head -n 2 $filename | tail -n 1
done
The shell starts by expanding *.dat
to create the list of files it will process. The loop body then executes the command for each of those files.
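If the creatures directory still contains just the two example files, the output is the second line of each file (the CLASSIFICATION lines we saw in the head -n 3 output above):

CLASSIFICATION: basiliscus vulgaris
CLASSIFICATION: equus monoceros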
We can redirect the output of a loop to a file like we did before with wc
:
for filename in *.dat
do
head -n 2 $filename | tail -n 1
done > classifications.txt
Measure Twice, Run Once
A loop is a way to do many things at once — or to make many mistakes at once if it does the wrong thing. One way to check what a loop would do is to echo the commands it would run instead of actually running them. For example, we could write our file copying loop like this:
for filename in *.dat
do
  echo cp $filename original-$filename
done
Instead of running
cp
, this loop runs echo
, which prints out:

cp basilisk.dat original-basilisk.dat
cp unicorn.dat original-unicorn.dat
without actually running those commands. We can then use up-arrow to redisplay the loop, back-arrow to get to the word
echo
, delete it, and then press Enter to run the loop with the actual cp
commands. This isn’t foolproof, but it’s a handy way to see what’s going to happen when you’re still learning how loops work.
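Once the echoed commands look right, the loop we actually run is the same one with echo removed:

for filename in *.dat
do
  cp $filename original-$filename
done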
Learning Objectives
- Write a shell script that runs a command or series of commands for a fixed set of files.
- Run a shell script from the command line.
- Write a shell script that operates on a set of files defined by the user on the command line.
- Create pipelines that include shell scripts you, and others, have written.
We are finally ready to see what makes the shell such a powerful programming environment. We are going to take the commands we repeat frequently and save them in files so that we can re-run all those operations again later by typing a single command. For historical reasons, a bunch of commands saved in a file is usually called a shell script, but make no mistake: these are actually small programs.
Let’s start by going back to molecules/
and putting the following line into a new file, middle.sh
:
$ cd molecules
$ nano middle.sh
The command nano middle.sh
opens the file middle.sh
within the text editor “nano” (which runs within the shell). If the file does not exist, it will be created. We can use the text editor to directly edit the file. We’ll simply insert the following line:
head -n 15 octane.pdb | tail -n 5
This is a variation on the pipe we constructed earlier: it selects lines 11-15 of the file octane.pdb
. Remember, we are not running it as a command just yet: we are putting the commands in a file.
Then we save the file (using Ctrl-O), and exit the text editor (using Ctrl-X). Check that the directory molecules
now contains a file called middle.sh
.
Once we have saved the file, we can ask the shell to execute the commands it contains. Our shell is called bash
, so we run the following command:
$ bash middle.sh
ATOM 9 H 1 -4.502 0.681 0.785 1.00 0.00
ATOM 10 H 1 -5.254 -0.243 -0.537 1.00 0.00
ATOM 11 H 1 -4.357 1.252 -0.895 1.00 0.00
ATOM 12 H 1 -3.009 -0.741 -1.467 1.00 0.00
ATOM 13 H 1 -3.172 -1.337 0.206 1.00 0.00
Sure enough, our script’s output is exactly what we would get if we ran that pipeline directly.
Text vs. Whatever
We usually call programs like Microsoft Word or LibreOffice Writer “text editors”, but we need to be a bit more careful when it comes to programming. By default, Microsoft Word uses
.docx
files to store not only text, but also formatting information about fonts, headings, and so on. This extra information isn’t stored as characters, and doesn’t mean anything to tools like head
: they expect input files to contain nothing but the letters, digits, and punctuation on a standard computer keyboard. When editing programs, therefore, you must either use a plain text editor, or be careful to save files as plain text.
What if we want to select lines from an arbitrary file? We could edit middle.sh
each time to change the filename, but that would probably take longer than just retyping the command. Instead, let’s edit middle.sh
and replace octane.pdb
with a special variable called $1
:
$ nano middle.sh
Now, within “nano”, replace the text octane.pdb
with the special variable called $1
:
head -n 15 "$1" | tail -n 5
Inside a shell script, $1
means “the first filename (or other parameter) on the command line”. We can now run our script like this:
$ bash middle.sh octane.pdb
ATOM 9 H 1 -4.502 0.681 0.785 1.00 0.00
ATOM 10 H 1 -5.254 -0.243 -0.537 1.00 0.00
ATOM 11 H 1 -4.357 1.252 -0.895 1.00 0.00
ATOM 12 H 1 -3.009 -0.741 -1.467 1.00 0.00
ATOM 13 H 1 -3.172 -1.337 0.206 1.00 0.00
or on a different file like this:
$ bash middle.sh pentane.pdb
ATOM 9 H 1 1.324 0.350 -1.332 1.00 0.00
ATOM 10 H 1 1.271 1.378 0.122 1.00 0.00
ATOM 11 H 1 -0.074 -0.384 1.288 1.00 0.00
ATOM 12 H 1 -0.048 -1.362 -0.205 1.00 0.00
ATOM 13 H 1 -1.183 0.500 -1.412 1.00 0.00
Double-Quotes Around Arguments
We put the
$1
inside of double-quotes in case the filename happens to contain any spaces. The shell uses whitespace to separate arguments, so we have to be careful when using arguments that might have whitespace in them. If we left out these quotes, and $1
expanded to a filename like methyl butane.pdb
, the command in the script would effectively be:

head -n 15 methyl butane.pdb | tail -n 5
This would call
head
on two separate files, methyl
and butane.pdb
, which is probably not what we intended.
We still need to edit middle.sh
each time we want to adjust the range of lines, though. Let’s fix that by using the special variables $2
and $3
for the number of lines to be passed to head
and tail
respectively:
$ nano middle.sh
head -n "$2" "$1" | tail -n "$3"
We can now run:
$ bash middle.sh pentane.pdb 15 5
ATOM 9 H 1 1.324 0.350 -1.332 1.00 0.00
ATOM 10 H 1 1.271 1.378 0.122 1.00 0.00
ATOM 11 H 1 -0.074 -0.384 1.288 1.00 0.00
ATOM 12 H 1 -0.048 -1.362 -0.205 1.00 0.00
ATOM 13 H 1 -1.183 0.500 -1.412 1.00 0.00
By changing the arguments to our command we can change our script’s behaviour:
$ bash middle.sh pentane.pdb 20 5
ATOM 14 H 1 -1.259 1.420 0.112 1.00 0.00
ATOM 15 H 1 -2.608 -0.407 1.130 1.00 0.00
ATOM 16 H 1 -2.540 -1.303 -0.404 1.00 0.00
ATOM 17 H 1 -3.393 0.254 -0.321 1.00 0.00
TER 18 1
This works, but it may take the next person who reads middle.sh
a moment to figure out what it does. We can improve our script by adding some comments at the top:
$ nano middle.sh
# Select lines from the middle of a file.
# Usage: middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
A comment starts with a #
character and runs to the end of the line. The computer ignores comments, but they’re invaluable for helping people understand and use scripts.