Unix Systems

As a data scientist it's almost impossible to avoid Unix systems. Unix vs Windows isn't a question of preference like R vs Python; if you don't know Unix then you're likely to struggle in the workplace. Unix is the operating system of data science, so you need to know how to use it.

One of the early stereotypes about a career in data science was that it was a ticket to being given a MacBook for work. This is because macOS is itself a Unix system: it is built on top of Apple's Darwin operating system, which is a Unix system. As a result, macOS provides the best of both worlds for a data scientist - a clean and user-friendly operating system with a powerful and well-supported Unix system underneath.

This isn't to say that you must use macOS in order to be a successful data scientist - it is perfectly acceptable to use a Windows computer as long as you also know how to work with Unix systems. You can also install Linux on most "Windows" computers, which gives you all the benefits of a Unix-like system, with a slight trade-off in terms of polish and ease-of-use compared to macOS.

You may encounter Unix systems as a data scientist:

  • when working on the command line using macOS or Linux operating systems
  • when working with remote servers in your employer's data centre
  • when working with remote servers provided by AWS, Azure, GCP, etc
  • when working with shared Jupyter or RStudio servers (these are hosted on Unix servers)

Whilst not related to data science, you are probably also using Unix-like systems every time you browse the internet (most servers on the internet run a Unix-like operating system, usually Linux) and every time you use your phone (iOS shares its Darwin foundation with macOS, and Android is built on the Linux kernel).

What is Unix?

Unix is a family of operating systems that derive from the original Unix operating system developed at AT&T Bell Labs in the 1970s. Some of the more well-known Unix operating systems include macOS, Solaris, BSD, IBM AIX and HP-UX. Linux is technically a Unix clone rather than a descendant of the original Unix, which is why it is usually described as Unix-like; for all practical purposes you can treat Linux distributions (including Ubuntu, Red Hat, etc.) as Unix systems.

Unix systems are characterized by a modular design that is sometimes called the "Unix philosophy". This concept entails that the operating system provides a set of simple tools that each performs a limited, well-defined function, with a unified filesystem (the Unix filesystem) as the main means of communication, and a shell scripting and command language (the Unix shell) to combine the tools to perform complex workflows. Unix distinguishes itself from its predecessors as the first portable operating system: almost the entire operating system is written in the C programming language, thus allowing Unix to reach numerous platforms.

Wikipedia: Unix

Understanding these "simple tools" is the key to working productively with Unix. For example:

  • learning bash (the Bourne Again SHell) lets you write simple scripts and navigate the Unix file system
  • learning ssh lets you connect to other Unix machines.
  • learning cat lets you print the contents of a file to the terminal
  • learning less lets you interactively scroll through long documents

This chapter will focus on teaching you how to navigate the Unix filesystem, how to use these "simple tools" to build complex workflows and get things done, and where to go when you need help.

You might have realised by now that "using Unix" means many different things depending on the context. Technically, every time you use a MacBook you are using Unix, but learning to use Safari to browse Facebook isn't exactly going to help you with data science projects.

Some examples of using Unix for data science include:

  • using ssh to connect to a remote analytics server
  • using scp to copy files between machines
  • using a package manager (e.g. brew, apt, yum) to install command line tools
  • using command line tools to interact with third-party services
  • using wget to download files from the internet
  • using top to monitor system resource utilisation
  • using docker to manage and run containers

Of course it is possible to perform some of these tasks using Graphical User Interfaces (GUIs) without using the Unix command line. However, there are a few key advantages to using the Unix CLI:

  1. New data science tools are normally written for Unix, and normally require the use of a Command Line Interface (CLI)
  2. Most tools with GUIs only let you use the most popular functionality - power users normally require access to the CLI to get all of the features they need
  3. Once you get the hang of them, most of the command line tools are much faster than the alternatives
  4. If you start writing your own tools for others to use, it is far harder to create a GUI application than a command line interface.

And of course there are many times when there is no alternative but to use the command line to achieve something - for example if you want to run an R or Python script on a remote server then you'll need to use ssh to connect to the server and run the script.

Practicing Unix

If you're using a macOS computer then you already have Unix - you can simply open Terminal.app and you'll see a new window open where you can type and execute shell commands. Recent versions of macOS use zsh rather than bash as the default shell, but the commands in this chapter behave the same way in both. This also applies if you're using Linux - look for an application called Terminal to get access to the command line.

If you're using Windows then you'll have to do a little more work to get access to a Unix-like system for practicing.


The easiest option is to install VirtualBox which lets you install and run virtual machines on your computer. Once you have installed VirtualBox, go to the Ubuntu Server page and download Ubuntu Server 18.04.2 LTS. While that file is downloading you can get ready to start your virtual machine:

  1. open VirtualBox and click New to create a new Virtual machine
  2. give your virtual machine a name (e.g. "My Linux Machine")
  3. in the drop-down box for Type select Linux
  4. in the drop-down box for Version select Ubuntu (64-bit)
  5. click Continue
  6. when asked about memory size, select 1024MB and click Continue
  7. when asked about a virtual hard disk, select create a virtual hard disk now and then click Create
  8. when asked about hard disk file type, select VDI (VirtualBox Disk Image) and then click Continue
  9. when asked about storage on physical hard disk, select dynamically allocated and then click Continue
  10. when asked about file location and size, accept the default values and then click Create

You will then be taken back to the main screen of VirtualBox and you will see your new virtual machine is in the "off" state. Click Start to start the virtual machine, then VirtualBox will ask you for the location of a "virtual optical disk file" - this is referring to the .iso file you downloaded from Ubuntu earlier. Navigate to the location of this downloaded file then click Start.

You will see the machine start running within a few seconds, and it will begin installing the Ubuntu operating system for you. Eventually it will start asking you some questions about language - use the keyboard and follow the prompts to complete the installation. You can accept all of the default values until you get to the profile setup screen, which asks for your name, a server name, a username and a password.

You can enter whatever you like for these fields, making a note of your username and password so that you can log in to your virtual machine. You will need to use the tab key or arrow keys to move between text fields.

On the next page you may choose to install OpenSSH server (you do not need to, but you can use this for practice too if you like). Use the arrow keys to navigate to Done then press enter.

On the next screen, use the arrow keys to navigate to Done, then press enter. The installer will then take another minute or so to complete the installation, before prompting you to reboot. Press enter to reboot the virtual machine. The restart will pause to ask you to remove the installation CD - just press enter to proceed.

When the machine reboots it will pause when it is ready - press enter to bring up the username prompt. If you named your virtual machine linux-vm then your prompt will read "linux-vm login:".

If you used a different virtual machine name, then that name will appear here. To log in to your system:

  1. Enter your username, then press enter
  2. Enter your password, then press enter

You will then see about 20 lines of welcome information followed by a command prompt, which should look something like username@linux-vm:~$ (with your own username and machine name).

You're now ready to practice using the Unix command line.

We will also look at how to connect to remote Unix servers using ssh (Mac) or PuTTY (Windows) later in this chapter.

Throughout this chapter we'll also use an embedded shell from repl.it to practice using the shell in a controlled environment. Try typing echo Hello World! into the window below, and then pressing Enter.

The Unix Shell

The shell gets its name from the notion that it is a user-facing "shell around the computer's whirring innards". The idea is that the designers of the operating system don't want users tinkering with the internals of the operating system, so they built a shell around it, and expect users of the system to interact with the computer using that shell.

When people say "shell" these days they are almost always referring to the Bourne Again SHell (bash), which is effectively a global standard across all of the major Unix systems. From the user's point of view bash is an example of a REPL - a Read-Evaluate-Print Loop. If you haven't used command line tools before then this is a useful term to help understand how to use all interactive command line tools, including bash, R and Python. Every time you press enter in one of these interactive tools, the following events happen:

  1. the shell reads your command
  2. the shell evaluates your command, and calculates the output
  3. the shell prints the output to the terminal
  4. the shell prints a new prompt, ready for your next command

You can practice this using the shell below. Try running a few of these commands one after another (we'll cover what they mean in the next section):

  • ls
  • pwd
  • ps

You can see how the bash shell is just repeating these same steps over and over again:

read -> evaluate -> print
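
To make the loop concrete, here is a toy version written in the shell itself. It is only a sketch: mini_repl is a made-up name for this illustration, and real bash does a great deal more (prompts, history, error handling). The structure is the same though: read a line, evaluate it, let the output print, then go around again.

```shell
# A toy read-evaluate-print loop. mini_repl is a hypothetical name used
# only for this illustration; it is not a real Unix command.
mini_repl() {
    # Read: take one line of input at a time until the input runs out.
    while IFS= read -r line; do
        # Evaluate and print: run the line as a command. Whatever the
        # command prints goes straight to the terminal.
        eval "$line"
        # Loop: go back to the top and wait for the next line.
    done
}

# Feed the loop two commands, as if they had been typed one after another.
printf 'echo first command\necho second command\n' | mini_repl
```

Each line is handled completely before the next one is read, which is why the output of one command always appears before the next command runs.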

Basic commands

Commands can range from simple instructions (e.g. ls which lists the files in your current working directory) all the way to complex scripts with hundreds of commands. For this course we'll only look at how to use these simple commands to help you work in Unix environments.

Where am I?

The Unix file system is a little easier to understand than the Windows file system. At the very root of the file system is / - you can think of this like C:\ in Windows systems. This means that all absolute file paths in Unix start with / in the same way that absolute Windows file paths begin with a drive letter such as C:\. As a user of Unix systems you typically don't need to worry about which disk you are using, you only need to think about your location in the file system.

Folders are also separated in Unix using the forward slash / - this is different to Windows where the backslash \ is used to separate directories in a file path.

Sometimes you will also see Unix file paths end with another / - this is the same in Windows systems where file paths will often end with \. It is not required but is often used to show that the location is a directory rather than a file. As this is optional you do not need to type it, however it is a useful convention as it makes it clear to anyone reading that you're referring to a directory rather than a file.

These are all examples of valid directory paths on my macOS (Unix) computer:

  • /Users/perrystephenson/
  • /Users/perrystephenson/code
  • /usr/bin/

and these are examples of valid file paths on my macOS (Unix) computer:

  • /Users/perrystephenson/tweets.csv
  • /Users/perrystephenson/data/faces/training/s1/1.pgm
  • /Users/perrystephenson/code/dsp/10-Unix-Systems.Rmd (this file!)

The biggest difference between macOS and Linux systems in terms of the file system is the way the user folders are arranged. In macOS systems the user folders are all stored inside /Users/, whilst in many Linux systems (particularly Ubuntu) the user folders are stored inside /home/.

Now that you understand what the file system is, how do you know where you are? In a terminal, you can use the pwd command (print working directory), which will tell you where you are. Try typing pwd into the terminal below and then pressing enter - it should tell you that you are located at /home/runner.

The working directory is set to /home/runner because (in these embedded examples) we're running commands as a user called runner, and this is the home directory for the runner user.

Home Directories

A "home directory" in Unix systems is similar in concept to "My Documents" in Windows. It is a folder where you can store anything you like, and by default it is normally private and not accessible by any other users of the computer.

The home directory is used so often in Unix systems that it has its own shortcut: ~. If you type echo ~ into the shell above you will see that it prints /home/runner again because ~ is just a shortcut to that folder. You can use ~ anywhere you would normally use a file path. For example, using the ~ (tilde) shortcut to show the locations of the three example files above:

  • ~/tweets.csv
  • ~/data/faces/training/s1/1.pgm
  • ~/code/dsp/10-Unix-Systems.Rmd

This has some advantages and drawbacks, besides the obvious reduction in typing. These file locations are now specific to my user account, which means that whilst they work for me they will not work for anyone else on my computer (because the ~ shortcut will point to their own folder, not mine). This can cause issues! On the other hand if I am writing an R script and want to write a temporary file to disk, I can write it to ~/temp.rds and be confident that no matter who runs the script, it will store the file in that user's home directory rather than my own.

The key takeaway here is that ~ is a shortcut to your home directory.
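
If you want to check this on your own machine, the following sketch prints your working directory and shows what ~ expands to. The HOME variable is not covered elsewhere in this chapter; it is an environment variable that holds the same path, and is included here only for comparison.

```shell
# Print the current working directory.
pwd

# The shell replaces ~ with your home directory before the command runs,
# so these two lines print the same path.
echo ~
echo "$HOME"

# ~ can be used at the start of any path. tweets.csv is just the example
# file name from above; it does not need to exist for echo to work.
echo ~/tweets.csv
```

The output will differ from user to user, which is exactly the point: the same command resolves to a different folder depending on who runs it.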

What is here?

One of the first things you might want to do is look around and see what is in your folder. To do this you'll use the ls command (list directory contents). Try running the ls command in the shell below.

You should see three files listed: file_2.R, file_3.py and main.sh. If you want to see more detail about these files, you can use ls -l to print the output in "long" format. Try this again using the shell above.

You can now see lots more information about the files in this directory, including:

  • File permissions (beyond the scope of this course)
  • File ownership (beyond the scope of this course)
  • File size in bytes
  • Last modified date and time
  • File name
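
To try the same comparison on your own machine, the sketch below creates a throwaway directory containing two small files and lists it both ways. The file names are arbitrary, and mktemp, echo and > are used only to set up something to look at.

```shell
# Work in a throwaway directory so nothing else is listed.
sandbox="$(mktemp -d)"
cd "$sandbox"

# Create two small files to look at (the names are arbitrary).
echo 'print("hello")' > analysis.py
echo 'x <- 1' > model.R

# Short listing: file names only.
ls

# Long listing: permissions, owner, size in bytes, modified time, name.
ls -l
```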

Arguments

When you type ls -l, you are providing an argument to the ls command. In general, when you execute a command in Unix systems, everything you type after the name of the command is an argument and you can provide more than one if needed.

Arguments allow you to:

  • provide instructions to the command about how you want it to work
  • provide input data to the command

Most Unix commands have more arguments than anyone could possibly remember, so for this course we'll only learn about the useful arguments as they are needed.

There is one argument you should remember because it works for almost every Unix command: --help. This argument prints the help documentation for the command to the terminal. You can try this out in the shell below using ls --help (you will need to scroll up to read it all). If a command does not support --help, which is the case for some commands on macOS, you can use man ls instead, which opens the manual for ls in a file viewer (press q to quit).

Moving around

So far we've learned how to see where we are (pwd) and look at which files are in the directory (ls). The next thing we need to learn to do is move around - this is done using the cd command (change directory).

If you use ls in the shell below, you'll see that there are two directories - dir1 and dir2. If you use pwd you'll also see that we're in the same directory as before - /home/runner.

To move into the dir1 directory you just need to type cd dir1 and press enter. If you then type pwd again you'll see that you're now in /home/runner/dir1 - success! You can use ls to look around and see that there are two files in this folder: file1 and file2.

To move back "up" one level to the /home/runner directory, you can use one of two approaches:

  • use cd /home/runner to change to the directory using the full path
  • use cd .. to move "up" one directory using the .. shortcut

There is an even easier shortcut for going back to your home directory:

  • use cd with no arguments to go straight back to your home directory.

For practice:

  • use pwd to confirm that you are back in /home/runner
  • use cd to navigate inside dir2
  • use ls to list the files inside that directory
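
The whole sequence can be sketched as a short script to run on your own machine. It builds its own dir1 and dir2 inside a throwaway directory (using mkdir, which is covered later in this chapter) so that it does not depend on the embedded shell.

```shell
# Set up a throwaway directory containing dir1 and dir2.
sandbox="$(mktemp -d)"
mkdir "$sandbox/dir1" "$sandbox/dir2"

cd "$sandbox"      # absolute path: works from anywhere
cd dir1            # relative path: move down into dir1
pwd                # the path now ends in /dir1
cd ..              # move up one level
cd dir2            # move down into dir2
pwd                # the path now ends in /dir2
cd                 # no arguments: straight back to your home directory
pwd
```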

File manipulation

Now that we can move around the file system, we need to learn to make changes to files using the bash shell. We're going to use three commands: cp (copy), mv (move and rename) and rm (remove).

We will look at each of these commands in detail.

Copy (cp)

The cp command copies a file from one location to another. The syntax for this command is cp [from] [to].

Using the shell below, you can use the following commands to copy two files from inside dir1 into the home directory:

  1. cp /home/runner/dir1/file1 /home/runner/
  2. cp /home/runner/dir1/file2 /home/runner/

You can now use ls to confirm that these two copy operations worked correctly and that copies of the files are now in your home directory.

This is the safest way to copy, however there are some shortcuts that you can use to make it much faster to type:

  • Instead of using full paths in the from argument, you can use relative paths. Assuming that your working directory is /home/runner you can use dir1/file1 instead of typing out the whole path.
  • Instead of typing out the current directory in the to argument, you can use the . shortcut. A single dot is a shortcut to your current working directory, so in this case you can replace /home/runner/ with a single dot (.).

Using both of these together, the two copy commands above become "cp dir1/file1 ." and "cp dir1/file2 ." (note the space before the final dot).

You can also use cp to copy whole directories. Because this requires cp to scan the whole directory (and any subdirectories) and then copy each and every file, you need to use the -r argument to tell cp to copy recursively.

For example, to copy dir2 inside dir1, you could use cp -r dir2 dir1 (and then use cd and ls to check that it worked).
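
Here is a sketch that pulls these copy operations together, for running on your own machine. It creates its own dir1 and dir2 in a throwaway directory; the file names and contents are made up for the illustration.

```shell
# Set up a throwaway directory with something to copy.
sandbox="$(mktemp -d)"
mkdir "$sandbox/dir1" "$sandbox/dir2"
echo "one" > "$sandbox/dir1/file1"
echo "two" > "$sandbox/dir2/notes.txt"
cd "$sandbox"

# Relative source path, and . for "the current working directory".
cp dir1/file1 .

# Copy and rename in one step by giving a new file name as the target.
cp dir1/file1 file1_backup

# Copy a whole directory (and everything inside it) into dir1.
cp -r dir2 dir1

ls          # now contains dir1, dir2, file1 and file1_backup
ls dir1     # now contains dir2 and file1
```

Note that the originals are untouched: dir2 still exists at the top level, and dir1/file1 is still in place.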

Move (mv)

The mv command moves a file from one location to another. You can think of it as being just like the cp command, except that it deletes the original file/directory. You can try this out using the shell below - use the command mv dir2 dir1 to move dir2 inside dir1. Use ls and cd to confirm that dir2 has been removed from your home directory, and is now inside dir1.

Note that you do not need to use the -r argument when moving directories.

Rename (mv)

You can also use the mv command to rename files and directories. You can think of this like "moving to a new name", because you can move and rename at the same time. Use the shell below to try using mv to rename:

  • to rename dir2 to dir3, use mv dir2 dir3
  • to rename main.sh to new.sh and move it inside dir1, use mv main.sh dir1/new.sh

Use ls and cd to move around and confirm that the renaming and moving have worked as expected.

Note that you can also rename files and directories during cp operations, using the same syntax.
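
The moves and renames described above can be sketched as follows, again using a throwaway directory on your own machine. The setup lines recreate a dir1, dir2 and main.sh similar to those in the embedded shell; their contents are made up.

```shell
# Set up a throwaway directory with two directories and one file.
sandbox="$(mktemp -d)"
mkdir "$sandbox/dir1" "$sandbox/dir2"
echo 'echo "hello"' > "$sandbox/main.sh"
cd "$sandbox"

# Rename a directory: "move" dir2 to the new name dir3.
mv dir2 dir3

# Move a directory inside another one. dir1 already exists, so dir3
# ends up at dir1/dir3. No -r argument is needed.
mv dir3 dir1

# Move and rename a file in a single step.
mv main.sh dir1/new.sh

ls          # only dir1 is left at the top level
ls dir1     # dir3 and new.sh
```

The second mv shows why the target matters: if the target is an existing directory the source is moved inside it, otherwise the source is renamed to the target.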

Remove (rm)

The rm command removes files and directories.

  • To remove a file, use rm [filename]
  • To remove a directory, use rm -rf [directory]

The -rf argument combines two options. The r is the same recursive option used with the cp command, and it is what allows rm to delete a directory along with everything inside it. The f means force: it tells rm to skip any confirmation prompts and to ignore files that do not exist. Deleting a whole directory is a potentially dangerous operation, so only use -rf when you're really sure about what you are doing.

Use the shell below to:

  • remove main.sh (rm main.sh)
  • remove dir2 (rm -rf dir2)

You can then use ls to confirm that both the file and the directory have been removed.

Be careful when using rm

Unlike Windows Explorer and macOS Finder, there is no Recycle Bin when using the shell. Once you delete a file or a directory you will never be able to get it back.

This is especially dangerous because it would be very easy to make a typo and run rm -rf /, which would delete every file on your computer. Make sure you never do this!
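
A safe way to practise is to work inside a throwaway directory, as in the sketch below. The file and directory names are made up; the habit worth copying is running ls on a path before passing the same path to rm.

```shell
# Set up a throwaway directory with one file and one non-empty directory.
sandbox="$(mktemp -d)"
mkdir "$sandbox/dir2"
echo "scratch" > "$sandbox/dir2/old.csv"
echo 'echo "hello"' > "$sandbox/main.sh"
cd "$sandbox"

# Look before you delete: list exactly what the path refers to.
ls dir2

# Remove a single file.
rm main.sh

# Remove a directory and everything inside it.
rm -rf dir2

# Nothing is left in the sandbox, so this prints nothing.
ls
```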

Directories

The mkdir command lets you make a directory inside your current working directory. To create a new directory newdir inside the dir1 directory, you can use cd dir1 to change your working directory to dir1, then use mkdir newdir to create the new directory. Use ls to confirm that the command was successful.
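
As a sketch of the same steps on your own machine (the directory names are arbitrary), note that mkdir also accepts a path, so changing directory first is optional:

```shell
# Start in a throwaway directory.
sandbox="$(mktemp -d)"
cd "$sandbox"
mkdir dir1            # create dir1 in the current working directory

cd dir1
mkdir newdir          # create newdir inside dir1
ls                    # lists newdir

# mkdir also accepts a path, so you can create a directory somewhere
# else without changing your working directory first.
cd ..
mkdir dir1/another
ls dir1               # lists another and newdir
```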

Working with files

Now you know how to manipulate files in the Unix filesystem, but what good is this if you can't read or edit files? This comes up all the time in the workplace:

  • Working on a remote server (via SSH, which we will cover below) and you want to see the code in a script
  • Working on a remote server and you want to make a small edit to a script
  • Inspecting the output of a script and you want to check the first few lines in a very long CSV
  • Checking the contents of a configuration file

Printing a file

The cat command lets you concatenate and print files to the screen. The syntax is cat [file1] [file2] [file3] ... and the command will print one file after another. Of course the concatenation feature is rarely used, and the most common usage of cat is printing a single file to the screen.

Using the shell below, use cat .gitignore to print the contents of .gitignore to the screen.

You should see that the file contains a list of files and folders to be excluded from Git commits.
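
The concatenation feature is easier to see with two small files. The sketch below is for your own machine; the file names and contents are made up, and the > symbol (which sends output to a file instead of the screen) is an extra that is not otherwise covered in this section.

```shell
# Create two small files in a throwaway directory.
sandbox="$(mktemp -d)"
cd "$sandbox"
printf 'id,name\n' > header.csv
printf '1,Ada\n2,Grace\n' > rows.csv

# Print a single file.
cat header.csv

# Concatenate: print header.csv, then rows.csv, as one stream.
cat header.csv rows.csv

# > sends that combined output to a new file instead of the screen.
cat header.csv rows.csv > combined.csv
cat combined.csv
```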

Viewing a long file

You will regularly need to read files which are far longer than you can comfortably view in a terminal window. For these files you can use more or less.

The more command lets you view the first screen-full of content, and then lets you scroll through the document using space (which jumps a whole screen ahead) or enter (which jumps a line at a time). To use the more command to view a very long file in the shell below, type more LICENSE.md.

When you've finished viewing the document, press q to quit and return to the command prompt (or simply press space until you get to the end of the file).

The less command provides additional functionality, because it also lets you scroll backwards through a document. It's also a bit easier to use because it lets you use the arrow keys to navigate the document. less is not installed on the embedded shell above, however you will find it on almost every system you encounter in real life.

If you just want to look at a few lines of a document, you can use the head or tail commands which let you view the top or bottom of a document quickly.

  • head rstats_tweets_2017.csv will print the first 10 lines of the file
  • head -n 1 rstats_tweets_2017.csv will print the first line of the file (often shortened to head -1)
  • tail -n 1 rstats_tweets_2017.csv will print the last line of the file (often shortened to tail -1)

You can try each of these commands in the shell above.
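
If you don't have a long CSV to hand, the sketch below builds a 25-line file on your own machine and then inspects it. The file name is arbitrary, and seq (which prints a sequence of numbers) is used only to generate some rows. With no arguments, head prints the first 10 lines.

```shell
sandbox="$(mktemp -d)"
cd "$sandbox"

# Build a 25-line file: a header line followed by the numbers 1 to 24.
echo "row" > numbers.csv
seq 1 24 >> numbers.csv

head numbers.csv          # first 10 lines (the default)
head -n 1 numbers.csv     # first line only: the header
tail -n 1 numbers.csv     # last line only: 24
tail -n 3 numbers.csv     # last three lines: 22, 23 and 24
```

Checking the first line with head -n 1 is a quick way to see the column names of a CSV without opening the whole file.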

Editing a file

Editing files in the shell can be fairly frustrating, and most people generally try to avoid it when possible. If you do find yourself with a need to edit a file directly in the shell (for example when tinkering with something on a remote server) you can normally use one of the following command line tools:

  • nano - easiest to use, but less powerful than vim. Not always installed.
  • vim - very powerful, but hard to use. Installed by default on most systems.
  • vi - prehistoric precursor to vim. Only use this if nano and vim are unavailable on your system.

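In all cases, opening a file in a text editor is as simple as typing the name of the editor followed by the name of the file: nano [filename], vim [filename] or vi [filename].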

Saving and closing is fairly straight-forward in nano (the keyboard commands are displayed on screen at all times) and basically impossible to remember in vim/vi. If you are going to need to edit files in the shell regularly it is probably worth learning at least some of the basic commands in vim, which can be done by using the vimtutor program installed alongside vim.

We've surpassed the capabilities of the embedded bash shell so you will not be able to try using any of these tools without crashing the shell. You can try them on your own machine - nano and vim are both installed on macOS systems by default, and should be installed on most Linux distributions if you're running Linux using a virtual machine on Windows.

Creating a file

As with editing, most of the time you will want to avoid using the command line to create new files. However I will show you two ways to do this so that you know what to search on Google if you ever need to do this.

You can run both of these commands in the shell below, and then use ls and cat to inspect the files you have created.
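
As a sketch, two common ways to create a file from the command line are touch and output redirection with >. These are offered as typical approaches rather than the only ones, and the file names are made up.

```shell
sandbox="$(mktemp -d)"
cd "$sandbox"

# touch creates an empty file (or, if the file already exists, just
# updates its "last modified" time).
touch empty.txt

# > redirects the output of a command into a file, creating the file if
# needed and overwriting it if it already exists.
echo 'Hello World!' > hello.txt

ls
cat hello.txt
```

Because > overwrites without asking, it is worth checking with ls that the target name is not already in use.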

Connecting to servers

One of the most powerful features of the Unix command line is the ability to easily and seamlessly connect to remote servers; this is achieved using a tool called Secure Shell (SSH). SSH is a very powerful tool and it probably requires a whole course all by itself just to cover the most common features. In general however, the basic workflow for SSH looks something like this:

  1. Use ssh <username>@<server address> to connect to the server (if you leave out the username, ssh will use your username on your own machine)
  2. Enter your password when prompted
  3. You will be connected to the remote server, and probably see a welcome message
  4. You can use the bash shell to interact with the server, including launching applications (like ls, pwd, less and others that we have learned about), launching scripts in R or Python (we'll cover this in the next section), and just about anything else you can do on the command line on your own machine
  5. When you're finished, use exit to close the connection and return to your own machine.

Many organisations will have some additional security controls in place when using SSH. Common security controls include:

  • Banning the use of usernames and passwords to log in (you will need to use a set of cryptographic keys to connect)
  • Using a "jump host" to let users connect to servers in secure environments - in this case you just need to connect to the jump host using SSH and then use SSH again to connect to the server you need access to.
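
For reference, the commands involved look roughly like the sketch below. Every name in it (alice, analytics.example.com, jump.example.com, results.csv) is a hypothetical placeholder, and the -J option for jump hosts assumes a reasonably recent OpenSSH client. The script only prints the commands, so running it does not contact any server.

```shell
# All of these values are hypothetical placeholders.
user="alice"
server="analytics.example.com"
jump="jump.example.com"

# Build each command as a string and print it, so that nothing is
# contacted when this sketch is run. To use one for real, type the
# printed command at your own prompt with your own details.
connect_cmd="ssh ${user}@${server}"
jump_cmd="ssh -J ${user}@${jump} ${user}@${server}"
copy_cmd="scp results.csv ${user}@${server}:~/"

echo "$connect_cmd"   # connect directly to the server
echo "$jump_cmd"      # connect to the server via the jump host
echo "$copy_cmd"      # copy a local file to your remote home directory
```

Your organisation's own instructions take precedence over this sketch, particularly where cryptographic keys are required.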

Due to these differences in how most organisations use SSH we won't be practicing how to use it as part of this course. If you do need to use it in the workplace, most organisations will have instructions on how to connect to servers using SSH.

You can connect to remote Unix systems from Windows PCs using a tool called PuTTY. It is generally a little harder to use than SSH (one of the many cases where command line tools are easier to use than GUI tools!). As with SSH, if you're planning to use PuTTY at work then you will likely find someone else has already prepared information to help you connect.

Working with R and Python

One common task that you will likely want to perform over and over again when using the command line is running R and Python scripts, or interactively using the R and Python REPLs.

R scripts

R scripts are executed using the Rscript program, which is installed whenever you install R. To run a script called my_script.R from the command line, you just need to type Rscript my_script.R.

This will run the entire R program from start to finish, and will print any outputs generated (normally using print() or message() statements) to the terminal.

R REPL

You can run R interactively by just typing R in the shell - you can quit the R application by running q() in the R REPL. You will find that running R in this way is pretty unpleasant (no RStudio!) however it can be really useful for troubleshooting R code on remote machines. Using R interactively can help you identify missing packages, environmental differences, or other things which could be causing unintended operation of your scripts.

R Shiny

If you want to run R Shiny in such a way that colleagues can see your work, then you will probably want to install Shiny Server on a Linux machine. This is beyond the scope of the course, however installation instructions are available in the Shiny Server documentation.

Python scripts

Running a Python script from the command line is just like running an R script, except that you use the main python3 application rather than a specific "script" version. To run my_script.py from the command line, you just need to type python3 my_script.py.

This will run the entire Python program from start to finish, and will print any outputs generated to the terminal.
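
To see this end to end on your own machine, the sketch below writes a one-line Python script into a throwaway directory and then runs it. The script contents are made up, and the check for python3 is there only so that the example fails politely on a machine without Python installed.

```shell
# Write a tiny Python script into a throwaway directory.
workdir="$(mktemp -d)"
cat > "$workdir/my_script.py" <<'EOF'
print("Hello from Python")
EOF

# Run the script if python3 is available, capturing what it prints.
if command -v python3 >/dev/null 2>&1; then
    script_output="$(python3 "$workdir/my_script.py")"
    echo "$script_output"
else
    script_output=""
    echo "python3 is not installed on this machine"
fi
```

The same pattern works for R: write the script to a file, then pass the file name to Rscript.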

Python REPL

Whilst you can run Python interactively (by just typing python3) you probably will prefer using IPython. Assuming you have already installed IPython (if you installed Python using Anaconda then you likely already have IPython installed), you can simply type ipython to enter the IPython 3 REPL. You can quit the application by typing quit.

You can also launch the Jupyter Console (which is essentially the same thing, but launched using Jupyter) by typing jupyter console at the command line.

Python packages

Unlike R, Python packages are installed from outside Python, using an application called pip from the shell.

To install packages using pip, you just need to run pip install <packagename>.

For example, to install tensorflow, you would type pip install tensorflow.

Further reading

There is so much to learn about Unix that you'll likely spend your whole career learning about handy tools and neat ways of solving problems. For anything that comes up in the process of using Unix you're probably best served by using Google to search for what you're trying to do, and probably ending up looking at Stack Overflow.

For more structured learning materials you can learn more about Unix and Bash by reading The Unix Workbench by Sean Kross. Julia Evans ( @b0rk ) also publishes handy bite-size comics (Wizard Zines) which help explain complex Unix tools and concepts. Learn Python the Hard Way (one of the related materials from the Python Module) also includes a Command Line Primer which might be helpful.