Skip to content

Exercise: Implementing BASH Practically

Exercise: Implementing BASH Practically

In this example, we are going to be using BASH to subset and extract information from a file that is publicly available,  Homo_sapiens. We will explore the concepts of extracting information, piping information to a new file, amending this file and finally visualizing the output.

Retrieve some sample data, and unzip it

This data is a GTF (gene transfer format) file. It contains information about the gene structure, mapping all known genes and transcripts to build 37 of the human genome as defined by the Ensemble effort led out of the European Bioinformatics Institute.  In this file, each line is a ‘feature’.  You can see that an ‘exon’ is a feature, as are ‘gene’, ‘microRNA’, etc.  Say we want to narrow the file down to only pull out the columns that contain “gene”. The following code aims to extract this information, along with the position of the gene, and store it in a new file.

The homo_sapiens file is a common reference file. You can learn more about the content here, and this is relevant for our efforts. http://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/README

wget http://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz

Let’s unzip it

gunzip Homo_sapiens.GRCh37.75.gtf.gz

Let’s preview it

less is a program that allows us to browse a text file.  Many people actually alias more=less in their .bashrc file.  less is nice in that it allows and recognizes many vim commands such as ‘/search’.  To make our app, we will want a comma delimited file that it has ‘feature’,’chr’,’start’,’stop’.  There is clearly much more info here.

less Homo_sapiens.GRCh37.75.gtf

Note: to exit less, use the q quit command.

Let’s pull out a few fields

There are many ways to pull out a column and put it into a file.  One program that can do this is ‘cut’, another is ‘awk’.  Here we use cut just for the purposes of showing different approaches.  We use -f3 which means to cut out the 3rd column presuming space delimiters.  To see all of the options of cut, feel free to man cut.

cut -f3 Homo_sapiens.GRCh37.75.gtf | head

Output:

Let’s use awk instead for formatting

As we said, there are many ways to cut fields.  awkgawk or nawk would work too, as they are different versions of awk) is one of the most common linux/unix/bash tools for wrangling data. In the example below, we print the 2nd, 1st, 4th, and 5th columns for those lines where the 3rd column matches gene .  By default, it splits by white space, but you can change this if you need to.

awk '$3 ~ /gene/ { print $2,$1,$4,$5}' Homo_sapiens.GRCh37.75.gtf | head

Output:

The $n refers to the nth field.  The /text/ is how we match the lines we want to alter.  In this case, if there is a “gene” in the 3rd column, we print some the $2, $1, $4, and $5 field for the file Homo_sapiens.GRCh37.75.gtf

Let’s make it comma delimited.

We could have added commas in awk.  However, in this case, we show piping the results into sedsed is essentially a search and replace tool replacing the space with ,.  Remember that we are piping into head to just get a preview.  sed is based on regular expressions, another term for complex matching and substitution.  sed generally works as s/find/replace/g.  It becomes quite amazing how complex these can be.  Please go to http://regex101.com to learn about how to make complex matches.

awk '$3 ~ /gene/ { print $2,$1,$4,$5}' Homo_sapiens.GRCh37.75.gtf | head | sed 's/ /,/g' 

Redirecting into a file

We actually haven’t written any files yet.  We are essentially running programs and then piping to head which prints only the first 5 lines or so.  This provides us with a preview of what the multiple commands together are doing.  This time lets actually put the information into a file by using the redirect or > into the file gene_dist.csv

awk '$3 ~ /gene/ { print $2,$1,$4,$5}' Homo_sapiens.GRCh37.75.gtf | sed 's/ /,/g' > gene_dist.csv

Give it a first line header line

For later steps, we really need a first line that tells us what our columns are. By convention, csv files have headers. We have decided to use sed for that as it also allows us to insert lines.  Now, to be clear – I didn’t remember that at first, but upon googling, I found out how to do that functionality of sed. That is an important aspect that one does not need to remember every feature of every command.

sed -e "1ifeature,chr,start,end" gene_dist.csv > gene_dist_head.csv

To get started we will go back to our directory.

ls -l

Ouput:

We see our previous files, their permissions, and their size.

Finally, let’s look at our new file using the less function.

less gene_dist_head.csv

Output:

We can see a header at the top, followed by the data we were looking for!