Exercise: Implementing BASH Practically
In this example, we are going to be using BASH to subset and extract information from a file that is publicly available, Homo_sapiens. We will explore the concepts of extracting information, piping information to a new file, amending this file and finally visualizing the output.
Retrieve some sample data, and unzip it
This data is a GTF (gene transfer format) file. It contains information about the gene structure, mapping all known genes and transcripts to build 37 of the human genome as defined by the Ensemble effort led out of the European Bioinformatics Institute. In this file, each line is a ‘feature’. You can see that an ‘exon’ is a feature, as are ‘gene’, ‘microRNA’, etc. Say we want to narrow the file down to only pull out the columns that contain “gene”. The following code aims to extract this information, along with the position of the gene, and store it in a new file.
The homo_sapiens file is a common reference file. You can learn more about the content here, and this is relevant for our efforts. http://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/README
wget http://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
Let’s unzip it
gunzip Homo_sapiens.GRCh37.75.gtf.gz
Let’s preview it
less
is a program that allows us to browse a text file. Many people actually alias more=less in their .bashrc file. less is nice in that it allows and recognizes many vim
commands such as ‘/search’. To make our app, we will want a comma delimited file that it has ‘feature’,’chr’,’start’,’stop’. There is clearly much more info here.
less Homo_sapiens.GRCh37.75.gtf
Note: to exit less, use the q
quit command.
Let’s pull out a few fields
There are many ways to pull out a column and put it into a file. One program that can do this is ‘cut’, another is ‘awk’. Here we use cut just for the purposes of showing different approaches. We use -f3
which means to cut out the 3rd column presuming space delimiters. To see all of the options of cut, feel free to man cut.
cut -f3 Homo_sapiens.GRCh37.75.gtf | head
Output:
Let’s use awk instead for formatting
As we said, there are many ways to cut fields. awk
( gawk
or nawk
would work too, as they are different versions of awk) is one of the most common linux/unix/bash tools for wrangling data. In the example below, we print the 2nd, 1st, 4th, and 5th columns for those lines where the 3rd column matches gene . By default, it splits by white space, but you can change this if you need to.
awk '$3 ~ /gene/ { print $2,$1,$4,$5}' Homo_sapiens.GRCh37.75.gtf | head
Output:
The $
n refers to the nth field. The /text/ is how we match the lines we want to alter. In this case, if there is a “gene” in the 3rd column, we print some the $2, $1, $4, and $5 field for the file Homo_sapiens.GRCh37.75.gtf
Let’s make it comma delimited.
We could have added commas in awk
. However, in this case, we show piping the results into sed
. sed
is essentially a search and replace tool replacing the space with ,
. Remember that we are piping into head to just get a preview. sed
is based on regular expressions, another term for complex matching and substitution. sed generally works as s/find/replace/g
. It becomes quite amazing how complex these can be. Please go to http://regex101.com to learn about how to make complex matches.
awk '$3 ~ /gene/ { print $2,$1,$4,$5}' Homo_sapiens.GRCh37.75.gtf | head | sed 's/ /,/g'
Redirecting into a file
We actually haven’t written any files yet. We are essentially running programs and then piping to head which prints only the first 5 lines or so. This provides us with a preview of what the multiple commands together are doing. This time lets actually put the information into a file by using the redirect or >
into the file gene_dist.csv
awk '$3 ~ /gene/ { print $2,$1,$4,$5}' Homo_sapiens.GRCh37.75.gtf | sed 's/ /,/g' > gene_dist.csv
Give it a first line header line
For later steps, we really need a first line that tells us what our columns are. By convention, csv files have headers. We have decided to use sed for that as it also allows us to insert lines. Now, to be clear – I didn’t remember that at first, but upon googling, I found out how to do that functionality of sed
. That is an important aspect that one does not need to remember every feature of every command.
sed -e "1ifeature,chr,start,end" gene_dist.csv > gene_dist_head.csv
To get started we will go back to our directory.
ls -l
Ouput:
We see our previous files, their permissions, and their size.
Finally, let’s look at our new file using the less function.
less gene_dist_head.csv
Output:
We can see a header at the top, followed by the data we were looking for!