Skip to content

Learning Through Example

Learning Through Example

Let’s run through an example, using real data, and put the topics discussed in this module to use!

Task:

I have two datasets:

  • My first dataset contains information about the medicare enrollees per county in the US.
  • My second dataset contains information about the incidence of cancer per county in California
  1. We will begin by loading both datasets into our code from the command line.
  2. Using a loop, we will iterate over the lines of our files
  3. We will be creating a lookup dictionary that links both datasets
  4. This dictionary will be used to detect common columns between the two datasets
  5. We will be combining both datasets into a large, singular dataset
  6. Finally, we will plot a graph between two columns in this final dataset

Code:

Viewing the Datasets:

!head med_county.csv
!head cancer_county.csv

Script:

%%file MedicareUsers.py
import sys
import fileinput
import re

my_dict_file= sys.argv[1]
Lookup_County={}
for each_line_of_text in fileinput.input(my_dict_file):
ShortNames = re.findall(r'CA-.*?\sCounty',each_line_of_text, re.I)
LongNames = re.findall(r'CA-(.*?)\sCounty',each_line_of_text, re.I)
if ShortNames:
if LongNames:
Lookup_County[LongNames[0]] = ShortNames[0]
# print ( ShortNames[0] + " : " + LongNames[0])

sample = open('cancermodified.csv', 'w')
input_file_to_change=sys.argv[2]
print('Short Name,Population at Risk,Cases,Crude Rate,Age-adjusted Rate,95% Confidence Interval-Lower Limit,95% Confidence Interval-Upper Limit,County.name', file = sample)
for each_line_of_text in fileinput.input(input_file_to_change):
splitcolumn_array = re.split(',',each_line_of_text.replace( '"' ,'').replace('\n',''))
if splitcolumn_array[0] in Lookup_County:
splitcolumn_array.append(Lookup_County[splitcolumn_array[0]])
res = str(splitcolumn_array)[1:-1]
resmod = res.replace("\'", "")
print(resmod, file = sample)
sample.close()

Running the Program:

!python MedicareUsers.py med_county.csv cancer_county.csv

Loading and Understanding the Data:

The datasets are already available and ready to load. They can be retrieved from the following google drive:

The files to be retrieved are: cancer_county.csv and med-county.csv

To use the files in our script, the files must be saved in the same folder as the script. I created a folder called cancer_med_ example and moved my saved files to this folder. I created a new python notebook within this new folder.

Using the ‘head’ function, we can take a look at the first few lines of each file.

The medicare dataset consist of 11 columns. The first few are identifiers of the county name (e.g. AL-Autauga County) followed by various statistics.

The cancer dataset consists of 7 columns. The first of which is the county name (e.g. Shasta), followed by various statistics.

My goal in this example is to link these two datasets based on the county name.

Step 1: Getting arguments from the command line, loading in functions, and creating some variables

We begin by naming our script %%file MedicareUsers.py

I want to accept the med_county.csv file from the command line then store it in a variable name.

To accept a variable from the command line, I use a library called 'sys'. I’m storing whatever argument file is found in  sys.argv[1] in the variable: my_dict_file.

Step 2:  Loops to the rescue

To iterate over each line of this file, we use a loop. It is useful to use print statements after certain steps to check if they are working in the way they are expected to. In this, we use a module/library called `fileinput`. We use a function to read the input and loop through the body of our file, line by line, and each time assigning it to the  each_line_of_text variable.  Then we do something simple and print each_line_of_text.  Now let’s print it, just looking at the top few lines.

Step 3: Simple regex matching & a conditional

They are given in two different forms so we will need to create a dictionary that can recognize that CA- Shasta County is the same place as Shasta. To create this dictionary, we must first find a way of looking for a match, using regex.

Taking a closer look, we can see the following formats:

In the Medicare Enrollee Dataset:
The county name is in the format: ‘STATE ABBREVIATION-  (Proper Noun Name) County

In the Cancer Incidence Dataset:

The county name is in the format: (Proper Noun Name)

Essentially, I want to pick only the name of the county from the longer format and create a lookup dictionary. As our second dataset is specific to the state of California, I used this to my advantage in the regex search.

Goal:

CA-Butte County should be a key for Butte

CA-Colusa County should be a key for Colusa

CA-Placer County should be a key for Placer

CA-Marin County should be a key for Marin

I import the re module then create a lookup dictionary. As suggested before, the best way to identify the required regular expression is by using the regex101.com resource. Looking deeper into the current statement:

  • r'' : Interpreting the string in its raw form
  • CA- : This is an exact match search. We are looking for a statement that begins with ‘CA-‘
  • .*?  : Any character. ‘.‘, zero or more times ‘*?
  • \s : Followed by a space
  • County : Returns an exact match once again
  • r'CA-(.*?)\sCounty' : Using the parentheses, I am matching the entire statement, but extracting the required group within the parentheses alone

I print out the matches as a confirmation once again!

Note: you have to scroll down a little to find the rows with CA counties present.

Step 4. Storing in a dictionary

Lookup vehicles are incredible uses of dictionaries and associative arrays. Essentially, we want to create a variable called Lookup_County. When given the key of ShortNames, it should yield the LongNames . Printing Lookup_County[Placer] would give, CA-Placer County.

I begin by initializing the dictionary. As we progress through the script, this dictionary will be filled. To create the dictionary we  would use a statement that looks something like this: Lookup_County[LongNames[0]] = ShortNames[0].  We open the dictionary, then assign ShortNames to LongNames. Once we create the dictionary, let’s print it out.

Now, we can detect common columns in the two datasets using this dictionary.

Step 5. Reading a second file, splitting the input into an array

We have created our dictionary, and now we want to read our file that needs to be changed.  We will have this be the second argument we get, and we will assign it to a variable input_file_to_change , then loop through line by line and split up the results into an array based on tabs.  An array is reasonable, particularly in a csv file, and split using regex is a nice way to split up a row, and place the value in each column within the array.

Step 6. Matching elements from new file to the lookup dictionary

Our dictionary is something like a safe. To access its contents, we need to provide a key that the dictionary can recognize. A quick google search can help us find out how to open the safe. We know that the county name in its short form is found in the 0th column of the cancer incidence dataset. We use a conditional statement that says, if the element in the 0th position of this array is found in our lookup dictionary, do the following statements: if splitcolumn_array[0] in Lookup_County:

In this scenario, I want to append my array to have a new column with the longer notation. splitcolumn_array.append(Lookup_County[splitcolumn_array[0]])

Step 6. Modifying and Printing to a second file

I would like to store the output of this script in a new CSV file. Before I do so, I will make some slight modifications to make the contents look like comma-separated values. I remove the square brackets surrounding each row using the statement: res = str(splitcolumn_array)[1:-1]

I add a header to the file as a CSV file needs one:

print('Short Name, Population at Risk, Cases,Crude Rate, Age-adjusted Rate, 95% Confidence Interval-Lower Limit, 95% Confidence Interval-Upper Limit, County.name ', file = sample)

And I remove the single quotes surrounding each element.

resmod = res.replace("\'", "")

To print the output to a second file, we must first open the CSV file. I insert this line before accepting my second file. sample = open('cancermodified.csv', 'w')

  • sample : I must use a variable to assign the opened file to
  • open : I am opening a file
  • 'countydetails.csv' : The first argument is the name of my new file
  • 'w'  : I would like to write to this file. Alternatively, you can open a file and read it using the ‘r’ mode.

Indentation is important in this step. The print statement is indented from the file creation and closing step. The final print(resmod, file = sample) will result in the output being printed to the file cancermodified.csv. We then close the file. The opening and closing statements must be indented to the same amount.

Check your file directory to find a new file called cancermodified.csv This file should contain the appended dataset we created.

Opening this new file we see:

It worked! We created a file, called ‘cancermodified.csv’, that contains an appended column.

Reflection: In this example, we used a lot of the core concepts of python such as associative arrays, arrays, regex, loops, conditionals, functions, and variables. Try using the methods described for other datasets as well to test your understanding. 

Going a Step Further:

Let’s see some ways where we can put our new datasets to use. The concepts described here explore some new libraries that are a little more advanced. I will explain what I am doing in simple terms and focus on the key takeaways of each step.

Loading in the data frames: **need to install pandas to the hub**

Pandas is an extremely useful library. You can easily create data frames, visualize, and manipulate the data using this library. To begin with, I am creating two data frames using the Pandas libraries.

cnmod = pd.read_csv('cancermodified.csv', skipinitialspace=True)
med = pd.read_csv('med_county.csv')

Since our two datasets contain a common column, County.name, we can perform an inner join. An inner join is the process of combing two tables and keeping the rows that they have in common. For example:

Dataset 1: 

S.No Code
1 ABA
2 AAA

Dataset 2: 

S.No UID
1 123
4 342

Inner join: This join is done based on a common parameter between the two sets of data. Only the data that is connected by the common element will be present in the new joined dataset

S.No Code UID
1 ABA 123

Quick Note: there are some other forms of joins that may be useful for to you. E.g. in an outer join, there wouldn’t be any loss of data. Often times, the code will look very similar, except for an extra argument specifying a join type. Here is a link to a webpage about the different types of joins for your reference: Joins

The joined table in this scenario would contain only the details of S.No 1, as it is the common element among the two.

In the cancer-medicare data, I only want to keep the rows that have both sets of information.

joined_data = cnmod.join(med.set_index('County.name'), on='County.name')
joined_data

My joined table looks like this:

Using a plot function, I am plotting a graph between two columns:

joined_data.plot(x='Medicare enrollees (2014)', y='Cases', style='o')