Python Resources
Tools in standard data structures
Conditionals
if 1 > 2:
    message = "1 is greater than 2"
elif 1 > 3:
    message = "1 is greater than 3"
else:
    message = "Neither of these is true"
print(message)
Sorting
Sometimes we need to sort lists.
x = [4, 1, 2, 3]
y = sorted(x)  # returns a new sorted list; x is unchanged
x.sort()       # sorts x in place
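Sorting often needs a custom order. As a small sketch, the built-in `sorted` accepts `key` and `reverse` arguments; the list `words` here is made-up sample data.

```python
words = ["pear", "fig", "banana"]  # hypothetical sample data

# key=len sorts by the length of each string instead of alphabetically
by_length = sorted(words, key=len)
print(by_length)    # ['fig', 'pear', 'banana']

# reverse=True sorts from largest to smallest
descending = sorted(words, reverse=True)
print(descending)   # ['pear', 'fig', 'banana']
```

The same two arguments work with the in-place `list.sort()` method.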
Sets
A set is a type of data that represents a collection of distinct elements. It’s helpful when we need to know whether elements are unique.
s = set()
s.add(1)      # s is now {1}
s.add(2)      # s is now {1, 2}
s.add(2)      # s is still {1, 2}
x = len(s)    # 2
y = 2 in s    # True
z = 3 in s    # False
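One way to use this uniqueness property is to check a list for duplicates. The following is a minimal sketch; the helper name `has_duplicates` and the list `gene_ids` are made up for illustration.

```python
def has_duplicates(items):
    """Return True if any element appears more than once."""
    # A set drops repeated elements, so it is shorter than the list
    # exactly when the list contains a duplicate.
    return len(set(items)) != len(items)

gene_ids = ["KRAS", "PTEN", "APOE", "PTEN"]  # hypothetical sample data
unique_ids = set(gene_ids)

print(has_duplicates(gene_ids))  # True
print(len(unique_ids))           # 3
```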
Ranges
even_numbers = [x for x in range(5) if x % 2 == 0]  # [0, 2, 4]

# Use an underscore if you don't need the values from a list
zeroes = [0 for _ in even_numbers]  # [0, 0, 0]

# We can do this in pairs
pairs = [(x, y) for x in range(10) for y in range(10)]
Regex
Regular expressions let us search for, replace, and split text by pattern.
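import re

print(all([
    not re.match("a", "cat"),                 # "cat" doesn't start with "a"
    re.search("a", "cat"),                    # "cat" has an "a" in it
    not re.search("c", "dog"),                # "dog" doesn't have a "c" in it
    3 == len(re.split("[ab]", "carbs")),      # splits into ['c', 'r', 's']
    "R-D-" == re.sub("[0-9]", "-", "R2D2"),   # replaces digits with dashes
]))  # prints True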
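Beyond yes/no matching, regular expressions can pull structured values out of text. This sketch uses `re.findall` with capture groups; the sample string and its `NAME:value` layout are assumptions made for illustration.

```python
import re

# Made-up sample text for illustration
text = "KRAS:5.3, PTEN:51.3"

# Each parenthesized group becomes one item of the tuple returned per match
pairs = re.findall(r"([A-Z]+):([0-9.]+)", text)
print(pairs)  # [('KRAS', '5.3'), ('PTEN', '51.3')]

# Convert the captured strings into a dictionary of floats
expression = {gene: float(value) for gene, value in pairs}
print(expression["PTEN"])  # 51.3
```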
Classes (Object Oriented)
This creates a new Dog class with attributes (information or data about an object) and methods (things you can do with an object).
class Dog:
    # Class attribute, shared by every Dog
    species = "Canis familiaris"

    def __init__(self, name, age):
        # Instance attributes, set separately for each Dog
        self.name = name
        self.age = age

    def description(self):
        return f"{self.name} is {self.age} years old"

    def speak(self, sound):
        return f"{self.name} says {sound}"

miles = Dog("Miles", 4)
miles.description()  # 'Miles is 4 years old'
Simple functions
def exp(base, power):
    return base ** power
Simple Plotting
People often use the matplotlib library for plotting.
from matplotlib import pyplot as plt

years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]

# create a line chart, years on x-axis, gdp on y-axis
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')

# add a title
plt.title("Nominal GDP")

# add a label to the y-axis
plt.ylabel("Billions of $")
plt.show()
Working with vectors (lists)
JSON
import json
import requests

response = requests.get("https://ghr.nlm.nih.gov/condition/alzheimer-disease?report=json")
disease = json.loads(response.text)  # parse the JSON text into Python objects
json.dumps(disease)                  # serialize the objects back into a JSON string
print(disease['name'])
NumPy
NumPy (short for Numerical Python) provides tools for dense data. It provides efficient storage and data operations as arrays grow, and it forms the core of nearly the entire ecosystem of data science tools in Python.
import numpy as np

np.random.seed(0)  # seed for reproducibility

x1 = np.random.randint(10, size=6)          # One-dimensional array
x2 = np.random.randint(10, size=(3, 4))     # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array
Accessing
x = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
print(x[:5])    # first five elements
print(x[3:4])   # one-element slice: [13]
print(x[::2])   # every second element
print(x[::-2])  # every second element, in reverse
2D Arrays in NumPy
import numpy as np

m = np.array([[1, 2, 3, 4],
              [11, 12, 13, 14],
              [21, 22, 23, 24]])
print(m)
print(m[0, 1])  # row 0, column 1 -> 2
a = m[:, 2:3]   # all rows, column 2, kept as a 3x1 array
print(a)
Reshape
m = np.array([[1, 2, 3, 4],
              [11, 12, 13, 14],
              [21, 22, 23, 24]])
print(m)
a = m.reshape(4, 3)  # the same 12 values laid out as 4 rows of 3
print(a)
Loops are Bad, Use vectors
Python loops over array elements are slow, so use NumPy's universal functions (ufuncs), which operate on whole arrays at once.
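To make the contrast concrete, here is a minimal sketch that computes squares both ways. The array `values` is made-up sample data, and `np.square` stands in for any ufunc.

```python
import numpy as np

values = np.arange(1, 6)  # array holding 1, 2, 3, 4, 5

# Loop version: Python handles one element at a time
squares_loop = np.empty_like(values)
for i in range(len(values)):
    squares_loop[i] = values[i] ** 2

# Vectorized version: the ufunc processes the whole array in one call
squares_vec = np.square(values)

print(squares_vec.tolist())                       # [1, 4, 9, 16, 25]
print(np.array_equal(squares_loop, squares_vec))  # True
```

Both versions give the same result; the vectorized call pushes the loop into compiled code, which is where the speedup on large arrays comes from.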
Aggregations
Operating & Booleans as masks/counts
np.sum(a > 3)  # counts the elements of a that are greater than 3
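A comparison such as `a > 3` produces a Boolean array that can be counted, used as a mask, or combined with aggregations. The sketch below assumes a small made-up array `a`.

```python
import numpy as np

a = np.array([[1, 2, 3, 4],
              [11, 12, 13, 14]])  # hypothetical sample array

mask = a > 3               # Boolean array, True where the condition holds
count = np.sum(mask)       # True counts as 1, so this counts the matches
selected = a[mask]         # keeps only the matching values, as a 1D array
col_sums = a.sum(axis=0)   # aggregate down each column

print(count)               # 5
print(selected.tolist())   # [4, 11, 12, 13, 14]
print(col_sums.tolist())   # [12, 14, 16, 18]
```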
Data Manipulation with Pandas
Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame.
Series
import numpy as np
import pandas as pd

gene_dict = {
    'KRAS': "k-ras",
    'PTEN': "Phosphatase and tensin homolog",
    'APOE': "Apolipoprotein",
    'MTFMT': "Mitochondrial methionyl-tRNA formyltransferase"
}
genenames = pd.Series(gene_dict)
genenames.PTEN  # same as genenames['PTEN']
Dataframes
Rows as ordered numbers
pd.DataFrame([
    {'gene': 'KRAS', 'ExpressionDay1': 5.3, 'ExpressionDay2': 2.3},
    {'gene': 'PTEN', 'ExpressionDay1': 51.3, 'ExpressionDay2': 2.3},
    {'gene': 'APOE', 'ExpressionDay1': 3.3, 'ExpressionDay2': 0},
    {'gene': 'MTFMT', 'ExpressionDay1': 1.3},
])
Rows as indexes
pd.DataFrame(
    [
        {'ExpressionDay1': 5.3, 'ExpressionDay2': 2.3},
        {'ExpressionDay1': 51.3, 'ExpressionDay2': 2.3},
        {'ExpressionDay1': 3.3, 'ExpressionDay2': 0},
        {'ExpressionDay1': 1.3}
    ],
    index=['KRAS', 'PTEN', 'APOE', "MTFMT"]
)
Not a number
Sometimes we don’t have an entry. It’s important to know that missing values may affect the results of some functions.
pd.Series([1, np.nan, 2, None])
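As a short sketch of how missing entries can change results, the example below compares a pandas sum with plain NumPy sums on the same values. The variable name `s` is chosen here for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 2, None])  # None is stored as NaN; dtype is float64

missing = s.isnull().sum()             # number of missing entries
pandas_total = s.sum()                 # pandas skips NaN by default
numpy_total = np.sum(s.to_numpy())     # plain NumPy propagates NaN
nan_safe = np.nansum(s.to_numpy())     # NumPy's NaN-ignoring variant

print(missing)       # 2
print(pandas_total)  # 3.0
print(numpy_total)   # nan
print(nan_safe)      # 3.0
```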
We can find out whether an entry is null and keep only the rows that have a value:
e = pd.DataFrame([
    {'gene': 'KRAS', 'ExpressionDay1': 5.3, 'ExpressionDay2': 2.3},
    {'gene': 'PTEN', 'ExpressionDay1': 51.3, 'ExpressionDay2': 2.3},
    {'gene': 'APOE', 'ExpressionDay1': 3.3, 'ExpressionDay2': 0},
    {'gene': 'MTFMT', 'ExpressionDay1': 1.3},
])
e[e.ExpressionDay2.notnull()]  # drops the MTFMT row, which has no ExpressionDay2
Sorting
e.sort_values(by=['ExpressionDay1'])
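`sort_values` returns a new DataFrame and leaves the original untouched. The sketch below rebuilds the same `e` so it runs on its own, then sorts in descending order and controls where missing values land; the names `top` and `by_day2` are made up for this example.

```python
import pandas as pd

e = pd.DataFrame([
    {'gene': 'KRAS', 'ExpressionDay1': 5.3, 'ExpressionDay2': 2.3},
    {'gene': 'PTEN', 'ExpressionDay1': 51.3, 'ExpressionDay2': 2.3},
    {'gene': 'APOE', 'ExpressionDay1': 3.3, 'ExpressionDay2': 0},
    {'gene': 'MTFMT', 'ExpressionDay1': 1.3},
])

# Highest expression first
top = e.sort_values(by='ExpressionDay1', ascending=False)
print(top['gene'].tolist())  # ['PTEN', 'KRAS', 'APOE', 'MTFMT']

# Missing values go last by default; na_position='first' moves them to the top
by_day2 = e.sort_values(by='ExpressionDay2', na_position='first')
print(by_day2['gene'].iloc[0])  # MTFMT
```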