
# Python Resources

Python Data Science Handbook

## Tools in Standard Data Structures

### Conditionals

```python
if 1 > 2:
    message = "1 is greater than 2"
elif 1 > 3:
    message = "1 is greater than 3"
else:
    message = "Neither of these is true"
print(message)  # prints "Neither of these is true"
```

### Sorting

Sometimes we need to sort lists.

```python
x = [4, 1, 2, 3]
y = sorted(x)  # returns a new sorted list; x is unchanged
x.sort()       # sorts x in place and returns None
```
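Both `sorted` and `list.sort` also accept `reverse` and `key` arguments. A minimal sketch, using sample values invented for illustration:

```python
# Sample data invented for illustration.
x = [4, -1, 2, -3]

# sorted() returns a new list and leaves x untouched.
descending = sorted(x, reverse=True)

# key= sorts by a computed value, here the absolute value.
by_magnitude = sorted(x, key=abs)

print(descending)    # [4, 2, -1, -3]
print(by_magnitude)  # [-1, 2, -3, 4]
print(x)             # [4, -1, 2, -3]
```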

### Sets

A set is a collection of distinct elements. It is helpful when we need to know whether elements are unique.

```python
s = set()
s.add(1)    # s is now {1}
s.add(2)    # s is now {1, 2}
s.add(2)    # s is still {1, 2}
x = len(s)  # 2
y = 2 in s  # True
z = 3 in s  # False
```
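One common use of this is checking a list for duplicates. A small sketch, assuming a made-up list of gene symbols:

```python
# Sample list invented for illustration.
genes = ["KRAS", "PTEN", "KRAS", "APOE", "PTEN"]

# Converting to a set collapses repeated entries.
unique_genes = set(genes)

# If the set is smaller than the list, something was repeated.
has_duplicates = len(unique_genes) < len(genes)

print(len(unique_genes))       # 3
print(has_duplicates)          # True
print("APOE" in unique_genes)  # True
```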

### Ranges

```python
even_numbers = [x for x in range(5) if x % 2 == 0]  # [0, 2, 4]
zeroes = [0 for _ in even_numbers]  # use an underscore if you don't need the values from a list
# We can also build pairs with two loops
pairs = [(x, y)
         for x in range(10)
         for y in range(10)]
```

### Regex

Regular expressions search for text, replace text, and wrangle text.

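```python
import re

print(all([
    not re.match("a", "cat"),                # "cat" doesn't start with "a"
    re.search("a", "cat"),                   # "cat" contains an "a"
    not re.search("c", "dog"),               # "dog" has no "c"
    3 == len(re.split("[ab]", "carbs")),     # splits into ['c', 'r', 's']
    "R-D-" == re.sub("[0-9]", "-", "R2D2"),  # replaces digits with dashes
]))  # prints True
```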
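Beyond matching and replacing, regular expressions can pull pieces out of text with capture groups. A short sketch, with sample strings invented for illustration:

```python
import re

# Sample text invented for illustration.
match = re.search(r"([A-Z]+)(\d+)", "gene TP53 variant")

print(match.group(0))  # TP53 (the whole match)
print(match.group(1))  # TP   (first group: capital letters)
print(match.group(2))  # 53   (second group: digits)

# findall returns every non-overlapping match as a list.
digits = re.findall(r"\d", "R2D2")
print(digits)  # ['2', '2']
```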

### Classes (Object Oriented)

This creates a new `Dog` class with attributes (information or data about an object) and methods (things you can do with an object). `species` is a class attribute shared by every dog, while `name` and `age` are set per instance.

```python
class Dog:
    # Class attribute: shared by every Dog instance
    species = "Canis familiaris"

    def __init__(self, name, age):
        # Instance attributes: set per dog
        self.name = name
        self.age = age

    def description(self):
        return f"{self.name} is {self.age} years old"

    def speak(self, sound):
        return f"{self.name} says {sound}"


miles = Dog("Miles", 4)
print(miles.description())  # Miles is 4 years old
```

### Simple functions

```python
def exp(base, power):
    return base ** power
```

### Simple Plotting

People often use the matplotlib library for plotting.

```python
from matplotlib import pyplot as plt

years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]
# create a line chart, years on x-axis, gdp on y-axis
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')
# add a title
plt.title("Nominal GDP")
# add a label to the y-axis
plt.ylabel("Billions of $")
plt.show()
```

Working with vectors (lists)

### JSON

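```python
import json

import requests

# Fetch the record; the response body is JSON text
response = requests.get("https://ghr.nlm.nih.gov/condition/alzheimer-disease?report=json")
disease = json.loads(response.text)  # JSON string -> Python dict
json.dumps(disease)                  # Python dict -> JSON string (result unused here)
print(disease['name'])
```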
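The same `json.dumps` / `json.loads` round trip can be tried without a network call. This is a minimal sketch; the record below is hypothetical and is not the payload the URL above returns:

```python
import json

# Hypothetical record, invented for illustration.
record = {"name": "example condition", "genes": ["APOE", "PTEN"]}

text = json.dumps(record, indent=2)  # Python object -> JSON string
restored = json.loads(text)          # JSON string -> Python object

print(type(text).__name__)   # str
print(restored["genes"][0])  # APOE
print(restored == record)    # True
```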

# NumPy

NumPy (short for Numerical Python) provides tools for dense data. It offers efficient storage and data operations as arrays grow, and it forms the core of nearly the entire ecosystem of data science tools in Python.

```python
import numpy as np

np.random.seed(0)  # seed for reproducibility
x1 = np.random.randint(10, size=6)  # One-dimensional array
x2 = np.random.randint(10, size=(3, 4))  # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array
```

### Accessing

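```python
x = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
print(x[:5])    # first five elements: [10, 11, 12, 13, 14]
print(x[3:4])   # one-element slice: [13]
print(x[::2])   # every second element: [10, 12, 14, 16, 18]
print(x[::-2])  # reversed, every second element: [19, 17, 15, 13, 11]
```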

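### 2D Arrays in NumPy

```python
m = np.array([[1, 2, 3, 4],
              [11, 12, 13, 14],
              [21, 22, 23, 24]])
print(m)
# m[rows, columns]: all rows, column index 2
# (m[:][2:3] would select row 2 instead, because m[:] is just the whole array)
a = m[:, 2:3]
print(a)  # a column holding 3, 13, 23
```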

### Reshape

```python
m = np.array([[1, 2, 3, 4],
              [11, 12, 13, 14],
              [21, 22, 23, 24]])
print(m)             # shape (3, 4)
a = m.reshape(4, 3)  # same 12 values laid out as 4 rows of 3
print(a)
```
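`reshape` keeps the number of elements fixed, and one dimension can be given as `-1` so NumPy infers it. A small sketch using `np.arange` to generate sample values:

```python
import numpy as np

m = np.arange(12)     # the integers 0 through 11
a = m.reshape(3, 4)   # 3 rows, 4 columns
b = m.reshape(2, -1)  # -1 lets NumPy infer the column count (6)

print(a.shape)  # (3, 4)
print(b.shape)  # (2, 6)
print(a[1, 2])  # 6
```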

### Loops are Bad, Use vectors

Loops are slow, so use ufuncs that operate on vectors.

### Aggregations

### Operating & Booleans as masks/counts

`np.sum(a>3)`
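The ideas in this part (ufuncs instead of loops, aggregations, and boolean masks used for counting) can be sketched together. The array values are invented for illustration:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])

# Loop version: one Python-level operation per element.
squares_loop = np.array([value ** 2 for value in a])

# ufunc version: the same arithmetic applied to the whole array at once.
squares_vec = a ** 2

# Aggregations reduce an array to a single value.
total = a.sum()    # 21
largest = a.max()  # 6

# A comparison yields a boolean mask; True counts as 1 when summed.
mask = a > 3
count_above_3 = np.sum(mask)  # 3
values_above_3 = a[mask]      # the elements 4, 5, 6

print(squares_vec.tolist())     # [1, 4, 9, 16, 25, 36]
print(count_above_3)            # 3
print(values_above_3.tolist())  # [4, 5, 6]
```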

# Data Manipulation with Pandas

Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame.

### Series

```python
import numpy as np
import pandas as pd

gene_dict = {
    'KRAS': "k-ras",
    'PTEN': "Phosphatase and tensin homolog",
    'APOE': "Apolipoprotein",
    'MTFMT': "Mitochondrial methionyl-tRNA formyltransferase"
}
genenames = pd.Series(gene_dict)
genenames.PTEN  # same as genenames['PTEN']
```

### DataFrames

#### Rows as ordered numbers

```python
pd.DataFrame([
    {'gene': 'KRAS', 'ExpressionDay1': 5.3, 'ExpressionDay2': 2.3},
    {'gene': 'PTEN', 'ExpressionDay1': 51.3, 'ExpressionDay2': 2.3},
    {'gene': 'APOE', 'ExpressionDay1': 3.3, 'ExpressionDay2': 0},
    {'gene': 'MTFMT', 'ExpressionDay1': 1.3},
])
```

#### Rows as indexes

```python
pd.DataFrame(
    [
        {'ExpressionDay1': 5.3, 'ExpressionDay2': 2.3},
        {'ExpressionDay1': 51.3, 'ExpressionDay2': 2.3},
        {'ExpressionDay1': 3.3, 'ExpressionDay2': 0},
        {'ExpressionDay1': 1.3}
    ],
    index=['KRAS', 'PTEN', 'APOE', 'MTFMT']
)
```
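With gene names as the index, rows can be looked up by label with `.loc` or by position with `.iloc`. A minimal sketch on a two-row version of the table:

```python
import pandas as pd

df = pd.DataFrame(
    [
        {'ExpressionDay1': 5.3, 'ExpressionDay2': 2.3},
        {'ExpressionDay1': 51.3, 'ExpressionDay2': 2.3},
    ],
    index=['KRAS', 'PTEN']
)

# Label-based lookup: row 'PTEN', column 'ExpressionDay1'.
pten_day1 = df.loc['PTEN', 'ExpressionDay1']

# Position-based lookup: first row, then the column by name.
first_day2 = df.iloc[0]['ExpressionDay2']

print(pten_day1)   # 51.3
print(first_day2)  # 2.3
```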

#### Not a number

Sometimes we don't have an entry. It's important to know that missing values may affect the results of some functions.

`pd.Series([1, np.nan, 2, None])`

We can find out whether something is null:

```python
e = pd.DataFrame([
    {'gene': 'KRAS', 'ExpressionDay1': 5.3, 'ExpressionDay2': 2.3},
    {'gene': 'PTEN', 'ExpressionDay1': 51.3, 'ExpressionDay2': 2.3},
    {'gene': 'APOE', 'ExpressionDay1': 3.3, 'ExpressionDay2': 0},
    {'gene': 'MTFMT', 'ExpressionDay1': 1.3},
])
# keep only the rows where ExpressionDay2 has a value
e[e.ExpressionDay2.notnull()]
```
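To make the effect of missing entries concrete, the sketch below counts them, shows how a pandas sum differs from a plain NumPy sum, and fills or drops them. It reuses the small Series from this section:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 2, None])  # None is stored as NaN

missing = s.isnull().sum()         # number of missing entries
pandas_total = s.sum()             # pandas skips NaN by default
numpy_total = np.sum(s.to_numpy()) # NumPy propagates NaN

print(missing)                # 2
print(pandas_total)           # 3.0
print(np.isnan(numpy_total))  # True
print(s.fillna(0).tolist())   # [1.0, 0.0, 2.0, 0.0]
print(s.dropna().tolist())    # [1.0, 2.0]
```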

### Sorting

`e.sort_values(by=['ExpressionDay1'])`
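`sort_values` returns a new DataFrame and, by default, sorts ascending with missing values last. A short sketch of the `ascending` and `na_position` options, rebuilding the same `e` so it runs on its own:

```python
import pandas as pd

e = pd.DataFrame([
    {'gene': 'KRAS', 'ExpressionDay1': 5.3, 'ExpressionDay2': 2.3},
    {'gene': 'PTEN', 'ExpressionDay1': 51.3, 'ExpressionDay2': 2.3},
    {'gene': 'APOE', 'ExpressionDay1': 3.3, 'ExpressionDay2': 0},
    {'gene': 'MTFMT', 'ExpressionDay1': 1.3},
])

# Highest expression on day 1 first.
ranked = e.sort_values(by='ExpressionDay1', ascending=False)
print(ranked['gene'].tolist())  # ['PTEN', 'KRAS', 'APOE', 'MTFMT']

# Put rows with a missing day-2 value at the top instead of the bottom.
by_day2 = e.sort_values(by='ExpressionDay2', na_position='first')
print(by_day2['gene'].iloc[0])  # MTFMT
```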