
Data Science Tools in Python

Python Resources

Python Data Science Handbook

Standard Python Tools and Data Structures

Conditionals

if 1 > 2:
    message = "1 is greater than 2"
elif 1 > 3:
    message = "1 is greater than 3"
else:
    message = "Neither of these is true"
print(message)  # prints "Neither of these is true"

Sorting

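Sometimes we need to sort lists. sorted() returns a new list, while the sort() method reorders the list in place.

x = [4, 1, 2, 3]
y = sorted(x)  # y is [1, 2, 3, 4]; x is unchanged
x.sort()       # now x is [1, 2, 3, 4]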
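
Both sorted() and sort() also accept key and reverse arguments. A minimal sketch, using made-up values that are not from the example above:

```python
x = [4, -1, 2, -3]  # illustrative values

# sorted() returns a new list and leaves x unchanged
ascending = sorted(x)                    # [-3, -1, 2, 4]
by_magnitude = sorted(x, key=abs)        # [-1, 2, -3, 4]
largest_first = sorted(x, reverse=True)  # [4, 2, -1, -3]

# list.sort() reorders x in place and returns None
x.sort(key=abs, reverse=True)            # x is now [4, -3, 2, -1]
```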

Sets

A set is a collection of distinct elements. It’s helpful when we need to know whether elements are unique or whether a value is present.

s = set()
s.add(1)    # s is now {1}
s.add(2)    # s is now {1, 2}
s.add(2)    # s is still {1, 2}
x = len(s)  # 2
y = 2 in s  # True
z = 3 in s  # False
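
One common use is checking a list for duplicates. A small sketch, where sample_ids is hypothetical data made up for illustration:

```python
sample_ids = ["a1", "b2", "a1", "c3", "b2"]  # hypothetical data

unique_ids = set(sample_ids)  # duplicates collapse into one entry each
has_duplicates = len(unique_ids) != len(sample_ids)

print(len(unique_ids))     # 3
print(has_duplicates)      # True
print("c3" in unique_ids)  # True
```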



Ranges and list comprehensions

even_numbers = [x for x in range(5) if x % 2 == 0]  # [0, 2, 4]
zeroes = [0 for _ in even_numbers]  # use an underscore when you don't need the loop value
# We can also loop over two ranges to build pairs
pairs = [(x, y)
    for x in range(10)
    for y in range(10)]  # 100 pairs, from (0, 0) to (9, 9)

Regex

Regular expressions search for text, replace text, and wrangle text.

import re
print(all([
    not re.match("a", "cat"),
    re.search("a", "cat"),
    not re.search("c", "dog"),
    3 == len(re.split("[ab]", "carbs")),
    "R-D-" == re.sub("[0-9]", "-", "R2D2") 
])) # prints True
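
The first two checks above depend on the difference between re.match, which only matches at the start of the string, and re.search, which scans the whole string. A short sketch of that difference (the variable names are just for illustration):

```python
import re

# re.match only succeeds if the pattern matches at the start of the string
starts_with_a = re.match("a", "cat")  # None: "cat" does not start with "a"
starts_with_c = re.match("c", "cat")  # a match object

# re.search looks for the pattern anywhere in the string
found = re.search("a", "cat")
print(found.start())                  # 1

# re.split cuts the string at every match of the pattern
print(re.split("[ab]", "carbs"))      # ['c', 'r', 's']
```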

Classes (Object Oriented)

This creates a new Dog class with a class attribute (species, shared by every dog), instance attributes (name and age, data stored on each object), and methods (things an instance can do).

class Dog: 
    species = "Canis familiaris"
    
    def __init__(self, name, age): 
        self.name = name
        self.age = age
    def description(self):
        return f"{self.name} is {self.age} years old"
    def speak(self, sound):
        return f"{self.name} says {sound}"

miles = Dog("Miles", 4)
miles.description()
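
To see the difference between class and instance attributes, here is a sketch that repeats the Dog class so it runs on its own. The second dog, buddy, is made up for illustration:

```python
class Dog:
    species = "Canis familiaris"  # class attribute, shared by all instances

    def __init__(self, name, age):
        self.name = name          # instance attributes, one set per dog
        self.age = age

    def description(self):
        return f"{self.name} is {self.age} years old"

    def speak(self, sound):
        return f"{self.name} says {sound}"

miles = Dog("Miles", 4)
buddy = Dog("Buddy", 9)           # hypothetical second instance

print(miles.description())             # Miles is 4 years old
print(buddy.speak("Woof"))             # Buddy says Woof
print(miles.species == buddy.species)  # True
```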

Simple functions

def exp(base, power): 
    return base ** power

Simple Plotting

People often use matplotlib for simple plots.

from matplotlib import pyplot as plt
years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]
# create a line chart, years on x-axis, gdp on y-axis
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')
# add a title
plt.title("Nominal GDP")
# add a label to the y-axis
plt.ylabel("Billions of $")
plt.show()

Working with vectors (lists)

JSON

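import json
import requests
# fetch the report; the server is expected to answer with JSON text
response = requests.get("https://ghr.nlm.nih.gov/condition/alzheimer-disease?report=json")
response.raise_for_status()          # stop here if the request failed
disease = json.loads(response.text)  # JSON text -> Python dict
as_text = json.dumps(disease)        # Python dict -> JSON text
print(disease['name'])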

NumPy

NumPy (short for Numerical Python) provides tools for working with dense data. Its arrays offer efficient storage and fast operations as the data grows, and they form the core of nearly the entire ecosystem of data science tools in Python.

import numpy as np
np.random.seed(0) # seed for reproducibility
x1 = np.random.randint(10, size=6) # One-dimensional array
x2 = np.random.randint(10, size=(3, 4)) # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array

Accessing

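x = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
print(x[:5])    # first five elements: [10, 11, 12, 13, 14]
print(x[3:4])   # from index 3 up to, but not including, index 4: [13]
print(x[::2])   # every second element: [10, 12, 14, 16, 18]
print(x[::-2])  # every second element, starting from the end: [19, 17, 15, 13, 11]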

2D Arrays in NumPy

m = np.array([[1, 2, 3, 4],
              [11, 12, 13, 14],
              [21, 22, 23, 24]])
print(m)
print(m[0, 1])  # row 0, column 1 -> 2
a = m[:, 2:3]   # every row, column 2 only (kept as a 3x1 column)
print(a)

Reshape

m = np.array([[1, 2, 3, 4],
              [11, 12, 13, 14],
              [21, 22, 23, 24]])
print(m)             # shape (3, 4)
a = m.reshape(4, 3)  # the same 12 values laid out as 4 rows of 3
print(a)

Loops are Bad, Use vectors

Python loops over array elements are slow, so use ufuncs (universal functions), which apply an operation to a whole array at once.
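
A minimal sketch of the idea, computing reciprocals both ways. The array values and variable names are made up for illustration:

```python
import numpy as np

values = np.arange(1, 6)  # array([1, 2, 3, 4, 5])

# Loop version: one Python-level operation per element
looped = np.empty(len(values))
for i in range(len(values)):
    looped[i] = 1.0 / values[i]

# Vectorized version: the division ufunc runs over the whole array at once
vectorized = 1.0 / values

print(np.allclose(looped, vectorized))  # True
```

Both versions give the same numbers; on large arrays the vectorized form is typically much faster because the loop runs in compiled code rather than in Python.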

Aggregations

Operating & Booleans as masks/counts

np.sum(a > 3)  # a > 3 is a Boolean array; summing it counts the True entries
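
The same Boolean array can be used for counting and as a mask that selects elements. A sketch, with the array redefined here so the block runs on its own:

```python
import numpy as np

a = np.array([[1, 2, 3, 4],
              [11, 12, 13, 14],
              [21, 22, 23, 24]])

mask = a > 3                 # Boolean array with the same shape as a
print(np.sum(mask))          # 9: True counts as 1
print(np.sum(mask, axis=1))  # [1 4 4]: count per row
print(a[mask])               # 1-D array of the values greater than 3
print(np.any(a > 23), np.all(a > 0))  # True True
```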

Data Manipulation with Pandas

Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame.

Series

import numpy as np 
import pandas as pd
gene_dict = {
    'KRAS': "k-ras",
    'PTEN': "Phosphatase and tensin homolog",
    'APOE': "Apolipoprotein",
    'MTFMT': "Mitochondrial methionyl-tRNA formyltransferase"
}
genenames = pd.Series(gene_dict)
genenames.PTEN  # attribute-style access, same as genenames['PTEN']

Dataframes

Rows numbered in order (default index)

pd.DataFrame([
    {'gene':'KRAS','ExpressionDay1':5.3,'ExpressionDay2':2.3},
    {'gene':'PTEN','ExpressionDay1':51.3,'ExpressionDay2':2.3},
    {'gene':'APOE','ExpressionDay1':3.3,'ExpressionDay2':0},
    {'gene':'MTFMT','ExpressionDay1':1.3},
])

Rows labeled by an index

pd.DataFrame(
    [
        {'ExpressionDay1':5.3,'ExpressionDay2':2.3},
        {'ExpressionDay1':51.3,'ExpressionDay2':2.3},
        {'ExpressionDay1':3.3,'ExpressionDay2':0},
        {'ExpressionDay1':1.3}
    ],
    index=['KRAS','PTEN','APOE',"MTFMT"]
)
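
With labels as the index, rows can be looked up by name. A sketch using the same data, where the variable name expression is chosen here for illustration:

```python
import pandas as pd

expression = pd.DataFrame(
    [
        {'ExpressionDay1': 5.3, 'ExpressionDay2': 2.3},
        {'ExpressionDay1': 51.3, 'ExpressionDay2': 2.3},
        {'ExpressionDay1': 3.3, 'ExpressionDay2': 0},
        {'ExpressionDay1': 1.3},
    ],
    index=['KRAS', 'PTEN', 'APOE', 'MTFMT'],
)

print(expression.loc['PTEN'])                    # one row, selected by label
print(expression.loc['APOE', 'ExpressionDay1'])  # 3.3
print(expression.iloc[0])                        # first row, selected by position
```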

Not a number

Sometimes we don’t have an entry for a value. pandas marks the gap as NaN (not a number), and it’s important to know that this may change the result of some functions.

pd.Series([1, np.nan, 2, None])
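
A short sketch of how missing entries can affect results, using the same series. The variable name values is chosen here for illustration:

```python
import numpy as np
import pandas as pd

values = pd.Series([1, np.nan, 2, None])  # None is stored as NaN

print(values.isnull().sum())      # 2 missing entries
print(values.sum())               # 3.0: pandas skips NaN by default
print(values.count())             # 2: only non-missing entries are counted
print(np.sum(values.to_numpy()))  # nan: a plain NumPy sum propagates NaN
print(values.fillna(0).tolist())  # [1.0, 0.0, 2.0, 0.0]
```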

We can test for null values, for example to keep only the rows that have a Day 2 measurement.

e=pd.DataFrame([
    {'gene':'KRAS','ExpressionDay1':5.3,'ExpressionDay2':2.3},
    {'gene':'PTEN','ExpressionDay1':51.3,'ExpressionDay2':2.3},
    {'gene':'APOE','ExpressionDay1':3.3,'ExpressionDay2':0},
    {'gene':'MTFMT','ExpressionDay1':1.3},
])
e[e.ExpressionDay2.notnull()]

Sorting

e.sort_values(by=['ExpressionDay1'])  # ascending by default; returns a new DataFrame
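
sort_values also takes ascending and na_position arguments. A sketch with the same data, repeated so the block runs on its own:

```python
import pandas as pd

e = pd.DataFrame([
    {'gene': 'KRAS', 'ExpressionDay1': 5.3, 'ExpressionDay2': 2.3},
    {'gene': 'PTEN', 'ExpressionDay1': 51.3, 'ExpressionDay2': 2.3},
    {'gene': 'APOE', 'ExpressionDay1': 3.3, 'ExpressionDay2': 0},
    {'gene': 'MTFMT', 'ExpressionDay1': 1.3},
])

# Highest Day 1 expression first
by_day1_desc = e.sort_values(by='ExpressionDay1', ascending=False)
print(by_day1_desc['gene'].tolist())  # ['PTEN', 'KRAS', 'APOE', 'MTFMT']

# Rows with a missing sort key go last by default
by_day2 = e.sort_values(by='ExpressionDay2')
print(by_day2['gene'].iloc[-1])       # MTFMT

# Passing na_position='first' puts them at the top instead
by_day2_na_first = e.sort_values(by='ExpressionDay2', na_position='first')
print(by_day2_na_first['gene'].iloc[0])  # MTFMT
```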