Fundamentals (Week 1)

1 Fundamentals (Week 1)
2 Data manipulation with Pandas (Week 2)
3 Building Programs (Week 3)
4 Visualization with Matplotlib and Seaborn (Week 4)
5 Special Topics
6 Endnotes

Fundamentals (Week 1)

Orientation

What programming language should I use?

Use the language that your friends use (so you can ask them for help)
Use a language that has a community of practice for your desired use case (you can find documentation, bug reports, sample code, etc.)
Use a language that is "best" by some technical definition

Python is pretty good at lots of things

"Glue" language intended to replace shell and Perl
Concise, readable, good for rapid prototyping
Access to linear algebra libraries in FORTRAN/C → user-friendly numeric computing
General purpose, not just an academic language; we will spend more time on some of the general purpose aspects.

Literate programming and notebooks

Blend code, documentation, and visualization
Good for trying things, demos
Bad for massive or long-running processes
You can export notebooks as .py files when they outgrow the notebook format

Jupyter commands

How to start Jupyter Lab

Method 1
1. Open Anaconda Navigator
2. Run Jupyter Lab
Method 2
1. Open Terminal (macOS/Linux) or Anaconda Prompt (Windows)
2. Change to your working directory and start Jupyter Lab
```
cd Desktop/data
jupyter lab
```

Navigation

Navigate to where you want to be before creating new notebook
Rename your notebook to something informative
Use drag-and-drop interface to move .ipynb file to new location

Writing code

Execute cell with CTRL-Enter
```
3 + 7
```
Execute cell and move to new cell with Shift-Enter
```
# This is a comment
print("hello")
```
Cells can be formatted as Code or Markdown
Many keyboard shortcuts are available; see https://gist.github.com/discdiver/9e00618756d120a8c9fa344ac1c375ac
(Optional) Jupyter Lab understands (some) terminal commands
```
ls
```
(Optional) Jupyter Lab (IPython, actually) has "magic" commands that start with % (line) or %% (cell)
```
# Print current items in memory
%who

# Get environment variables
%env

# Cell magic: Run bash in a subprocess
%%bash

# Cell magic: Time cell execution
%%time
```
- A cell magic (%%) must be on the first line of the cell; nothing, not even a comment, may come before it
- Some commands are not available on Windows (e.g. %%bash)

Variables and Assignment

Use variables to store values

Variables are names for values.

first_name = 'Derek'
age = 42

Rules for naming things

Can only contain letters, digits, and underscore
Cannot start with a digit
Are case sensitive: age, Age and AGE are three different variables
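
As a quick check of these rules, Python strings have an isidentifier() method that reports whether a piece of text is a syntactically valid name. The candidate names below are made up for illustration; this is a sketch for exploring the rules, not something you need in everyday code.

```python
ok_plain = "first_name".isidentifier()     # letters and an underscore
ok_digit = "age2".isidentifier()           # digits are fine after the first character
bad_start = "2nd_place".isidentifier()     # starts with a digit
bad_char = "last-name".isidentifier()      # a hyphen is not allowed

print(ok_plain, ok_digit, bad_start, bad_char)

# Case sensitivity: these two names refer to different values
score = 1
Score = 2
print(score, Score)
```

Note that isidentifier() also returns True for reserved words such as for, which still cannot be used as variable names.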

Use `print()` to display values

print(first_name, 'is', age, 'years old')

Functions are verbs
Functions end in ()
Functions take arguments (i.e. they do stuff with the values that you give them)
print() useful for tracking progress, debugging

Jupyter Lab will always echo the last value in a cell

Python will evaluate and echo the last item
```
first_name
age
```
If you want to see multiple items, you should explicitly print them
```
print(first_name)
print(age)
```

(Optional) Variables must be created before they are used

# Prints an informative error message; more about this later
print(last_name)

Variables can be used in calculations

print(age)
age = age + 3
print(age)

Challenge: Variables only change value when something is assigned to them

Order of operations matters!

first = 1
second = 5 * first
first = 2

# What will this print?
print('first:', first)
print('second:', second)

Data Types and Type Conversion

Every value has a type

Most data is text and numbers, but there are many other types.

Integers: whole numbers (counting)
Floats: real numbers (math)
Strings: text
Files
Various collections (lists, sets, dictionaries, data frames, arrays)
More abstract stuff (e.g., database connection)

The type determines what operations you can perform with a given value

Example 1: Subtraction makes sense for some kinds of data but not others
```
print(5 - 3)
print('hello' - 'h')
```
Example 2: Some things have length and some don't. Note that we can put functions inside other functions!
```
print(len('hello'))
print(len(5))
```

Use the built-in function `type()` to find the type of a value

Two types of number
```
print(type(53))
print(type(3.12))
```
You can check the type of a variable
```
fitness = 'average'
type(fitness)
```
Python is strongly-typed: It will (mostly) refuse to convert things automatically. The exception is mathematical operations with integers and floats.
```
int_sum = 3 + 4
mixed_sum = 3 + 4.0

print(type(int_sum))
print(type(mixed_sum))
```

You can explicitly convert data to a different type

Can't do math with text
```
1 + '2'
```
If you have string data, you can explicitly convert it to numeric data…
```
print(1 + float('2'))
print(1 + int('2'))
```

…and vice-versa

text = str(3)

print(text)
print(type(text))

What's going on under the hood?
1. int, float, and str are types. More precisely, they are classes.
2. int(), float(), and str() are functions that create new instances of their respective classes. The argument to the creation function (e.g., '2') is the raw material for creating the new instance.
This can work for more complex data types as well, e.g. Pandas data frames and Numpy arrays.
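
A minimal sketch of what "types are classes" means in practice. The variable names are illustrative, and list() stands in here for the more complex constructors mentioned above.

```python
# The converted value is an instance of the class that was called
number = int('2')
print(type(number))
print(isinstance(number, int))

# The classes themselves are objects too; their type is `type`
print(type(int))

# The same pattern works for collections: list() builds a list from a string
letters = list('abc')
print(letters)
```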

Challenge: Explain what each operator does

# Floor division
print('5 // 3:', 5 // 3)

# Floating-point division
print('5 / 3:', 5 / 3)

# Modulus (remainder)
print('5 % 3:', 5 % 3)

Built-in Functions and Help

How do we find out what's possible?

Python.org tutorial
Standard library reference (we will discuss libraries in the next section)
References section of this document
Stack Overflow

(Optional) Use comments to add documentation to programs

Leave notes for Future You about what you've learned and how your code works.

# This line isn't executed by Python
print("This cell has many comments")   # The rest of this line isn't executed either

A function may take zero or more arguments

print('before')
print()
print('after')

Functions can have optional arguments

# By default, we round to the nearest integer
round(3.712)

# You can optionally specify the number of decimal places
round(3.712, 1)

Use the built-in function `help()` to get help for a function

View the documentation for round()
```
help(round)
```
- 1 mandatory argument
- 1 optional argument with a default value: ndigits=None

You can provide arguments implicitly by order, or explicitly in any order

# Name the optional argument explicitly
round(4.712823, ndigits=2)

Every function returns something

Collect the results of a function in a new variable. This is one of the ways we build complex programs.

# Capture the return value in a variable
rounded_num = round(4.712823, ndigits=2)
print(rounded_num)

result = len("hello")
print(result)

(Optional) Some functions only have "side effects"; they return None
```
result = print("hello")
print(result)
# print(type(result))
```

(Optional) Functions will typically generalize in sensible ways

max() and min() do the intuitively correct thing with numerical and text data

print(max(1, 2, 3))
print(min('a', 'A', '0'))       # sort order is 0-9, A-Z, a-z

Mixed numbers and text aren't meaningfully comparable
```
max(1, 'a')
```

(Optional) Python produces informative error messages

Python reports a syntax error when it can’t understand the source of a program
```
name = 'Bob
age = = 54
print("Hello world"
```
Python reports a runtime error when something goes wrong while a program is executing
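
For contrast with the syntax errors above, here is a small sketch of a runtime error. Every line is valid syntax, so Python only notices the problem when the division is attempted. The try/except wrapper is an addition used here so that the cell can finish; the variable names are illustrative.

```python
# Each line here is valid syntax, so Python starts running the cell...
numerator = 10
denominator = 0

try:
    result = numerator / denominator
except ZeroDivisionError as error:
    # ...and only discovers the problem when the division is attempted
    message = str(error)
    print("Runtime error:", message)
```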

(Optional) Beginner Challenge: What happens when?

Explain in simple terms the order of operations in the following program: when does the addition happen, when does the subtraction happen, when is each function called, etc. What is the final value of radiance?

radiance = 1.0
radiance = max(2.1, 2.0 + min(radiance, 1.1 * radiance - 0.5))

Libraries

Most of the power of a programming language is in its libraries

https://docs.python.org/3/library/index.html

A program must `import` a library module before using it

import math

print(math.pi)
print(math.cos(math.pi))

Refer to things from the module as module-name.thing-name
Python uses "." to mean "part of" or "belongs to".

Use `help()` to learn about the contents of a library module

help(math)                      # user friendly

dir(math)                       # brief reminder, not user friendly

(Optional) Import shortcuts

Import specific items from a library module. You want to be careful with this. It's safer to keep the namespace.
```
from math import cos, pi

cos(pi)
```
Create an alias for a library module when importing it
```
import math as m

print(m.cos(m.pi))
```

Python has opinions about how to write your programs

import this

Lists

Lists are the central data structure in Python; we will explain many things by making analogies to lists.

A list stores many values in a single structure

fruits = ["apple", "banana", "cherry", "date", "elderberry", "fig"]
print(fruits)
print(len(fruits))

Lists are indexed by position, counting from 0

print("First item:", fruits[0])
print("Fifth item:", fruits[4])

You can get a subset of the list by slicing it

You slice a list from the start position up to, but not including, the stop position
```
print(fruits[0:3])
print(fruits[2:5])
```

You can omit the start position if you're starting at the beginning…

# Two ways to get the first 5 items
print(fruits[0:5])
print(fruits[:5])

…and you must omit the end position if you're going to the end (otherwise it's up to, but not including, the end!). This is useful if you don't know how long the list is:
```
# Everything but the first 3 items
print(fruits[3:])
```

You can add an optional step interval (every 2nd item, every 3rd item, etc.)

# First 5 items, every other item
print(fruits[0:5:2])

# Every third item
print(fruits[::3])

(Optional) Why are lists indexed from 0?

cf. https://stackoverflow.com/a/11364711

Slice endpoints are complements. In both cases, the number you see represents what you want to do.

# Get the first two items
print(fruits[:2])

# Get everything except the first two items
print(fruits[2:])

For non-negative indices, the length of a slice is the difference of the indices
```
len(fruits[1:3]) == 2
```

Challenge: Some other properties of indexes

Try these statements. What are they doing? Can you explain the differences in their behavior?

print(fruits[-1])
print(fruits[20])
print(fruits[-3:])

Solution

You can count backwards from the end with negative integers
Indexing beyond the end of the collection is an error
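
The third statement, fruits[-3:], combines both ideas: slices accept negative positions, and slices are more forgiving than single indexes. A short sketch, using a stand-in list named demo so that fruits is left untouched:

```python
demo = ["apple", "banana", "cherry", "date"]   # stand-in for the fruits list

print(demo[-1])      # last item
print(demo[-3:])     # last three items
print(demo[2:20])    # a slice stops quietly at the end of the list
print(demo[20:])     # a slice entirely past the end is just empty

try:
    demo[20]         # a single index past the end is an error
except IndexError as error:
    print("IndexError:", error)
```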

Lists are mutable

You can replace a value at a specific index location
```
fruits[0] = "apricot"
print(fruits)
```
Add an item to list with append(). This is a method of the list (more on this later!).
```
fruits.append("grape")
print(fruits)
```

Add the items from one list to another with extend()

more_fruits = ["honeydew", "imbe", "jackfruit"]

# Add all of the elements of more_fruits to fruits
fruits.extend(more_fruits)
print(fruits)

Many functions take collections as arguments

receiving_yards = [450, 370, 870, 150]
mean_yards = sum(receiving_yards)/len(receiving_yards)
print(mean_yards)

(Optional) Removing items from a list

Use del to remove an item at an index location

print(more_fruits)
del more_fruits[1]
print(more_fruits)

Use pop() to remove the last item and assign it to a variable. This is useful for destructive iteration.
```
f = fruits.pop()

print('Last fruit in list:', f)
print(fruits)
```

Lists can contain anything

You can mix data types

ages = ['Derek', 42, 'Bill', 24, 'Susan', 37]

# Get first pair
print(ages[0:2])

# Get all the names
print(ages[::2])

# Get all the ages
print(ages[1::2])

You can put lists inside other lists

ages.append(more_fruits)

# List in our list
print(ages)

# The last item is a list
print(ages[-1])

# Get an item from that list
print(ages[-1][0])

(Optional) Challenge: Reversing a list

Create a new list that contains all of the items from fruits in the reverse order.

Solution

rev_fruits = fruits[len(fruits)-1::-1]
print(rev_fruits)
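
Two other ways to get the same result, sketched with a stand-in list so that fruits is left untouched. Both create a new list and leave the original in its original order.

```python
demo = ["apple", "banana", "cherry"]   # stand-in for the fruits list

rev_slice = demo[::-1]               # omitted endpoints with a step of -1
rev_builtin = list(reversed(demo))   # reversed() yields items lazily; list() collects them

print(rev_slice)
print(rev_builtin)
print(demo)                          # the original list is unchanged
```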

For Loops

A `for` loop executes commands once for each value in a collection

"For each thing in this group, do these operations"

for fruit in fruits:
    print(fruit)

A for loop is made up of a collection, a loop variable, and a body
The collection, fruits, is what the loop is being run on.
The loop variable, fruit, is what changes for each iteration of the loop (i.e. the “current thing”)
The body, print(fruit), specifies what to do for each value in the collection.

The first line of the `for` loop must end with a colon, and the body must be indented

Whitespace is syntactically meaningful in Python!

for fruit in fruits:
print(fruit)

Loop variables can be called anything

for bob in fruits:
    print(bob)

The body of a loop can contain many statements

primes = [2, 3, 5]
for p in primes:
    squared = p ** 2
    cubed = p ** 3
    print(p, squared, cubed)

Create a new collection from an existing collection

We will learn how to vectorize this when we get to Numpy and Pandas

prime_exponents = []
for p in primes:
    prime_exponents.append(p**2)

print(prime_exponents)

Challenge: Accumulation

Get the total length of all the words in the fruits list.

Solution

total = 0
for f in fruits:
    total = total + len(f)

print(total)

(Optional) Use `range()` to iterate over a sequence of numbers

for number in range(0, 3):
    print(number)

range() produces numbers on demand instead of building a full list in memory (a lazy sequence, similar in spirit to a generator)
useful for tracking progress
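
A short sketch of what "on demand" means in practice: range() returns a range object that computes each number only when asked. The variable names are illustrative.

```python
numbers = range(0, 3)

print(numbers)          # shows the range object, not the numbers themselves
print(list(numbers))    # list() asks for all of the numbers at once
print(len(numbers))

# A very large range costs almost nothing until you iterate over it
big = range(10**12)
print(len(big))
print(big[-1])
```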

(Optional) Use `enumerate()` to iterate over a sequence of items and their positions

for number, fruit in enumerate(fruits):
    print(number, ":", fruit)

(Optional) How do you know if an object is iterable?

Lists, dictionaries, and strings are iterable
```
hasattr(fruits, "__iter__")
```
Integers are not iterable
```
hasattr(5, "__iter__")
```

Don't use `for` loops with DataFrames or Numpy matrices

There is almost always a faster vectorized function that does what you want.
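
A minimal sketch of the difference, assuming NumPy is installed. The array is tiny and made up, so it only illustrates the two styles; it is not a speed measurement.

```python
import numpy as np

values = np.arange(5)

# Loop version: visit one element at a time
squares_loop = []
for v in values:
    squares_loop.append(int(v) ** 2)

# Vectorized version: one expression applied to the whole array
squares_vec = values ** 2

print(squares_loop)
print(squares_vec)
```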

Strings and methods

Strings are (kind of) like lists

Strings are indexed like lists

# Use an index to get a single character from a string
fruit = "gooseberry"
print(fruit[0])
print(fruit[0:3])
print(fruit[-1])

Strings have length
```
len(fruit)
```

But! Strings are immutable

Can't change a string in place
```
fruit[0] = 'G'
```

Solution: String methods create a new string

fruit_title = fruit.capitalize()
print(fruit_title)

Methods are functions that belong to objects

An object packages data together with functions that operate on that data. This is a very common organizational strategy in Python.
```
sentence = "Hello world!"

# Call the swapcase method on the sentence object
print(sentence.swapcase())
```

You can chain methods into processing pipelines

print(sentence.isupper())          # Check whether all letters are uppercase
print(sentence.upper())            # Capitalize all the letters

# The output of upper() is a string; you can use more string methods on it
sentence.upper().isupper()

You can view an object's attributes (i.e. methods and fields) using help() or dir(). Some attributes are "private"; you're not supposed to use these directly.
```
# More verbose help
help(str)
```
```
# The short, short version
dir(sentence)
```
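
The output of dir() is long because it includes the "private" names. One way to shorten it, sketched here with a loop and an if test (conditionals are not otherwise covered in this section), is to keep only the names that do not start with an underscore:

```python
public_names = []
for name in dir(str):
    # Names that start with an underscore are internal by convention
    if not name.startswith("_"):
        public_names.append(name)

print(len(public_names))
print(public_names[:5])
```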

Use the built-in string methods to clean up data

bad_string_1 = "  Hello world!   "
bad_string_2 = "|...goodbye cruel world|"

print(bad_string_1.strip(),
      bad_string_2.strip("|"))

Building longer strings with `.join()`

Use .join() to concatenate strings

date_list = ["3", "17", "2007"]
date = "/".join(date_list)
print(date)

This is going to be useful for building CSV files

date_list = ["3", "17", "2007"]
date = ",".join(date_list)
print(date)
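
A sketch of how this scales up to a whole table: join the fields of each row with commas, then join the rows with newline characters. The rows below are hypothetical.

```python
# Hypothetical rows; each row is a list of strings
rows = [["month", "day", "year"],
        ["3", "17", "2007"],
        ["11", "2", "2011"]]

lines = []
for row in rows:
    lines.append(",".join(row))   # one comma-separated line per row

csv_text = "\n".join(lines)       # join the lines with newline characters
print(csv_text)
```

This only works when every item is already a string and no field contains a comma; for anything more complicated, the standard library's csv module is the safer choice.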

Challenge: Putting it all together

You want to iterate through the fruits list in a random order. For each randomly-selected fruit, capitalize the fruit and print it.

Which standard library module could help you? https://docs.python.org/3/library/
Which function would you select from that module? Are there alternatives?
Try to write a program that uses the function.

Solution 1 (shuffle)

import random

random.shuffle(fruits)

for f in fruits:
    print(f.title())

Solution 2 (sample)

random_fruits = random.sample(fruits, len(fruits))

for f in random_fruits:
    print(f.title())

(Optional) Beginner Challenge: From Strings to Lists and Back

Given this Python code…

print('string to list:', list('tin'))
print('list to string:', ''.join(['g', 'o', 'l', 'd']))

What does list('some string') do?
What does '-'.join(['x', 'y', 'z']) generate?

(Optional) Dictionaries

Dictionaries are sets of key/value pairs. Instead of being indexed by position, they are indexed by key.

ages = {'Derek': 42,
        'Bill': 24,
        'Susan': 37}

ages["Derek"]

Update dictionaries by assigning a key/value pair

Update a pre-existing key with a new value
```
ages["Derek"] = 44

print(ages)
```
Add a new key/value pair
```
ages["Beth"] = 19
print(ages)
```

(Optional) Check whether the dictionary contains an item

Does a key already exist?
```
"Derek" in ages
```
Does a value already exist (you generally don't want to do this; keys are unique but values are not)?
```
24 in ages.values()
```

(Optional) Delete an item using `del` or `pop()`

print("Original dictionary", ages)
del ages["Derek"]
print("1st deletion", ages)

susan_age = ages.pop("Susan")
print("2nd deletion", ages)
print("Returned value", susan_age)

Dictionaries are the natural way to store tree-structured data

As with lists, you can put anything in a dictionary.

location = {'latitude': [37.28306, 'N'],
            'longitude': [-120.50778, 'W']}

print(location['longitude'][0])

Dictionary iteration

Iterate over key: value pairs

for key, val in ages.items():
    print(key, ":", val)

You can iterate over keys and values separately

# Iterate over keys; you can also explicitly call .keys()
for key in ages:
    print(key)

# Iterate over values
for val in ages.values():
    print(val)

Iteration can be useful for unpacking complex dictionaries

for key, val in location.items():
    print(key, 'is', val[0], val[1])

(Optional) Advanced Challenge: Convert a list to a dictionary

How can you convert our list of names and ages into a dictionary? Hint: You will need to populate the dictionary with a list of keys and a list of values.

# Starting data
ages = ['Derek', 42, 'Bill', 24, 'Susan', 37]

# Get dictionary help
help({})

Solution

ages_dict = dict(zip(ages[::2], ages[1::2]))
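
The one-line solution packs three steps together. Unpacked, with illustrative variable names so that ages is left untouched, it looks roughly like this:

```python
pairs_list = ['Derek', 42, 'Bill', 24, 'Susan', 37]

names = pairs_list[::2]      # every other item, starting at position 0
numbers = pairs_list[1::2]   # every other item, starting at position 1

paired = list(zip(names, numbers))   # zip() pairs items by position
print(paired)

lookup = dict(paired)                # dict() accepts a sequence of (key, value) pairs
print(lookup)
```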

(Optional) Other containers

Tuples
Sets

Data manipulation with Pandas (Week 2)

(Optional) Review collections

Lists and dictionaries

Reference item by index/key
Insert item by index/key
Indices/keys must be unique

Strings

Similar to lists: Reference item by index, have length
Immutable, so need to use string methods
'/'.join() is a very useful method

A very brief introduction to NumPy

Introductory documentation: https://numpy.org/doc/stable/user/quickstart.html

NumPy is the linear algebra library for Python

import numpy as np

# Create an array of random numbers
m_rand = np.random.rand(3, 4)
print(m_rand)

Arrays are indexed like lists
```
print(m_rand[0,0])
```

Arrays have attributes

print(m_rand.shape)
print(m_rand.size)
print(m_rand.ndim)

Arrays are fast but inflexible - the entire array must be of a single type.

(Optional) Linear algebra with NumPy

x = np.arange(10)
y = np.arange(10)

print(x)
print(y)

Operations are element-wise by default
```
print(x * y)
```

Matrix-wise operations (e.g. dot product) use NumPy functions

# Use a special operator if it exists
print(x @ y)

# Otherwise, use a numpy function
print(np.dot(x, y))

(Optional) Matlab gotcha: 1-D arrays have no transpose
```
print(x)
print(x.T)
print(x.reshape(-1,1))
```
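
A sketch of why this matters, assuming NumPy is installed: a 1-D array has only one axis, so transposing it changes nothing. To get a row or column vector you need an explicit second axis. The names v, column, and row are illustrative.

```python
import numpy as np

v = np.arange(3)            # 1-D array with shape (3,)
column = v.reshape(-1, 1)   # 2-D column with shape (3, 1)
row = v.reshape(1, -1)      # 2-D row with shape (1, 3)

print(v.shape, v.T.shape)             # transposing a 1-D array changes nothing
print(column.shape, column.T.shape)   # transposing a 2-D array swaps the axes

print(v @ v)                # inner product of two 1-D arrays: a single number
print(column @ row)         # (3, 1) @ (1, 3) gives a 3x3 outer product
```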

Challenge: Matrix operations

Create a 3x3 matrix containing the numbers 0-8. Hint: Consult the NumPy Quickstart documentation here: https://numpy.org/doc/stable/user/quickstart.html
Multiply the matrix by itself (element-wise).
Multiply the matrix by its transpose.
Divide the matrix by itself. What happens?

Solutions

# Use method chaining to link actions together
x = np.arange(9).reshape(3,3)

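print(x * x)      # element-wise product
print(x * x.T)    # element-wise product with the transpose; use x @ x.T for the matrix product
print(x / x)      # 0/0 gives nan (with a RuntimeWarning); every other entry is 1.0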

A very brief introduction to Pandas

Pandas is a library for working with spreadsheet-like data ("DataFrames")
A DataFrame is a collection (dict) of Series columns
Each Series is a 1-dimensional NumPy array with optional row labels (dict-like, similar to R vectors)
Therefore, each series inherits many of the abilities (linear algebra) and limitations (single data type) of NumPy
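
A minimal sketch of these relationships, assuming Pandas is installed. The values and the names demo_df and heights are made up for illustration.

```python
import pandas as pd

# Hypothetical values, for illustration only
demo_df = pd.DataFrame({"height_cm": [170.0, 182.5, 165.0],
                        "name": ["Ana", "Ben", "Cy"]},
                       index=["a", "b", "c"])

heights = demo_df["height_cm"]   # look up a column by key, as with a dict
print(type(heights))
print(heights.dtype)              # one dtype for the whole column
print(heights.to_numpy())         # the values as a NumPy array
print(heights.index)              # the row labels
print(heights * 2)                # element-wise arithmetic, as with NumPy
```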

(Optional) Where are we?

Python provides functions for working with the file system.

import os

# print current directory
print("Current working directory:", os.getcwd())
# print all of the files and directories
print("Working directory contents:", os.listdir())

These provide a rich Python alternative to shell functions

# Get 1 level of subdirectories
print("Just print the sub-directories:", sorted(next(os.walk('.'))[1]))

# Move down one directory
os.chdir("data")
print(os.getcwd())

# Move up one directory
os.chdir("..")
print(os.getcwd())

Reading tabular data into data frames

Import tabular data using the Pandas library

import pandas as pd

data = pd.read_csv('data/gapminder_gdp_oceania.csv')
print(data)

# Jupyter Lab will give you nice formatting if you echo
data

File and directory names are strings
You can use relative or absolute file paths

Use `index_col` to use a column’s values as row indices

Rows are indexed by number by default (0, 1, 2, …). For convenience, we want to index by country:

data = pd.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')
print(data)

By default, rows are indexed by position, like lists.
Setting the index_col parameter lets us index rows by label, like dictionaries. For this to work, the index column needs to have unique values for every row.
You can verify the contents of the CSV by double-clicking on the file in Jupyter Lab
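
To see the effect of index_col without depending on a particular file, here is a sketch that reads a tiny made-up CSV from memory using io.StringIO. The country names and numbers are invented.

```python
import io
import pandas as pd

# A tiny made-up CSV, held in memory instead of in a file
csv_text = """country,gdp_1952,gdp_1957
Atlantis,100.0,110.0
Lilliput,200.0,220.0
"""

by_position = pd.read_csv(io.StringIO(csv_text))
by_label = pd.read_csv(io.StringIO(csv_text), index_col="country")

print(by_position.index)   # default: 0, 1, ...
print(by_label.index)      # the country names

print(by_label.loc["Lilliput", "gdp_1957"])
print(by_label.index.is_unique)   # labels should be unique for unambiguous lookups
```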

Pandas help files are dense; you should prefer the online documentation

Main documentation link: https://pandas.pydata.org/docs/user_guide/index.html
Pandas can read many different data formats: https://pandas.pydata.org/docs/user_guide/io.html

Data frames are objects that can tell you about their contents

Data frames have methods (i.e. functions) that perform operations using the data frame's contents as input

Use .info() to find out more about a data frame
```
data.info()
```
Use .describe() to get summary statistics about data
```
data.describe()
```
(Optional) Look at the first few rows
```
data.head(1)
```

Data frames have fields (i.e. variables) that hold additional information

A "field" is a variable that belongs to an object.

The .index field stores the row Index (list of row labels)
```
print(data.index)
```
The .columns field stores the column Index (list of column labels)
```
print(data.columns)
```
The .shape field stores the matrix shape as (rows, columns)
```
print(data.shape)
```
Use DataFrame.T to transpose a DataFrame. This doesn't copy or modify the data, it just changes the caller's view of it.
```
print(data.T)
print(data.T.shape)
```

(Optional) Pandas introduces some new types

# DataFrame type
type(data)
type(data.T)

# Series type
type(data['gdpPercap_1952'])

# Index type
type(data.columns)

You can convert data between NumPy arrays, Series, and DataFrames
You can read data into any of the data structures from files or from standard Python containers

Beginner Challenge

Read the data in gapminder_gdp_americas.csv into a variable called americas and display its summary statistics.
After reading the data for the Americas, use help(americas.head) and help(americas.tail) to find out what DataFrame.head and DataFrame.tail do.
1. How can you display the first three rows of this data?
2. How can you display the last three columns of this data? (Hint: You may need to change your view of the data).
As well as the read_csv function for reading data from a file, Pandas provides a to_csv function to write DataFrames to files. Applying what you’ve learned about reading from files, write one of your DataFrames to a file called processed.csv. You can use help to get information on how to use to_csv.

Solution

americas = pd.read_csv('data/gapminder_gdp_americas.csv', index_col='country')
americas.describe()
americas.head(3)
americas.T.tail(3)
americas.to_csv('processed.csv')

Subsetting Data

Treat the data frame as a matrix and select values by position

Use DataFrame.iloc[..., ...] to select values by their (entry) position. The i in iloc stands for "integer" (i.e. integer position).

import pandas as pd
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')

data.iloc[0,0]

Treat the data frame as a table and select values by label

This is the most common way to get data

Use DataFrame.loc[..., ...] to select values by their label

# This returns a value
data.loc["Albania", "gdpPercap_1952"]

Shorten the column names using vectorized string methods

print(data.columns)

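# The columns index can update all of its values in a single operation.
# str.replace() removes the exact prefix text; str.strip() would instead remove
# any leading or trailing characters found in the set "gdpPercap_".
data.columns = data.columns.str.replace("gdpPercap_", "", regex=False)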
print(data.columns)
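
A caution about strip-style renaming: strip() treats its argument as a set of characters to remove from both ends, not as a prefix. The sketch below uses plain Python strings, which behave the same way as the vectorized .str methods; the second column name is hypothetical.

```python
column = "gdpPercap_1952"
print(column.strip("gdpPercap_"))          # digits are not in the set, so "1952" survives

# A hypothetical column name whose suffix shares letters with the set
tricky = "gdpPercap_pop"
print(tricky.strip("gdpPercap_"))          # strips far more than the prefix
print(tricky.replace("gdpPercap_", ""))    # removes exactly the prefix text
```

Stripping happens to work for the year columns only because digits are not in the character set.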

Use list slicing notation to get subsets of the data frame

Select multiple columns or rows using .loc and a named slice. This generalizes the concept of a slice to include labeled indexes.
```
# This returns a DataFrame
data.loc['Italy':'Poland', '1962':'1972']
```
Use : on its own to mean all columns or all rows. This is Python’s usual slicing notation, which allows you to treat data frames as multi-dimensional lists.
```
# This returns a DataFrame
data.loc['Italy':'Poland', :]
```
(Optional) If you want specific rows or columns, pass in a list
```
data.loc[['Italy','Poland'], :]
```

.iloc follows list index conventions ("up to, but not including"), but .loc does the intuitive right thing ("A through B")

index_subset = data.iloc[0:2, 0:2]
label_subset = data.loc["Albania":"Belgium", "1952":"1962"]

print(index_subset)
print(label_subset)
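
The same comparison on a small made-up data frame, where the difference in shape is easy to see. The labels resemble the lesson's data, but the numbers are invented.

```python
import pandas as pd

# Made-up numbers; labels chosen to resemble the lesson's data
demo = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                    index=["Albania", "Austria", "Belgium"],
                    columns=["1952", "1957", "1962"])

by_position = demo.iloc[0:2, 0:2]                         # stop position excluded
by_label = demo.loc["Albania":"Belgium", "1952":"1962"]   # stop label included

print(by_position.shape)
print(by_label.shape)
```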

Result of slicing can be used in further operations

subset = data.loc['Italy':'Poland', '1962':'1972']

print(subset.describe())
print(subset.max())

Insert new values using .at (for label indexing) or .iat (for numerical indexing)
```
subset.at["Italy", "1962"] = 2000
print(subset)
```

Challenge: Collection types

Calculate subset.max() and assign the result to a variable. What kind of thing is it? What are its properties?
What is the maximum value of the new variable? Can you determine this without creating an intermediate variable?

Solution

Pandas always drills down to the most parsimonious representation. On one hand, this is convenient; on the other, it violates the Pythonic expectation for strong types.

| Shape of data selection | Pandas return type |
|---|---|
| 2D | DataFrame |
| 1D | Series |
| 0D | single value |

Use method chaining
```
print(subset.max().max())
```

(Optional) Filter on label properties

.filter() always returns the same type as the original item, whereas .loc and .iloc might return a data frame or a series.
```
italy = data.filter(items=["Italy"], axis="index")
print(italy)
print(type(italy))
```

.filter() is a general-purpose, flexible method

help(data.filter)
data.filter(like="200", axis="columns")
data.filter(like="200", axis="columns").filter(items=["Italy"], axis="index")

Filtering (i.e. masking) data on contents

Use comparisons to select data based on value

Show which data frame elements match a criterion.

# Which GDPs are greater than 10,000?
subset > 10000

Use the criterion match to filter the data frame's contents. This uses index notation:
```
fs = subset[subset > 10000]
print(fs)
```
1. subset > 10000 returns a data frame of True/False values
2. subset[subset > 10000] filters its contents based on that True/False data frame
3. This section is more properly called "Masking Data," because it involves operations for overlaying a data frame's values without changing the data frame's shape. We don't drop anything from the data frame, we just replace it with NaN.
(Optional) Use .where() method to find elements that match the criterion:
```
fs = subset.where(subset > 10000)
print(fs)
```
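
A sketch of the masking behavior on a tiny made-up data frame. It shows that the shape is preserved, that non-matching entries become NaN, and that index notation and .where() give the same result.

```python
import pandas as pd

demo = pd.DataFrame({"1962": [8000.0, 12000.0], "1967": [11000.0, 9000.0]},
                    index=["A", "B"])   # hypothetical values

mask = demo > 10000          # boolean data frame, same shape as demo
masked = demo[mask]          # non-matching entries become NaN

print(mask)
print(masked)
print(masked.shape == demo.shape)
print(masked.equals(demo.where(mask)))
```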

You can filter using any method that returns a data frame

# GDP for all countries greater than the median
subset[subset > subset.median()]

# OR: subset.where(subset > subset.median())

Use method chaining to create final output without creating intermediate variables

# The .rank() method turns numerical scores into ranks
subset.rank()

# GDP ranking for all countries greater than the median
subset[subset > subset.median()].rank()

# OR: subset.where(subset > subset.median()).rank()

Working with missing data

By default, most numerical operations ignore missing data

Examples include min, max, mean, std, etc.

Missing values ignored by default

print("Column means")
print(fs.mean())

print("Row means")
print(fs.mean(axis=1))

Force inclusions with the skipna argument

print("Column means")
print(fs.mean(skipna=False))

print("Row means")
print(fs.mean(axis=1, skipna=False))

Check for missing values

Show which items are missing. "NA" includes NaN and None. It doesn't include empty strings or numpy.inf.
```
# Show which items are NA
fs.isna()
```

Count missing values

# Missing by column (the default, axis=0, counts down each column)
print(fs.isna().sum())

# Missing by row (axis=1 counts across each row)
print(fs.isna().sum(axis=1))

# Aggregate sum
fs.isna().sum().sum()

Are any values missing?
```
fs.isna().any(axis=None)
```
(Optional) Are all of the values missing?
```
fs.isna().all(axis=None)
```
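
The axis argument is easy to get backwards, so here is a sketch on a small made-up data frame where the per-column and per-row counts differ. It also shows the effect of skipna on the mean.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"x": [1.0, np.nan, 3.0],
                     "y": [np.nan, np.nan, 6.0]})   # hypothetical values

per_column = demo.isna().sum()          # default axis=0: one count per column
per_row = demo.isna().sum(axis=1)       # axis=1: one count per row
total = int(demo.isna().sum().sum())

print(per_column)
print(per_row)
print(total)

print(demo.mean())                # NaN skipped
print(demo.mean(skipna=False))    # any NaN in a column makes its mean NaN
```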

Replace missing values

Replace with a fixed value

fs_fixed = fs.fillna(99)
print(fs_fixed)

Replace values that don't meet a criterion with an alternate value

subset_fixed = subset.where(subset > 10000, 99)
print(subset_fixed)

(Optional) Impute missing values. Read the docs, this may or may not be sufficient for your needs.
```
fs_imputed = fs.interpolate()
```

Drop missing values

Drop all rows with missing values

fs_drop = fs.dropna()

Challenge: Filter and trim with a boolean vector

A DataFrame is a dictionary of Series columns. With this in mind, experiment with the following code and try to explain what each line is doing. What operation is it performing, and what is being returned?

Feel free to use print(), help(), type(), etc as you investigate.

fs["1962"]
fs["1962"].notna()
fs[fs["1962"].notna()]

Solution

Line 1 returns the column as a Series vector
Line 2 returns a boolean Series vector (True/False)
Line 3 performs boolean indexing on the DataFrame using the Series vector. It only returns the rows that are True (i.e. it performs true filtering).

Sorting and grouping

Motivating example: Calculate the wealth Z-score for each country

# Calculate z scores for all elements
z = (data - data.mean())/data.std()

# Get the mean z score for each country (i.e. across all columns)
mean_z = z.mean(axis=1)

# Group countries into "wealthy" (z > 0) and "not wealthy" (z <= 0)
z_bool = mean_z > 0

print(mean_z)
print(z_bool)
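
A sanity check you can run on any z-score calculation, sketched here with made-up numbers: after standardizing, each column should have a mean of about 0 and a standard deviation of about 1. Note that Pandas computes the sample standard deviation by default.

```python
import pandas as pd

demo = pd.DataFrame({"1952": [1.0, 2.0, 3.0], "1957": [10.0, 20.0, 60.0]},
                    index=["A", "B", "C"])   # hypothetical values

z_demo = (demo - demo.mean()) / demo.std()   # applied column by column

print(z_demo)
print(z_demo.mean())   # approximately 0 for each column
print(z_demo.std())    # approximately 1 for each column
```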

Append new columns to the data frame containing our summary statistics

Data frames are dictionaries of Series:

data["mean_z"] = mean_z
data["wealthy"] = z_bool

Sort and group by new columns

data.sort_values(by="mean_z")

# Get descriptive statistics for the group
data.groupby("wealthy").mean()
data.groupby("wealthy").describe()

Write output

Capture the results of your filter in a new file, rather than overwriting your original data.

# Save to a new CSV, preserving your original data
data.to_csv('gapminder_gdp_europe_normed.csv')

# If you don't want to preserve row names:
#data.to_csv('gapminder_gdp_europe_normed.csv', index=False)

Working with multiple tables

Concatenating data frames

surveys = pd.read_csv('data/surveys.csv', index_col="record_id")
print(surveys.shape)

df1 = surveys.head(10)
df2 = surveys.tail(10)

df3 = pd.concat([df1, df2])
print(df3.shape)

(Optional) Joining data frames (in an SQL-like manner)

Import species data

species = pd.read_csv('data/species.csv', index_col="species_id")
print(species.shape)

Join tables on common column. The "left" join is a strategy for augmenting the first table (surveys) with information from the second table (species).
```
df_join = surveys.merge(species, on="species_id", how="left")
print(df_join.head())
print(df_join.shape)
```

The resulting table loses its index because surveys.record_id is not being used in the join. To keep record_id as the index for the final table, we need to retain it as an explicit column.

# Don't set record_id as index during initial import
surveys = pd.read_csv('data/surveys.csv')
df_join = surveys.merge(species, on="species_id", how="left").set_index("record_id")

df_join.index

Get the subset of species that match a criterion, and join on that subset. The "inner" join only includes rows where both tables match on the key column; it's a strategy for filtering the first table by the second table.

# Get the taxa column, masking the rows based on which values match "Bird"
birds = species[species["taxa"] == "Bird"]
df_birds = surveys.join(birds, on="species_id", how="inner").set_index("record_id")

print(df_birds.head())
print(df_birds.shape)
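
The difference between a "left" and an "inner" join is easiest to see by comparing row counts. The sketch below uses two invented miniature tables (`surveys_toy` and `species_toy` are hypothetical names, not part of the workshop data).

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
surveys_toy = pd.DataFrame(
    {"record_id": [1, 2, 3, 4], "species_id": ["AB", "DM", "AB", "ZZ"]}
)
species_toy = pd.DataFrame(
    {"species_id": ["AB", "DM"], "taxa": ["Bird", "Rodent"]}
)
birds_toy = species_toy[species_toy["taxa"] == "Bird"]

# Left join: every survey row is kept; rows without a match get NaN
left = surveys_toy.merge(birds_toy, on="species_id", how="left")

# Inner join: only survey rows whose species_id appears in birds_toy
inner = surveys_toy.merge(birds_toy, on="species_id", how="inner")

print(left.shape)
print(inner.shape)
```

The left join keeps all four survey records, while the inner join keeps only the two bird records.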

(Optional) Text processing in Pandas

cf. https://pandas.pydata.org/docs/user_guide/text.html

Import tabular data that contains strings

species = pd.read_csv('data/species.csv', index_col='species_id')

# You can explicitly set all of the columns to type string
# species = pd.read_csv('data/species.csv', index_col='species_id', dtype='string')

# ...or specify the type of individual columns
# species = pd.read_csv('data/species.csv', index_col='species_id',
#                       dtype = {"genus": "string",
#                                "species": "string",
#                                "taxa": "string"})

print(species.head())
print(species.info())
print(species.describe())

A Pandas Series has string methods that operate on the entire Series at once

# Two ways of getting an individual column
print(type(species.genus))
print(type(species["genus"]))

# Inspect the available string methods
print(dir(species["genus"].str))

Use string methods for filtering

# Which species are in the taxa "Bird"?
print(species["taxa"].str.startswith("Bird"))

# Filter the dataset to only look at Birds
print(species[species["taxa"].str.startswith("Bird")])

Use string methods to transform and combine data

binomial_name = species["genus"].str.cat(species["species"].str.title(), " ")
species["binomial"] = binomial_name

print(species.head())

(Optional) Adding rows to DataFrames

A row is a view onto the nth item of each of the column Series. Appending rows is a performance bottleneck because it requires a separate append operation for each Series. You should concatenate data frames instead.

Create a single row as a data frame and concatenate it.

row = pd.DataFrame({"1962": 5000, "1967": 5000, "1972": 5000}, index=["Latveria"])
pd.concat([subset, row])

If you have individual rows as Series, pd.concat() will produce a data frame.

# Get each row as a Series
italy = data.loc["Italy", :]
poland = data.loc["Poland", :]

# Omitting axis argument (or axis=0) concatenates the 2 series end-to-end
# axis=1 creates a 2D data frame
# Transpose recovers original orientation
# Column labels come from Series index
# Row labels come from Series name
pd.concat([italy, poland], axis=1).T

(Optional) Scientific Computing Libraries

Libraries

SciPy projects
1. Numpy: Linear algebra
2. Pandas
3. Scipy.stats: Probability distributions and basic tests
Statsmodels: Statistical models and formulae built on Scipy.stats
Scikit-Learn: Machine learning tools built on NumPy
Tensorflow/PyTorch: Deep learning and other voodoo

The basics of Scikit-Learn

Scikit-Learn documentation: https://scikit-learn.org/stable/

Motivating example: Ordinary least squares regression

import numpy as np
from sklearn import linear_model

# Create some random data
x_train = np.random.rand(20)
y = np.random.rand(20)

# Fit a linear model
reg = linear_model.LinearRegression()
reg.fit(x_train.reshape(-1,1), y)
print("Regression slope:", reg.coef_)

Estimate model fit

from sklearn.metrics import mean_squared_error, r2_score

# Test model fit with new data
x_test = np.random.rand(20)
y_prediction = reg.predict(x_test.reshape(-1,1))

# Get model stats (x and y are independent random numbers, so expect a poor fit)
mse = mean_squared_error(y, y_prediction)
r2 = r2_score(y, y_prediction)

print("R squared:", "{:.3f}".format(r2))

(Optional) Inspect our prediction

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter(x_train, y, color="black")
ax.plot(x_test, y_prediction, color="blue")

# `fig` in Jupyter Lab
fig.show()

(Optional) Compare with Statsmodels

# Load modules and data
import statsmodels.api as sm

# Fit and summarize OLS model (center data to get accurate model fit)
mod = sm.OLS(y - y.mean(), x_train - x_train.mean())
res = mod.fit()

print(res.summary())

(Optional) Statsmodels regression example with applied data

Import data

data = pd.read_csv('data/surveys.csv')

# Check for NaN
print("Valid weights:", data['weight'].count())
print("NaN weights:", data['weight'].isna().sum())
print("Valid lengths:", data['hindfoot_length'].count())
print("NaN lengths:", data['hindfoot_length'].isna().sum())

Fit OLS regression model

from statsmodels.formula.api import ols

model = ols("weight ~ hindfoot_length", data, missing='drop').fit()
print(model.summary())

Generic parameters for all models

import statsmodels.base.model

help(statsmodels.base.model.Model)

(Optional) Things we didn't talk about

pipe
map/applymap/apply (in general you should prefer vectorized functions)

(Optional) Pandas method chaining in the wild

https://gist.githubusercontent.com/adiamaan92/d8ebee8937d271452def2a7314993b2f/raw/ce9fbb5013d94accf0779a25e182c4be77678bd0/wine_mc_example.py

# Wrap the chain in parentheses so that it can span multiple lines
(
    wine.rename(columns={"color_intensity": "ci"})
    .assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
    .query("alcohol > 14 and color_filter == 1")
    .sort_values("alcohol", ascending=False)
    .reset_index(drop=True)
    .loc[:, ["alcohol", "ci", "hue"]]
)

(Optional) Introspecting on the DataFrame object

DataFrames have a huge number of fields and methods, so dir() is not very useful
```
print(dir(data))
```

Create a new list that filters out internal attributes

df_public = [item for item in dir(data) if not item.startswith('_')]
print(df_public)

(Optional) Pretty-print the new list

import pprint

pp = pprint.PrettyPrinter(width=100, compact=True, indent=2)
pp.pprint(df_public)

Objects have fields (i.e. data/variables) and methods (i.e. functions/procedures). The difference between a method and a function is that methods are attached to objects, whereas functions are free-floating ("first-class citizens"). Methods and functions are "callable":

# Generate a list of public methods and a list of public fields. We do this
# by testing each attribute to determine whether it is "callable".
# NB: Because Python allows you to override any attribute at runtime,
# testing with `callable` is not always reliable.

# List of methods (callable attributes)
df_methods = [item for item in dir(data) if not item.startswith('_')
              and callable(getattr(data, item))]
# List of fields (non-callable attributes)
df_attr = [item for item in dir(data) if not item.startswith('_')
           and not callable(getattr(data, item))]

pp.pprint(df_methods)
pp.pprint(df_attr)

(Carpentries version) Group By: split-apply-combine

Split data according to criterion, do numeric transformations, then recombine.

# Get all GDPs greater than the mean
mask_higher = data > data.mean()

# Count the number of time periods in which each country exceeds the mean
higher_count = mask_higher.aggregate('sum', axis=1)

# Create a normalized wealth-over-time score
wealth_score = higher_count / len(data.columns)
wealth_score

A DataFrame is a spreadsheet, but it is also a dictionary of columns.
```
data['gdpPercap_1962']
```

Add column to data frame

# Wealth score is a Series
type(wealth_score)

data['normalized_wealth'] = wealth_score

Building Programs (Week 3)

Notebooks vs Python scripts

Differences between .ipynb and .py

Export notebook to .py file
Move .py file into data directory
Compare files in TextEdit/Notepad

Workflow differences between notebooks and scripts

Broadly, a trade-off between managing big code bases and making it easy to experiment. See: https://github.com/elliewix/Ways-Of-Installing-Python/blob/master/ways-of-installing.md#why-do-you-need-a-specific-tool

Interactive testing and debugging
Graphics integration
Version control
Remote scripts

(Optional) Python from the terminal

Python is an interactive interpreter (REPL)
```
python
```

Python is a command line program

# hello.py
print("Hello!")

python hello.py

(Optional) Python programs can accept command line arguments as inputs
1. List of command line inputs: sys.argv (https://docs.python.org/3/library/sys.html#sys.argv)
2. Utility for working with arguments: argparse (https://docs.python.org/3/library/argparse.html)
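
As a minimal sketch of the argparse approach: the script name `summarize.py`, the positional `filename` argument, and the `--rows` option are all hypothetical names chosen for illustration.

```python
import argparse

parser = argparse.ArgumentParser(description="Summarize a data file.")
parser.add_argument("filename", help="path to the input file")
parser.add_argument("--rows", type=int, default=5, help="number of rows to show")

# Passing a list simulates running `python summarize.py data.csv --rows 3`.
# In a real script, parser.parse_args() with no arguments reads sys.argv[1:].
args = parser.parse_args(["data.csv", "--rows", "3"])
print(args.filename, args.rows)
```

argparse converts `--rows` to an integer and generates a `--help` message from the `help` strings.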

Looping Over Data Sets

File paths as an example of increasing abstraction in program development

File paths as literal strings
File paths as string patterns
File paths as abstract Path objects

Use a `for` loop to process files given a list of their names

import pandas as pd

file_list = ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']
for filename in file_list:
    data = pd.read_csv(filename, index_col='country')
    print(filename)
    print(data.head(1))

Use glob.glob to find sets of files whose names match a pattern

Get a list of all the CSV files
```
import glob
glob.glob('data/*.csv')
```
In Unix, the term “globbing” means “matching a set of files with a pattern”. It uses shell expansion rules, not regular expressions, so there's an upper limit to how flexible it can be. The most common patterns are:
- `*` meaning “match zero or more characters”
- `?` meaning “match exactly one character”
Get a list of all the Gapminder CSV files
```
glob.glob('data/gapminder_*.csv')
```
Exclude the "all" CSV file
```
glob.glob('data/gapminder_[!all]*.csv')
```
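
Note that `[!all]` is a character class rather than a word: it matches any single character other than `a` or `l`. The pattern above excludes `gapminder_all.csv` only because that name has an `a` immediately after the underscore. The sketch below uses `fnmatch.filter`, which applies the same matching rules to a list of names; the file names are invented, and no files need to exist on disk.

```python
import fnmatch

# Hypothetical file names
names = [
    "gapminder_all.csv",
    "gapminder_gdp_asia.csv",
    "gapminder_life_expectancy.csv",
]

# [!all] rejects any name whose next character is 'a' or 'l'
matches = fnmatch.filter(names, "gapminder_[!all]*.csv")
print(matches)
```

The hypothetical `gapminder_life_expectancy.csv` is excluded as well, because it starts with `l` after the underscore.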

Use glob and a `for` loop to process batches of files

data_frames = []
for filename in glob.glob('data/gapminder_[!all]*.csv'):
    data = pd.read_csv(filename)
    data_frames.append(data)

all_data = pd.concat(data_frames)
print(all_data.shape)

Conditionals

Evaluating the truth of a statement

Value of a variable

mass = 3

print(mass == 3)
print(mass > 5)
print(mass < 4)

Membership in a collection

primes = [2, 3, 5]

print(2 in primes)
print(7 in primes)

Truth of a collection: note that any() and all() evaluate each item using .__bool__() or .__len__(), which tells you whether an item is "truthy" or "falsey" (i.e. interpreted as being true or false).
```
my_list = [2.75, "green", 0]

print(any(my_list))
print(all(my_list))
```
(Optional) Understanding "truthy" and "falsey" values in Python (cf. https://stackoverflow.com/a/53198991) Every value in Python, regardless of type, is interpreted as being True except for the following values (which are interpreted as False). "Truthy" values satisfy if or while statements; "Falsey" values do not.
1. Constants defined to be false: None and False.
2. Zero of any numeric type: 0, 0.0, 0j, Decimal(0), Fraction(0, 1)
3. Empty sequences and collections: '', (), [], {}, set(), range(0)
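
You can check these rules directly with bool(), which performs the same test that `if` and `while` apply implicitly. The example values below are chosen for illustration.

```python
# Values from the list above: all are interpreted as False
falsey_examples = [None, False, 0, 0.0, "", (), [], {}, set(), range(0)]

# Non-empty or non-zero values are interpreted as True, even "0" and [0]
truthy_examples = [1, -1, "0", " ", [0], (None,), {"a": 1}]

print([bool(v) for v in falsey_examples])
print([bool(v) for v in truthy_examples])

# any() needs at least one truthy item; all() needs every item to be truthy
my_list = [2.75, "green", 0]
print(any(my_list), all(my_list))
```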

Use `if` statements to control whether or not a block of code is executed

An if statement (more properly called a conditional statement) controls whether some block of code is executed or not.

mass = 3.5
if mass > 3.0:
    print(mass, 'is large')

mass = 2.0
if mass > 3.0:
    print(mass, 'is large')

Structure is similar to a for statement:

First line opens with if and ends with a colon
Body containing one or more statements is indented (usually by 4 spaces)

Use else to execute a block of code when an if condition is not true

else can be used following an if. This allows us to specify an alternative to execute when the if branch isn’t taken.

if mass > 3.0:
    print(mass, 'is large')
else:
    print(mass, 'is small')

Use `elif` to specify additional tests

May want to provide several alternative choices, each with its own test; use elif (short for “else if”) and a condition to specify these.

if mass > 9.0:
    print(mass, 'is HUGE')
elif mass > 3.0:
    print(mass, 'is large')
else:
    print(mass, 'is small')

Always associated with an if.
Must come before the else (which is the “catch all”).

Conditions are tested once, in order

Python steps through the branches of the conditional in order, testing each in turn. Order matters! The following is wrong:

grade = 85
if grade >= 70:
    print('grade is C')
elif grade >= 80:
    print('grade is B')
elif grade >= 90:
    print('grade is A')
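
The version above reports a C for a grade of 85, because `grade >= 70` is tested first and is already true. One way to repair it, sketched here, is to test the most restrictive condition first; the final `else` branch and its label are an addition for illustration.

```python
grade = 85

# Test the highest threshold first so that each grade lands in the right branch
if grade >= 90:
    result = 'grade is A'
elif grade >= 80:
    result = 'grade is B'
elif grade >= 70:
    result = 'grade is C'
else:
    result = 'grade is below C'

print(result)
```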

Compound Relations Using `and`, `or`, and Parentheses

Often, you want some combination of things to be true. You can combine relations within a conditional using and and or. Continuing the example above, suppose you have:

mass     = [ 3.54,  2.07,  9.22,  1.86,  1.71]
velocity = [10.00, 20.00, 30.00, 25.00, 20.00]

for m, v in zip(mass, velocity):
    if m > 5 and v > 20:
        print("Fast heavy object.  Duck!")
    elif m > 2 and m <= 5 and v <= 20:
        print("Normal traffic")
    elif m <= 2 and v <= 20:
        print("Slow light object.  Ignore it")
    else:
        print("Whoa!  Something is up with the data.  Check it")

Use () to group subsets of conditions
Aside: zip() pairs up the items of several lists so that you can loop over them together, as in the example above
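
Parentheses matter because `and` is evaluated before `or`. The following sketch uses invented values and thresholds to show the same three tests giving different answers depending on how they are grouped.

```python
mass, velocity = 9.0, 30.0   # hypothetical values

# Without parentheses this reads as: mass > 5 or (velocity > 20 and velocity < 25)
ungrouped = mass > 5 or velocity > 20 and velocity < 25

# Parentheses make a different grouping explicit
grouped = (mass > 5 or velocity > 20) and velocity < 25

print(ungrouped, grouped)
```

When a condition mixes `and` with `or`, writing the parentheses explicitly saves the reader from having to remember the precedence rules.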

Use the modulus to print occasional status messages

Conditionals are often used inside loops.

data_frames = []
for count, filename in enumerate(glob.glob('data/gapminder_[!all]*.csv')):
    # Print every other filename
    if count % 2 == 0:
        print(count, filename)
    data = pd.read_csv(filename)
    data_frames.append(data)

all_data = pd.concat(data_frames)
print(all_data.shape)

Challenge: Process small files

Iterate through all of the CSV files in the data directory. Print the file name and file length for any file that is less than 30 lines long.

Solution

for filename in glob.glob('data/*.csv'):
    data = pd.read_csv(filename)
    if len(data) < 30:
        print(filename, len(data))

(Optional) Use pathlib to write code that works across operating systems

Pathlib provides cross-platform path objects

from pathlib import Path

# Create Path objects
raw_path = Path("data")
processed_path = Path("data/processed")

print("Relative path:", raw_path)
print("Absolute path:", raw_path.absolute())

Path objects have methods that provide much better information about files and directories.

# Note the careful testing at each level of the code.
data_frames = []

if raw_path.exists():
    for filename in raw_path.glob('gapminder_[!all]*.csv'):
        if filename.is_file():
            data = pd.read_csv(filename)
            print(filename)
            data_frames.append(data)

all_data = pd.concat(data_frames)

# Check for destination folder and create if it doesn't exist
if not processed_path.exists():
    processed_path.mkdir()

all_data.to_csv(processed_path.joinpath("combined_data.csv"))
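
Path objects can also be combined and taken apart without any string manipulation. A minimal sketch, reusing the path names from above (nothing is read from or written to disk):

```python
from pathlib import Path

processed_path = Path("data") / "processed"       # same as Path("data/processed")
outfile = processed_path / "combined_data.csv"    # same as processed_path.joinpath(...)

print(outfile.name)     # file name with extension
print(outfile.stem)     # file name without extension
print(outfile.suffix)   # extension only
print(outfile.parent == processed_path)
```

The `/` operator is shorthand for joinpath(), and it produces the correct separator on each operating system.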

(Optional) Generic file handling

Pandas understands specific file types, but what if you need to work with a generic file?

Open the file with a context manager

with open("data/bouldercreek_09_2013.txt", "r") as infile:
    lines = infile.readlines()

The context manager closes the file when you're done reading it
"bouldercreek_09_2013.txt" is the name of the file
infile is a variable that refers to the file on disk

A file is a collection of lines

.readlines() produces the file contents as a list of lines; each line is a string.

print(len(lines))
print(type(lines))

# View the first 10 lines
print(lines[:10])

Strings contain formatting marks

Compare the following:

# This displays the nicely-formatted document
print(lines[0])

# This shows the true nature of the string; you can see newlines (\n),
# tabs (\t), and other hidden characters
lines[0]
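
Outside of a notebook, where a bare expression is not displayed, repr() gives the same view of the string. A small sketch with an invented line of text:

```python
# Hypothetical line of tab-separated text ending in a newline
line = "2013-09-01\t42.0\n"

print(line)         # tabs and newlines are rendered as whitespace
print(repr(line))   # escape sequences are shown explicitly
```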

(Optional) Text processing and data cleanup

Use string methods to determine which lines to keep

The file contains front matter that we can discard

tabular_lines = []
for line in lines:
    if not line.startswith("#"):
        tabular_lines.append(line)

Now the first line is tab-separated data. Note that print() renders the tabs as whitespace, whereas evaluating the string directly in a cell shows the \t characters.
```
tabular_lines[0]
```

Open an output file for writing

outfile_name = "data/tabular_data.txt"

with open(outfile_name, "w") as outfile:
    outfile.writelines(tabular_lines)

Format output as a comma-delimited text file

Strip trailing whitespace

stripped_line = tabular_lines[0].strip()
stripped_line

Split each line into a list, using the tabs as separators.

split_line = stripped_line.split("\t")
split_line

Use a special-purpose library to create a correctly-formatted CSV file

import csv

outfile_name = "data/csv_data.csv"
# newline="" lets the csv module control line endings (recommended in the csv docs)
with open(outfile_name, "w", newline="") as outfile:
    writer = csv.writer(outfile)
    for line in tabular_lines:
        csv_line = line.strip().split("\t")
        writer.writerow(csv_line)

You can initialize csv.reader and csv.writer with different "dialects" or with custom delimiters and quotechars; see https://docs.python.org/3/library/csv.html
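
As a small sketch of the difference a delimiter makes, the following writes the same invented rows twice, to in-memory buffers rather than to files.

```python
import csv
import io

# Hypothetical rows; the second field contains a comma
rows = [["site", "note"], ["A1", "cold, wet"]]

# Default dialect: comma-delimited, with quotes around fields that contain commas
comma_buffer = io.StringIO()
csv.writer(comma_buffer).writerows(rows)

# Custom delimiter: tab-separated output needs no quoting here
tab_buffer = io.StringIO()
csv.writer(tab_buffer, delimiter="\t").writerows(rows)

print(comma_buffer.getvalue())
print(tab_buffer.getvalue())
```

The automatic quoting of "cold, wet" is the main reason to prefer the csv module over joining strings by hand.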

(Optional) Avoid memory limitations by processing the input file one line at a time

infile_name = "data/bouldercreek_09_2013.txt"
outfile_name = "data/csv_data.csv"

with open(infile_name, "r") as infile, open(outfile_name, "w", newline="") as outfile:
    writer = csv.writer(outfile)
    for line in infile:
        if not line.startswith("#"):
            writer.writerow(line.strip().split("\t"))

(Optional) Notes

Pandas has utilities for reading fixed-width files: https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html

Saving datasets with new-style string formatting

for i in datasets_list:
    do_something(f'{i}.png')

Writing Functions

Break programs down into functions to make them easier to understand

Human beings can only keep a few items in working memory at a time.
Understand larger/more complicated ideas by understanding and combining pieces
Functions serve the same purpose in programs:
1. Encapsulate complexity so that we can treat it as a single “thing”
2. Remove complexity from the remaining code, making it easier to test
3. Enables re-use: Write one time, use many times

Define a function using `def` with a name, parameters, and a block of code

def print_greeting():
    print('Hello!')

Begin the definition of a new function with def, followed by the name of the function.
Must obey the same rules as variable names.
Parameters in parentheses; empty parentheses if the function doesn’t take any inputs.
Indent function body

Defining a function does not run it

print_greeting()

Like assigning a value to a variable
Must call the function to execute the code it contains.

Arguments in call are matched to parameters in definition

Positional arguments

def print_date(year, month, day):
    joined = '/'.join([str(year), str(month), str(day)])
    print(joined)

print_date(1871, 3, 19)

(Optional) Keyword arguments
```
print_date(month=3, day=19, year=1871)
```

Functions may return a result to their caller using `return`

Use return ... to give a value back to the caller. return ends the function's execution and returns you to the code that originally called the function.

def average(values):
    """Return average of values, or None if no values are supplied."""

    if len(values) == 0:
        return None
    else:
        return sum(values) / len(values)

a = average([1, 3, 4])
print(a)

You should explicitly handle common problems:
```
print(average([]))
```
Notes:
1. return can occur anywhere in the function, but functions are easier to understand if return occurs:
  1. At the start to handle special cases
  2. At the very end, with a final result
2. Docstring provides function help. Use triple quotes if you need the docstring to span multiple lines.

Challenge (text processing): Encapsulate text processing in a function

Write a function that takes line as an input and returns the information required by writer.writerow().
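
One possible solution, sketched on the assumption that each line is tab-separated as in the text processing section; the function name `line_to_row` and the sample line are hypothetical.

```python
def line_to_row(line):
    """Convert one tab-separated line into a list of fields."""
    # Strip the trailing newline, then split on tabs
    return line.strip().split("\t")

# Hypothetical input line
print(line_to_row("USGS\t06730200\t2013-09-01\n"))
```

The returned list can be passed straight to writer.writerow().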

Challenge (data normalization): Encapsulate Z score calculations in a function

Write a function that encapsulates the Z-score calculations from the Pandas workshop. The function should return two Series:
1. The mean Z score for each country over time
2. A categorical variable that identifies countries as "wealthy" or "non-wealthy"
Use the function to inspect one of the Gapminder continental datasets.

Solution

def norm_data(data):
    """Return the mean Z score for each row and a boolean "wealthy" flag."""

    # Calculate z scores for all elements
    z = (data - data.mean())/data.std()

    # Get the mean z score for each country
    mean_z = z.mean(axis=1)

    # Group countries into "wealthy" (z > 0) and "not wealthy" (z <= 0)
    z_bool = mean_z > 0

    return mean_z, z_bool

data = pd.read_csv("data/gapminder_gdp_europe.csv", index_col = "country")
mean_z, z_bool = norm_data(data)

# If you need to drop the continent column
# mean_z, z_bool = norm_data(data.drop("continent", axis=1))

(Optional) Use the function to process all files

for filename in glob.glob('data/gapminder_*.csv'):
    # Print a status message
    print("Current file:", filename)

    # Read the data into a DataFrame and modify it
    data = pd.read_csv(filename, index_col = "country")
    mean_z, z_bool = norm_data(data)

    # Append to DataFrame
    data["mean_z"] = mean_z
    data["wealthy"] = z_bool

    # Generate an output file name
    parts = filename.split(".csv")
    newfile = ''.join([parts[0], "_normed.csv"])
    data.to_csv(newfile)

(Optional) A worked example: The Lorenz attractor

https://matplotlib.org/stable/gallery/mplot3d/lorenz_attractor.html

(Carpentries version) Conditionals

Use `if` statements to control whether or not a block of code is executed

An if statement (more properly called a conditional statement) controls whether some block of code is executed or not.

mass = 3.54
if mass > 3.0:
    print(mass, 'is large')

mass = 2.07
if mass > 3.0:
    print(mass, 'is large')

Structure is similar to a for statement:

First line opens with if and ends with a colon
Body containing one or more statements is indented (usually by 4 spaces)

Conditionals are often used inside loops

Not much point using a conditional when we know the value (as above), but useful when we have a collection to process.

masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 3.0:
        print(m, 'is large')

Use else to execute a block of code when an if condition is not true

else can be used following an if. This allows us to specify an alternative to execute when the if branch isn’t taken.

masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 3.0:
        print(m, 'is large')
    else:
        print(m, 'is small')

Use `elif` to specify additional tests

May want to provide several alternative choices, each with its own test; use elif (short for “else if”) and a condition to specify these.

masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 9.0:
        print(m, 'is HUGE')
    elif m > 3.0:
        print(m, 'is large')
    else:
        print(m, 'is small')

Always associated with an if.
Must come before the else (which is the “catch all”).

Conditions are tested once, in order

Python steps through the branches of the conditional in order, testing each in turn. Order matters! The following is wrong:

grade = 85
if grade >= 70:
    print('grade is C')
elif grade >= 80:
    print('grade is B')
elif grade >= 90:
    print('grade is A')

Use conditionals in a loop to “evolve” the values of variables

velocity = 10.0
for i in range(5): # execute the loop 5 times
    print(i, ':', velocity)
    if velocity > 20.0:
        velocity = velocity - 5.0
    else:
        velocity = velocity + 10.0
print('final velocity:', velocity)

This is how dynamical systems simulations work

Compound Relations Using `and`, `or`, and Parentheses (optional)

Often, you want some combination of things to be true. You can combine relations within a conditional using and and or. Continuing the example above, suppose you have:

mass     = [ 3.54,  2.07,  9.22,  1.86,  1.71]
velocity = [10.00, 20.00, 30.00, 25.00, 20.00]

i = 0
for i in range(5):
    if mass[i] > 5 and velocity[i] > 20:
        print("Fast heavy object.  Duck!")
    elif mass[i] > 2 and mass[i] <= 5 and velocity[i] <= 20:
        print("Normal traffic")
    elif mass[i] <= 2 and velocity[i] <= 20:
        print("Slow light object.  Ignore it")
    else:
        print("Whoa!  Something is up with the data.  Check it")

Use () to group subsets of conditions
Aside: For a more natural way of working with many lists, look at zip()

Visualization with Matplotlib and Seaborn (Week 4)

Orientation

Briefly revisit week 1

Python orientation
Jupyter orientation

A brief history of plotting in Matplotlib

Multiple interfaces
Local graphs and global settings
Matplotlib is the substrate for higher-level libraries
Drawing things is verbose in any language

Plotting with Matplotlib

The basic plot

import matplotlib.pyplot as plt
fig, ax = plt.subplots()

time = [0, 1, 2, 3]
position = [0, 100, 200, 300]

ax.plot(time, position)

Two kinds of plotting objects

type(fig)

print(type(fig))
print(type(ax))

Figure objects handle display, printing, saving, etc.
Axes objects contain graph information

(Optional) Three ways of showing a figure

Show figure inline (Jupyter Lab default)
```
fig
```
Show figure in a separate window (command line default)
```
fig.show()
```
Show figure in a separate window from Jupyter Lab. You may need to specify a different "backend" parameter for matplotlib.use() depending on your exact setup: https://matplotlib.org/stable/tutorials/introductory/usage.html#the-builtin-backends
```
import matplotlib

matplotlib.use('TkAgg')

fig.show()
```

The lifecycle of a custom plot

Create mock data

import numpy as np

y = np.random.random(10) # outputs an array of 10 random numbers between 0 and 1
x = np.arange(1980,1990,1) # generates an ordered array of numbers from 1980 to 1989

# Check that x and y contain the same number of values
assert len(x) == len(y)

Inspect our data
```
print("x:", x)
print("y:", y)
```

Create the basic plot

# Convert y axis into a percentage
y = y * 100

# Draw plot
fig, ax = plt.subplots()
ax.plot(x, y)

Show available styles

# What are the global styles?
plt.style.available

# Set a global figure style
plt.style.use("dark_background")

# The style is only applied to new figures, not pre-existing figures
fig

# Re-creating the figure applies the new style
fig, ax = plt.subplots()
ax.plot(x, y)

Customize the graph. In principle, nearly every element on a Matplotlib figure is independently modifiable.

# Set figure size
fig, ax = plt.subplots(figsize=(8,6))

# Set line attributes
ax.plot(x, y, color='darkorange', linewidth=2, marker='o')

# Add title and labels
ax.set_title("Percent Change in Stock X", fontsize=22, fontweight='bold')
ax.set_xlabel(" Years ", fontsize=20, fontweight='bold')
ax.set_ylabel(" % change ", fontsize=20, fontweight='bold')

# Adjust the tick labels
ax.tick_params(axis='both', which='major', labelsize=18)

# Add a grid
ax.grid(True)

Save your figure

fig.savefig("mygraph_dark.png", dpi=300)

Plotting multiple data sets

In this example, plot GDP over time for multiple countries.

Import data

import pandas as pd
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')

# Inspect our data
data.head(3)

Transform column headers into an ordinal scale
1. (Optional) Original column names are object (i.e. string) data
```
data.columns
```
2. Strip off non-numeric portion of each column title
```
years = data.columns.str.strip('gdpPercap_')
years
```
3. Convert years strings into integers and replace original data frame column headers
```
data.columns = years.astype(int)
```
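
A caveat about step 2: strip() treats its argument as a set of characters to remove from both ends, not as a prefix. It works here only because the years consist of digits, none of which appear in `'gdpPercap_'`. The plain-string sketch below shows the difference; the label `gdpPercap_cap` is invented to illustrate the failure case.

```python
# Digits are not in the character set, so the year survives
print("gdpPercap_1952".strip("gdpPercap_"))

# A hypothetical label made only of characters from the set is stripped away entirely
print(repr("gdpPercap_cap".strip("gdpPercap_")))

# Replacing the exact substring is more predictable
print("gdpPercap_1952".replace("gdpPercap_", ""))
```

The same replace() method is available on column labels as `data.columns.str.replace()`.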

Extract rows from the DataFrame

x_years = data.columns
y_austria = data.loc['Austria']
y_bulgaria = data.loc['Bulgaria']

Create the plot object

# Change global background back to default
plt.style.use("default")

# Create GDP figure
fig, ax = plt.subplots(figsize=(8,6))

# Create GDP plot
ax.plot(x_years, y_austria, label='Austria', color='darkgreen', linewidth=2, marker='x')
ax.plot(x_years, y_bulgaria, label='Bulgaria', color='maroon', linewidth=2, marker='o')

# Decorate the plot
ax.legend(fontsize=16, loc='upper center') #automatically uses labels
ax.set_title("GDP of Austria vs Bulgaria", fontsize=22, fontweight='bold')
ax.set_xlabel("Years", fontsize=20, fontweight='bold')
ax.set_ylabel("GDP", fontsize=20, fontweight='bold')

(Optional) Plot directly from Pandas

Don't do this.

The basic plot syntax

ax = data.loc['Austria'].plot()
fig = ax.get_figure()
fig

Decorate your Pandas plot

ax = data.loc['Austria'].plot(figsize=(8,6), color='darkgreen', linewidth=2, marker='*')
ax.set_title("GDP of Austria", fontsize=22, fontweight='bold')
ax.set_xlabel("Years",fontsize=20, fontweight='bold' )
ax.set_ylabel("GDP",fontsize=20, fontweight='bold' )

fig = ax.get_figure()
fig

Overlaying multiple plots on the same figure with Pandas. This is super unintuitive.

# Create an Axes object with the Austria data
ax = data.loc['Austria'].plot(figsize=(8,6), color='darkgreen', linewidth=2, marker='*')
print("Austria graph", id(ax))

# Overlay the Bulgaria data on the same Axes object
ax = data.loc['Bulgaria'].plot(color='maroon', linewidth=2, marker='o')
print("Bulgaria graph", id(ax))

The equivalent Matplotlib plot (optional)

# extract the x and y values from dataframe
x_years = data.columns
y_gdp = data.loc['Austria']

# Create the plot
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(x_years, y_gdp, color='darkgreen', linewidth=2, marker='x')
# etc.

Visualization Strategy

There are many kinds of plots

# Visualize the same data using a scatterplot
plt.style.use('ggplot')

# Create a scatter plot
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(y_austria, y_bulgaria, color='blue', linewidth=2, marker='o')

# Decorate the plot
ax.set_title("GDP of Austria vs Bulgaria", fontsize=22, fontweight='bold')
ax.set_xlabel("GDP of Austria",fontsize=20, fontweight='bold' )
ax.set_ylabel("GDP of Bulgaria",fontsize=20, fontweight='bold' )

Read the docs

Matplotlib gallery: https://matplotlib.org/stable/gallery/index.html
1. "Plotting categorical variables" example of multiple subplots
2. Download code examples
3. .py vs .ipynb
Matplotlib tutorials: https://matplotlib.org/stable/tutorials/index.html
Seaborn gallery: https://seaborn.pydata.org/examples/index.html
Seaborn tutorials: https://seaborn.pydata.org/tutorial.html

Workflow strategy

Get in the ball park
Look at lots of data
Try lots of presets
Customize judiciously
Build collection of interactive and publication code snippets

Fast visualization and theming with Seaborn

Seaborn is a set of high-level pre-sets for Matplotlib.

Seaborn is a nice way to look at your data

# Import the Seaborn library
import seaborn as sns

ax = sns.lineplot(data=data.T, legend=False, dashes=False)

Doing more with this data set requires transforming the data from wide form to long form; see https://seaborn.pydata.org/tutorial/data_structure.html
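
A minimal sketch of the wide-to-long transformation using melt(). The two-country table and the column names `year` and `gdp` are invented for illustration.

```python
import pandas as pd

# Hypothetical wide-form table: one row per country, one column per year
wide = pd.DataFrame(
    {1952: [100.0, 200.0], 1957: [110.0, 220.0]},
    index=pd.Index(["A", "B"], name="country"),
)

# Long form: one row per (country, year) observation
long_form = wide.reset_index().melt(
    id_vars="country", var_name="year", value_name="gdp"
)
print(long_form)
```

In long form, each variable has its own column, so a call such as `sns.lineplot(data=long_form, x="year", y="gdp", hue="country")` can map columns to plot elements by name.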

Using preset styles

Let's make a poster!

Import Iris data set https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv
```
iris = pd.read_csv("../data/iris.csv")
iris.head()
```

Create a basic scatter plot

ax = sns.scatterplot(data=iris, x='sepal_length',y='petal_length')

Change plotting theme

plt.style.use("dark_background")

# Fix grid if necessary
#plt.rcParams["axes.grid"] = False

# Make everything visible at a distance
sns.set_context('poster')

# Color by species
ax = sns.scatterplot(data=iris, x='sepal_length', y='petal_length', hue='species',
                     palette='colorblind', size='petal_width')

# Place legend
ax.legend(bbox_to_anchor=(2,1))

Read more about loc vs. bbox_to_anchor in the legend documentation: https://matplotlib.org/stable/api/legend_api.html

The Seaborn plot uses Matplotlib under the hood

# Set the figure size
fig = ax.get_figure()
fig.set_size_inches(8,6)

fig

(Optional) There are many styling options

Add styling to individual points

ax = sns.scatterplot(data=iris, x='sepal_length', y='petal_length', hue='species', palette='colorblind', style='species')

Prettify column names

words = [' '.join(i) for i in iris.columns.str.split('_')]
iris.columns = words

Make a regression plot

# Fit a regression line through the scatter points
ax = sns.regplot(data=iris, x='sepal_length', y='petal_length', scatter=True,
                 scatter_kws={'color':'white'})

(Optional) Bar Charts

Bar Plot
```
ax = sns.barplot(data=iris, x='species', y='sepal_width', palette='colorblind')
```
- Default summary statistic is mean, and default error bars are 95% confidence interval.

Add custom parameters

# Error bars show standard deviation
# (Seaborn 0.12 and later spell this errorbar='sd'; ci='sd' is the older form)
ax = sns.barplot(data=iris, x='species', y='sepal_width', ci='sd', edgecolor='black')

(Optional) Count plot counts the records in each category

ax = sns.countplot(data=iris, x='species', palette='colorblind')

(Optional) Histograms

Histogram of overall data set
```
ax = sns.histplot(data=iris, x='petal_length', kde=True)
```
- KDE: If True, compute a kernel density estimate to smooth the distribution and show on the plot as (one or more) line(s).
- There seems to be a bimodal distribution of petal length. What factors underlie this distribution?

Histogram of data decomposed by category

ax = sns.histplot(data=iris, x='petal_length', hue='species', palette='Set2')

Create multiple subplots to compare binning strategies

# This generates 3 subplots (ncols=3) on the same figure
fig, axes = plt.subplots(figsize=(12,4), nrows=1, ncols=3)

# Note that we can use Seaborn to draw on our Matplotlib figure
sns.histplot(data=iris,x='petal_length', bins=5, ax=axes[0], color='#f5a142')
sns.histplot(data=iris,x='petal_length', bins=10, ax=axes[1], color='maroon')
sns.histplot(data=iris,x='petal_length', bins=15, ax=axes[2], color='darkmagenta')

(Optional) Box Plots and Swarm Plots

Box plot

ax = sns.boxplot(data=iris, x='species', y='petal_length')

Swarm plot

ax = sns.swarmplot(data=iris,x='species', y='petal_length', hue='species', palette='Set1')
ax.legend(loc='upper left', fontsize=16)
ax.tick_params(axis='x', labelrotation = 45)

This gives us a format warning.

Strip plot

ax = sns.stripplot(data=iris,x='species', y='petal_length', hue='species', palette='Set1')
ax.legend(loc='upper left', fontsize=16)
ax.tick_params(axis='x', labelrotation = 45)

Overlapping plots

ax = sns.boxplot(data=iris, x='species', y='petal_length')
sns.stripplot(data=iris, x='species', y='petal_length', ax=ax, palette='Set1')

(Optional) How Matplotlib works

Understanding Matplotlib

Everything is an Artist (object)
Multiple levels of specificity
- plt vs axes
- rcParams vs temporary stylings
Simplified high-level interfaces, aka "syntactic sugar"
- legend() vs get legend handles and patches

Matplotlib object syntax

The object.set_field(value) usage is taken from Java, which was popular in 2003 when Matplotlib was developing its object-oriented syntax
You get values back out with object.get_field()
The Pythonic way to set a value would be object.field = value. However, the Matplotlib getters and setters do a lot of internal bookkeeping, so if you try to set field values directly you will get errors. For example, compare ax.get_ylabel() with ax.yaxis.label.
Read "The Lifecycle of a Plot": https://matplotlib.org/stable/tutorials/introductory/lifecycle.html
Read "Why you hate Matplotlib": https://ryxcommar.com/2020/04/11/why-you-hate-matplotlib/

Special Topics

Working with unstructured files

Open the file with a context manager

with open('pettigrew_letters_ORIGINAL.txt', 'r') as file_in:
    text = file_in.read()

print(len(text))

Strings contain formatting marks

Compare the following:

# This displays the nicely-formatted document
print(text[:300])

# This shows the true nature of the string; you can see newlines (\n),
# tabs (\t), and other hidden characters
text[:300]
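
The same difference can be seen on a tiny made-up string, without the letters file. In this sketch, `repr()` returns the raw form that Jupyter echoes for the last value in a cell.

```python
# Hypothetical sample string containing a tab and two newlines
sample = "Box 1\tFolder 2\nDear Sir,\n"

print(sample)        # Formatted: the tab and newlines are rendered
print(repr(sample))  # Raw: the escape sequences \t and \n are visible

# Each escape sequence is a single character, not two
print(len("\n"))
```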

Many ways of handling a file

`.read()` produces the file contents as one string

type(text)

`.readlines()` produces the file contents as a list of lines; each line is a string

with open('pettigrew_letters_ORIGINAL.txt', 'r') as file_in:
    text = file_in.readlines()

print(len(text))
print(type(text))

Inspect parts of the file using list syntax

# View the first 10 lines
text[:10]
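
To try these approaches without the letters file, the sketch below writes a small temporary file (its contents are made up) and reads it with `.read()` and `.readlines()`. It also shows `.splitlines()`, a string method that returns the lines without their trailing newline characters.

```python
import os
import tempfile

# Write a small made-up file so the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("line one\nline two\nline three\n")
    path = tmp.name

with open(path, 'r') as file_in:
    as_string = file_in.read()          # One string

with open(path, 'r') as file_in:
    as_lines = file_in.readlines()      # List of strings, newlines kept

clean_lines = as_string.splitlines()    # List of strings, newlines removed

print(type(as_string), len(as_string))
print(as_lines[:2])
print(clean_lines[:2])

# Remove the temporary file
os.remove(path)
```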

Working with unstructured file data

Contents of pettigrew_letters_ORIGINAL.txt

Intro material
Manifest of letters
Individual letters

Query: Are all the letters in the manifest actually there?

Check whether all the letters reported in the manifest appear in the actual file
Check whether all the letters in the file are reported in the manifest
Therefore, construct two variables: (1) a list of every location line from the manifest, and (2) a list of every location line within the file proper

Get the manifest by visual inspection

manifest_list = text[14:159]

Use string functions to clean up and inspect text

Demonstrate string tests with manifest_list:

# Raw text
for location in manifest_list[:10]:
    print(location)

# Remove extra whitespace
for location in manifest_list[:10]:
    print(location.strip())

# Test whether the cleaned line starts with 'Box '
for location in manifest_list[:10]:
    stripped_line = location.strip()
    print(stripped_line.startswith('Box '))

# Test whether the cleaned line starts with 'box '
for location in manifest_list[:10]:
    stripped_line = location.strip()
    print(stripped_line.startswith('box '))

Gather all the locations in the full document

letters = text[162:]

for line in letters[:25]:
    # Create variables to hold the current line and the truth value of is_box
    stripped_line = line.strip()
    is_box = stripped_line.startswith('Box ')
    if is_box:
        print(stripped_line)
    # If the line is empty, don't print anything
    # (after strip(), an empty line is '' rather than '\n')
    elif stripped_line == '':
        continue
    # Indent non-Box lines
    else:
        print('---', stripped_line)

Before we automate everything, we run the code with lots of print() statements so that we can see what's happening

Collect the positive results

letter_locations = []

for line in letters:
    stripped_line = line.strip()
    is_box = stripped_line.startswith("Box ")
    if is_box:
        letter_locations.append(stripped_line)

Compare the manifest and the letters

print('Items in manifest:', len(manifest_list))
print('Letters:', len(letter_locations))

Follow-up questions

Which items are in one list but not the other?
Are there other structural regularities you could use to parse the data? (Note that in the letters, sometimes there are multiple letters under a single box header)
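
One possible approach to the first question is to compare the two lists as sets. The sketch below uses short made-up lists in place of `manifest_list` and `letter_locations`; with the real data, the manifest lines would need their whitespace stripped first.

```python
# Hypothetical stand-ins for the cleaned manifest and the letter headers
manifest = ['Box 1 Folder 1', 'Box 1 Folder 2', 'Box 2 Folder 1']
found = ['Box 1 Folder 1', 'Box 2 Folder 1', 'Box 3 Folder 1']

# Set difference keeps the items that appear in one collection only
missing_from_letters = sorted(set(manifest) - set(found))
missing_from_manifest = sorted(set(found) - set(manifest))

print('In manifest but not in letters:', missing_from_letters)
print('In letters but not in manifest:', missing_from_manifest)
```

Note that sets discard duplicates, so this comparison will not reveal a location that appears a different number of times in each list.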

Exception handling

Explicitly handle common errors, rather than waiting for your code to blow up.

def average(values):
    "Return average of values, or None if no values are supplied."

    if len(values) == 0:
        return None
    return sum(values) / len(values)

print(average([3, 4, 5]))       # Prints expected output
print(average([]))              # Explicitly handles possible divide-by-zero error
print(average(4))               # Unhandled exception

def average(values):
    "Return average of values, or an informative error if bad values are supplied."

    try:
        return sum(values) / len(values)
    except ZeroDivisionError as err:
        return err
    except TypeError as err:
        return err

print(average([3, 4, 5]))
print(average(4))
print(average([]))

Use judiciously, and be as specific as possible. When in doubt, allow your code to blow up rather than silently commit errors.
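
As one illustration of being specific, the variant below is a sketch (not part of the lesson code above) that catches only `TypeError` and re-raises it with a clearer message, so the caller cannot mistake an error object for a result.

```python
def average(values):
    "Return average of values, or None if no values are supplied."
    try:
        if len(values) == 0:
            return None
        return sum(values) / len(values)
    except TypeError as err:
        # Re-raise with context instead of returning the error object
        raise TypeError(f"average() expects a collection of numbers, got {values!r}") from err

print(average([3, 4, 5]))
print(average([]))

# The caller decides how to respond to bad input
try:
    average(4)
except TypeError as err:
    print("Caught:", err)
```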

Performance and profiling

import time
import cProfile
import pstats

def my_fun(val):
    # Get 1st timestamp
    t1 = time.time()

    # do work

    # Get 2nd timestamp
    t2 = time.time()
    print(round(t2 - t1, 3))

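# Run the function with the profiler and collect stats
val = None  # Placeholder argument; replace with real input
cProfile.run('my_fun(val)', 'dumpstats')
s = pstats.Stats('dumpstats')
# Show the 10 most expensive calls by cumulative time
s.sort_stats('cumulative').print_stats(10)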
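
A self-contained sketch of the same workflow follows. The function `slow_sum` is a made-up workload, and the example uses a `cProfile.Profile` object instead of a stats file so that nothing is written to disk.

```python
import cProfile
import io
import pstats
import time

def slow_sum(n):
    "Made-up workload: sum the squares of the first n integers."
    total = 0
    for i in range(n):
        total += i * i
    return total

# Wall-clock timing with a high-resolution timer
t1 = time.perf_counter()
result = slow_sum(100_000)
t2 = time.perf_counter()
print('Elapsed seconds:', round(t2 - t1, 3))

# Profile the same call and report the most expensive functions
profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats('cumulative').print_stats(5)
print(report.getvalue())
```

Timing answers "how long did this take?", while the profiler answers "where did the time go?"; the two are usually used together.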

Reducing memory usage

Read a file one line at a time

with open('pettigrew_letters_ORIGINAL.txt', 'r') as file_in:
    for line in file_in:
        # Do stuff to current line
        pass
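
For example, the box-header count from the letters exercise could be computed without holding the whole file in memory. The sketch below uses a small made-up temporary file in place of pettigrew_letters_ORIGINAL.txt.

```python
import os
import tempfile

# Made-up stand-in for the letters file
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("Box 1 Folder 1\nDear Sir,\n\nBox 2 Folder 3\nYours truly,\n")
    path = tmp.name

box_count = 0
with open(path, 'r') as file_in:
    for line in file_in:
        # Only the current line is held in memory
        if line.strip().startswith('Box '):
            box_count += 1

print('Box headers:', box_count)

# Remove the temporary file
os.remove(path)
```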

Use a SQLite database

import sqlite3

conn = sqlite3.connect('my_database_name.db')
with conn:
    c = conn.execute("SELECT column_name FROM table_name WHERE criterion")
    results = c.fetchall()
    c.close()

# Do stuff with `results`
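
To experiment without an existing database file, the sketch below builds a throwaway in-memory database. The table name, column names, and rows are invented for the example. The `?` placeholder passes the criterion as a parameter rather than pasting it into the SQL string.

```python
import sqlite3

# ':memory:' creates a temporary database that disappears on close
conn = sqlite3.connect(':memory:')

with conn:  # Commits the transaction on success, rolls back on error
    conn.execute("CREATE TABLE letters (box INTEGER, folder INTEGER)")
    conn.executemany("INSERT INTO letters VALUES (?, ?)",
                     [(1, 1), (1, 2), (2, 1)])

# Fetch only the rows that match the criterion
cursor = conn.execute(
    "SELECT folder FROM letters WHERE box = ? ORDER BY folder", (1,))
results = cursor.fetchall()
cursor.close()

# `with conn:` does not close the connection, so close it explicitly
conn.close()

print(results)
```

Only the rows selected by the query are loaded into Python; the rest of the table stays in the database.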

Endnotes

Credits

Plotting and Programming in Python (Pandas-oriented): http://swcarpentry.github.io/python-novice-gapminder/
Programming with Python (NumPy-oriented): https://swcarpentry.github.io/python-novice-inflammation/index.html
Python for Ecology: https://datacarpentry.org/python-ecology-lesson/
Humanities Python Tour (file and text processing): https://github.com/elliewix/humanities-python-tour/blob/master/Two-Hour-Beginner-Tour.ipynb
Introduction to Cultural Analytics & Python: https://melaniewalsh.github.io/Intro-Cultural-Analytics/welcome.html
Rhondene Wint: Matplotlib and Seaborn notes
Fruit Alphabet: https://en.wikibooks.org/wiki/Wikijunior:Fruit_Alphabet

References

Data Sources

Gapminder data: http://swcarpentry.github.io/python-novice-gapminder/files/python-novice-gapminder-data.zip
Ecology data (field surveys): https://datacarpentry.org/python-ecology-lesson/data/portal-teachingdb-master.zip
Social Science data (SAFI): https://datacarpentry.org/socialsci-workshop/data/
Humanities data (Pettigrew letters): http://dx.doi.org/10.5334/data.1335350291

