So far in the course, when writing Python code, we’ve either used the Python interpreter in the terminal or saved a Python script. However, there’s a third, very popular way to write Python code: Jupyter notebooks (https://jupyter.org/). While many of you are already familiar with Jupyter notebooks, this lesson will serve as a refresher and an introduction to some of their features, as well as how to use them with the Pandas library.
Open it in Jupyter and run the code examples as you read through the lesson. You can experiment, modify examples, and practice everything in real time!
Getting Started: Virtual Environment Setup
Next, we need to register our virtual environment as a kernel so Jupyter can find it:

python -m ipykernel install --user --name=is310-class-env # Or whatever you named your virtual environment
Running Jupyter Notebooks & Localhost
jupyter notebook
You should see the Jupyter interface open in your browser. What is the domain name?
New Notebook
Renaming & Saving Notebooks
Key Shortcuts
| Mac | Jupyter Function | Windows |
| --- | --- | --- |
| Shift + Return | Run cell (Both modes) | Shift + Enter |
| Option + Return | Create new cell below (Both modes) | Alt + Enter |
| B | Create new cell below (Command mode) | B |
| A | Create new cell above (Command mode) | A |
| D + D | Delete cell (Command mode) | D + D |
| Z | Undo cell action (Command mode) | Z |
| Shift + M | Merge cells (Command mode) | Shift + M |
| Control + Shift + - | Split cell into two cells (Edit mode) | Control + Shift + - |
| Tab | Autocomplete file/variable/function name (Edit mode) | Tab |
Jupyter Notebooks in VS Code
How can we quit the Jupyter notebook server and work in VS Code instead?
Writing Code & Markdown in Jupyter Notebooks
Try pasting the following within the first cell of the Jupyter notebook. What should we see in our Notebook?
```python
def check_movie_release(movie):
    if movie['release_year'] < 2000:
        print(f"{movie['name']} was released before 2000")
    else:
        print(f"{movie['name']} was released after 2000")
        return movie['name']

recent_movies = []
favorite_movies = [
    {
        "name": "The Matrix IV",
        "release_year": 2022,
        "sequels": ["The Matrix I", "The Matrix II", "The Matrix III"]
    },
    {
        "name": "Star Wars IV",
        "release_year": 1977,
        "sequels": ["Star Wars V", "Star Wars VI", "Star Wars VII", "Star Wars VIII", "Star Wars IX"],
        "prequels": ["Star Wars I", "Star Wars II", "Star Wars III"]
    },
    {
        "name": "The Lord of the Rings: The Fellowship of the Ring",
        "release_year": 2001,
        "sequels": ["The Two Towers", "The Return of the King"]
    }
]

for movie in favorite_movies:
    result = check_movie_release(movie)
    if result is not None:
        recent_movies.append(result)
print(recent_movies)
```
Important: Notebooks Hold State
Unlike scripts, notebooks hold variables in memory, which means that our function now exists in the notebook and we can call it in a new cell.
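As a quick illustration of this statefulness, imagine running these two cells one after the other (the variable and function names here are invented for the sketch):

```python
# Cell 1: define a variable and a function, then run the cell
greeting = "Hello, Jupyter!"

def shout(text):
    return text.upper()

# Cell 2, run later: both names are still in the notebook's memory
print(shout(greeting))
```

If you restart the kernel, that memory is cleared and Cell 2 would fail until Cell 1 is run again.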
To test this out, add a new cell below our function one by pressing the + Code symbol when you hover. Then paste the following code and run it.
```python
def updated_check_movie_release(movie, released_after_year, released_before_year=2024):
    if released_after_year < movie['release_year'] and movie['release_year'] < released_before_year:
        movie['recent'] = True
    else:
        movie['recent'] = False
    return movie
```
Now we can call this function in our for-loop by updating the original code:
```python
for movie in favorite_movies:
    updated_movie = updated_check_movie_release(movie, 2020)
    if updated_movie['recent']:
        recent_movies.append(updated_movie['name'])
```
You’ll notice the first line of the tutorial is the following syntax:
import pandas as pd
How can we check the version of Pandas?
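One way is the module's version attribute, which pandas (like most Python packages) exposes as a string:

```python
import pandas as pd

# The installed version is stored as a string on the module itself
print(pd.__version__)
```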
Reading Data with Pandas
How could we use a remote URL to read data?
Read CSV with Pandas
```python
import pandas as pd

parks_data_df = pd.read_csv('https://raw.githubusercontent.com/melaniewalsh/responsible-datasets-in-context/main/datasets/national-parks/US-National-Parks_RecreationVisits_1979-2023.csv')
```
How can we see the type of data we have in our parks_data_df variable?
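Python's built-in type() function works here, as it does for any object (the tiny DataFrame below is a stand-in for the parks data, which the lesson reads from a URL):

```python
import pandas as pd

# Small stand-in for the parks DataFrame
parks_data_df = pd.DataFrame({"Year": [1979, 1980], "RecreationVisits": [1000, 1200]})

# type() reveals the class of any Python object
print(type(parks_data_df))
```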
What is a DataFrame?
DataFrames are the primary data structures or Classes in pandas and are defined as:
A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the data.frame in R.
Is there a built-in Python method we could use to learn more?
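One option is Python's built-in help() function, which prints the docstring for any object or class:

```python
import pandas as pd

# help() prints the documentation for the DataFrame class
help(pd.DataFrame)
```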
Exploring DataFrames
How could we see what is in our DataFrame?
What are some of the built-in methods we can use to explore our DataFrame?
How might we see the first few rows of our DataFrame?
How might we see the number of rows and columns in our DataFrame?
```python
parks_data_df.head()    # First few rows
parks_data_df.shape     # Number of rows and columns
parks_data_df.dtypes    # Data types of each column
```
We can get an overview of our data using two additional methods: info() and describe(). The info() method gives us a summary of the DataFrame, including the number of non-null values in each column, while the describe() method gives us summary statistics for the numeric columns.
```python
parks_data_df.info()
parks_data_df.describe()
```
Indexing & Selecting Data with Pandas
At a high level, we can select data using the following methods:
Selecting columns: We can select a single column using a single bracket, or multiple columns using double brackets.
Selecting rows: We can select rows using the loc[] and iloc[] methods.
Boolean indexing: We can filter data based on conditions.
Setting values: We can set values in the DataFrame using the loc[] method.
Using the query() method: We can filter data using a query string.
Using the filter() method: We can filter data based on labels.
Using the at[] and iat[] methods: We can access a single value for a row/column label pair or by integer position.
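Each of these methods can be sketched on a toy DataFrame (the column names and values below are invented stand-ins for the parks data):

```python
import pandas as pd

# Toy stand-in for the parks data
df = pd.DataFrame({
    "state": ["IL", "CA", "CA"],
    "year": [2021, 2022, 2023],
    "visits": [100, 200, 300],
})

single_col = df["year"]                        # single brackets -> Series
two_cols = df[["year", "state"]]               # double brackets -> DataFrame
by_label = df.loc[0, "state"]                  # label-based selection
by_position = df.iloc[0, 0]                    # integer-position selection
filtered = df[df["visits"] > 150]              # boolean indexing
df.loc[df["state"] == "IL", "visits"] = 101    # setting values with loc[]
queried = df.query("visits > 150")             # filtering with a query string
year_cols = df.filter(like="ye")               # filtering by column label
one_value = df.at[1, "year"]                   # single value by label pair
same_value = df.iat[1, 1]                      # single value by integer position
```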
Selecting Columns
Let’s start by trying to select columns and explore the Year column. Try typing in one cell parks_data_df['Year'] and then in the following cell parks_data_df[['Year']].
What differences do you notice?
Let’s try out some examples:
Type parks_data_df[0:5] in a cell and run it. What results do you get?
Type parks_data_df[['Year', 'Region']] in a cell and run it. What results do you get?
Series vs DataFrame
Exploring Series
How could we get a list of all unique values in a Year column?
You should get a long list of all the values in the column as your output. However, it might be hard to tell which values are unique. We can use the unique() method to see just the distinct values in the column: https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html.
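Here is a minimal sketch with a stand-in Series of repeated years:

```python
import pandas as pd

# Stand-in Series with repeated years
years = pd.Series([1979, 1979, 1980, 1981, 1981])

print(years.unique())   # distinct values, in order of appearance
print(years.nunique())  # how many distinct values there are
```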
Renaming Columns
While the columns in the parks_data_df are capitalized, usually we try to keep column names lowercase and use underscores instead of spaces for ease of typing. How could we rename them to be lowercased and have underscores instead of spaces?
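One approach is to transform the column index with its .str string methods (the column names and value below are stand-ins for the real parks columns):

```python
import pandas as pd

# Stand-in with capitalized, space-separated names like the parks columns
df = pd.DataFrame({"Park Name": ["Acadia"], "Recreation Visits": [100]})

# Lowercase every column name and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(" ", "_")
print(df.columns.tolist())
```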
Filtering Data
One of the things that makes Pandas so powerful is that it allows us to filter data easily. We can filter data using boolean indexing, which allows us to select rows based on conditions, just like if statements in Python.
For example, if we want to filter the DataFrame to only include rows for visits to parks in Illinois, we can select the state column and keep only the rows where it has the value IL:
parks_data_df[parks_data_df['state'] == 'IL']
Now we’ll see that this gives us zero values because this dataset is only for “the current 63 National Parks administered by the United States National Park Service (NPS), from 1979 to 2023.”
How else could we have discovered this using Pandas?
What if we wanted to see how many rows there are for parks in each state?
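One quick way to tally rows per value is the value_counts() method on a Series (the toy data below stands in for the real CSV, where each park appears once per year):

```python
import pandas as pd

# Stand-in rows; in the real data each park appears once per year
df = pd.DataFrame({"state": ["CA", "CA", "AK", "UT"]})

# value_counts() tallies how many rows share each value, sorted descending
counts = df["state"].value_counts()
print(counts)
```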
Grouping Data
First, we select one or more columns to group our data by. For example, in our case we could group by state so that we could see how many parks exist in each state.
parks_data_df.groupby('state')
While this code works, we’ll see that it only returns a DataFrameGroupBy object, which is not very useful on its own. To see the actual data, we could use the get_group() method to see the data for a specific state.
parks_data_df.groupby('state').get_group('CA')
How could we use groupby to count the number of parks in each state?
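One hedged sketch, since each park appears in multiple rows (once per year): group by state and count distinct park names with nunique() rather than counting rows (the column names below assume the lowercased, underscored names from earlier):

```python
import pandas as pd

# Stand-in: one row per park per year, as in the parks dataset
df = pd.DataFrame({
    "state": ["CA", "CA", "CA", "UT"],
    "park_name": ["Yosemite", "Yosemite", "Sequoia", "Zion"],
})

# nunique() counts distinct parks per state; size() would count rows instead
parks_per_state = df.groupby("state")["park_name"].nunique()
print(parks_per_state)
```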
Because aggregations like this return a Series, we can convert the result back to a DataFrame using the reset_index() method.
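For example (with a small stand-in DataFrame):

```python
import pandas as pd

df = pd.DataFrame({"state": ["CA", "CA", "UT"], "visits": [1, 2, 3]})

counts = df.groupby("state").size()           # a Series indexed by state
counts_df = counts.reset_index(name="count")  # back to a two-column DataFrame
print(type(counts_df))
```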
Many methods in Pandas have an inplace argument, which allows us to modify the DataFrame directly rather than returning a new DataFrame. For example, the sort_values() method also has an inplace argument.
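A small sketch of the difference (toy data; sort_values() is the real pandas method, the values are invented):

```python
import pandas as pd

df = pd.DataFrame({"state": ["UT", "CA"], "visits": [3, 1]})

# With inplace=True the DataFrame is modified in place and nothing is returned
df.sort_values("visits", inplace=True)

# Without inplace, sort_values() returns a new, sorted copy instead
sorted_copy = df.sort_values("visits", ascending=False)
```

Note that the pandas documentation generally recommends assigning the returned copy over using inplace=True.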
What is the average number of visits for each state?
What is the average number of visits for each National Park?
How many National Parks are there in each state?
Quick Exercise: Answering Questions with Pandas
Let’s try and answer them together with what we have learned so far! In your is310-coding-assignments repository, create a new folder called pandas-eda and in that folder create a new Jupyter Notebook called NationalParksEDA.ipynb. Feel free to also move your IntroNotebooks.ipynb into that folder as well.
In the Notebook, you should start by adding a title in Markdown and a brief description of what you are doing. Then, you can import the necessary libraries and read in the dataset.
Then you should create a section for each question, where you can write the code to answer the question. You can use the methods we learned in this lesson, such as groupby(), mean(), and size(), to help you answer the questions.
To create a section, you can use the # symbol in Markdown to create a header. For example, you could use ## Average Number of Visits for Each State for the first question, which would create a second-level header.