#install.packages("usethis")
::use_course("https://github.com/r-journalism/nicar-2024-tidycensus/archive/master.zip")
usethis
# Run this in the console of RStudio
file.edit("01_tidycensus.R")
# Or click on the 01_tidycensus.R file in the RStudio file explorer tab
Tidycensus
The repo containing the data and scripts for this section is on Github. To install those files, run the lines of code below.
To follow along with this walkthrough, simply run the lines of code in the gray boxes in the R console. Be sure to run them in order. If you run into an error, it may be because you skipped running some preceding lines of code.
The American Community Survey from the Census
What is the ACS?
Annual survey of 3.5 million US households
Covers more specific topics not available in decennial US Census data (e.g. income, education, language, housing characteristics)
Available as 1-year estimates (for geographies of population 65,000 and greater) and 5-year estimates (for geographies down to the block group)
Data delivered as estimates characterized by margins of error
How to get ACS data
tidycensus
Load libraries
library(tidyverse)
library(tidycensus)
Census data issues I
- Groups, sub groups, sub sub groups, etc, are a pain
- Takes forever to tidy up
Census data issues II
- Transposing the data helps a bit but
- Still requires a lot of work to clean up
Tidycensus: Features
Wrangles Census data internally to return tidyverse-ready format (or traditional wide format if requested)
Automatically downloads and merges Census geometries to data for mapping
Includes tools for handling margins of error in the ACS and working with survey weights in the ACS PUMS
States and counties can be requested by name (no more looking up FIPS codes!)
Script out your process for re usability
R and RStudio
R: programming language and software environment for data analysis (and scraping and visualization and so much more)
RStudio: integrated development environment (IDE) for R developed by Posit
- Built on top of R
- Lets you view your data, write and save R (or Python) scripts or notebooks, and view graphical static and interactive outputs
RStudio tour
Running code in R
<-
assignment saves to the environment/memory#
hashes, commented out code- Copy and paste code into the console to run (without the hash)
- run code in the console at the bottom or
- in a script, highlight the code and click the ‘run’ button at the top right
- or put your cursor in the script on the line of code and hit ctrl+enter (or cmd + enter)
Getting started with tidycensus
To get started, install the packages and files for this class
If you are using an IRE laptop, these packages are already installed for you
install.packages(c("tidycensus", "tidyverse", "mapview", "usethis"))
::use_course("https://github.com/r-journalism/nicar-2024-tidycensus/archive/master.zip") usethis
Optional: your Census API key
tidycensus (and the Census API) can be used without an API key, but you will be limited to 500 queries per day
Power users: visit https://api.census.gov/data/key_signup.html to request a key, then activate the key from the link in your email.
Once activated, use the
census_api_key()
function to set your key as an environment variable
library(tidycensus)
census_api_key("YOUR KEY GOES HERE", install = TRUE)
Getting started with ACS data in tidycensus
open 01_tidycensus.R in RStudio
Using the get_acs()
function
The
get_acs()
function is your portal to access ACS data using tidycensusThe two required arguments are
geography
andvariables
. The function defaults to the latest 5-year ACS (Currently 2018-2022)
library(tidycensus)
<- get_acs(
median_income geography = "county",
variables = "B25077_001", # median household income
year = 2022
)
Getting data from the 2018-2022 5-year ACS
- ACS data are returned with five columns:
GEOID
,NAME
,variable
,estimate
, andmoe
median_income
# A tibble: 3,222 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01001 Autauga County, Alabama B25077_001 191800 7996
2 01003 Baldwin County, Alabama B25077_001 266000 6916
3 01005 Barbour County, Alabama B25077_001 102700 11171
4 01007 Bibb County, Alabama B25077_001 120100 13377
5 01009 Blount County, Alabama B25077_001 159800 6189
6 01011 Bullock County, Alabama B25077_001 87700 20560
7 01013 Butler County, Alabama B25077_001 94800 5984
8 01015 Calhoun County, Alabama B25077_001 140500 5181
9 01017 Chambers County, Alabama B25077_001 116900 9814
10 01019 Cherokee County, Alabama B25077_001 158700 8550
# ℹ 3,212 more rows
Exploring your data with RStudio
View(median_income)
Exporting your data
- You saved the output of the
get_acs()
function to the object median_income - Export that dataframe object to your computer so you can use it wherever you want
library(readr)
write_csv(median_income, "whatever_filename_you_want.csv", na="")
Take your data to Excel if you want
1-year ACS data
1-year ACS data are more current, but are only available for geographies of population 65,000 and greater
Access 1-year ACS data with the argument
survey = "acs1"
; defaults to"acs5"
median_value_1yr <- get_acs(
geography = "place",
variables = "B25077_001", # median value of homes
year = 2022,
survey = "acs1"
)
Getting data from the 2022 1-year ACS
The 1-year ACS provides data for geographies with populations of 65,000 and greater.
median_value_1yr
# A tibble: 646 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 0103076 Auburn city, Alabama B25077_001 335200 22622
2 0107000 Birmingham city, Alabama B25077_001 125500 14964
3 0121184 Dothan city, Alabama B25077_001 190800 8133
4 0135896 Hoover city, Alabama B25077_001 393400 19743
5 0137000 Huntsville city, Alabama B25077_001 294700 16881
6 0150000 Mobile city, Alabama B25077_001 178800 11552
7 0151000 Montgomery city, Alabama B25077_001 155200 10868
8 0177256 Tuscaloosa city, Alabama B25077_001 297600 30475
9 0203000 Anchorage municipality, Alaska B25077_001 367900 10111
10 0404720 Avondale city, Arizona B25077_001 400300 22495
# ℹ 636 more rows
Requesting tables of variables
- The
table
parameter can be used to obtain all related variables in a “table” at once
Getting data from the 2018-2022 5-year ACS
Loading ACS5 variables for 2022 from table B19001. To cache this dataset for faster access to ACS tables in the future, run this function with `cache_table = TRUE`. You only need to do this once per ACS dataset.
income_table
# A tibble: 54,774 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01001 Autauga County, Alabama B19001_001 22308 369
2 01001 Autauga County, Alabama B19001_002 990 265
3 01001 Autauga County, Alabama B19001_003 656 187
4 01001 Autauga County, Alabama B19001_004 1026 303
5 01001 Autauga County, Alabama B19001_005 1335 329
6 01001 Autauga County, Alabama B19001_006 741 205
7 01001 Autauga County, Alabama B19001_007 822 218
8 01001 Autauga County, Alabama B19001_008 840 270
9 01001 Autauga County, Alabama B19001_009 921 260
10 01001 Autauga County, Alabama B19001_010 962 279
# ℹ 54,764 more rows
Understanding geography and variables in tidycensus
US Census Geography
Geography in tidycensus
- Information on available geographies, and how to specify them, can be found in the tidycensus documentation
Geography | Definition | Available by | Available in |
---|---|---|---|
"us" |
United States | get_acs() , get_decennial() |
|
"region" |
Census region | get_acs() , get_decennial() |
|
"division" |
Census division | get_acs() , get_decennial() |
|
"state" |
State or equivalent | state | get_acs() , get_decennial() |
"county" |
County or equivalent | state, county | get_acs() , get_decennial() |
"county subdivision" |
County subdivision | state, county | get_acs() , get_decennial() |
"tract" |
Census tract | state, county | get_acs() , get_decennial() |
"block group" OR "cbg" |
Census block group | state, county | get_acs() , get_decennial() |
Querying by state
For geographies available below the state level, the
state
parameter allows you to query data for a specific stateFor smaller geographies (Census tracts, block groups), a
county
argument may also need to be includedtidycensus translates state names and postal abbreviations internally, so you don’t need to remember the FIPS codes!
Querying tract data requires county and state
- Example: data on median home value in San Diego County, California by Census tract
sd_value <- get_acs(
geography = "tract",
variables = "B25077_001",
state = "CA",
county = "San Diego",
year = 2022
)
Getting data from the 2018-2022 5-year ACS
sd_value
# A tibble: 737 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 06073000100 Census Tract 1; San Diego County; Calif… B25077_… 1633800 71171
2 06073000201 Census Tract 2.01; San Diego County; Ca… B25077_… 1331000 147432
3 06073000202 Census Tract 2.02; San Diego County; Ca… B25077_… 891100 97240
4 06073000301 Census Tract 3.01; San Diego County; Ca… B25077_… 957500 232555
5 06073000302 Census Tract 3.02; San Diego County; Ca… B25077_… 761700 108681
6 06073000400 Census Tract 4; San Diego County; Calif… B25077_… 799100 94490
7 06073000500 Census Tract 5; San Diego County; Calif… B25077_… 1025000 81768
8 06073000600 Census Tract 6; San Diego County; Calif… B25077_… 727700 92078
9 06073000700 Census Tract 7; San Diego County; Calif… B25077_… 736400 102788
10 06073000800 Census Tract 8; San Diego County; Calif… B25077_… 678400 119751
# ℹ 727 more rows
Searching for variables
To search for variables, use the
load_variables()
function along with a year and datasetThe
View()
function in RStudio allows for interactive browsing and filtering
<- load_variables(2022, "acs5") vars
View(vars)
Available ACS datasets in tidycensus
Detailed Tables
Data Profile (add
"/profile"
for variable lookup)Subject Tables (add
"/subject"
)Comparison Profile (add
"/cprofile"
)Supplemental Estimates (use
"acsse"
)Migration Flows (access with
get_flows()
)
Data structure in tidycensus
“Tidy” or long-form data
- The default data structure returned by tidycensus is “tidy” or long-form data, with variables by geography stacked by row
<- get_acs(
age_sex_table geography = "state",
table = "B01001",
year = 2022,
survey = "acs1",
)
Getting data from the 2022 1-year ACS
The 1-year ACS provides data for geographies with populations of 65,000 and greater.
Loading ACS1 variables for 2022 from table B01001. To cache this dataset for faster access to ACS tables in the future, run this function with `cache_table = TRUE`. You only need to do this once per ACS dataset.
age_sex_table
# A tibble: 2,548 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01 Alabama B01001_001 5074296 NA
2 01 Alabama B01001_002 2461248 6178
3 01 Alabama B01001_003 146169 3134
4 01 Alabama B01001_004 158767 6029
5 01 Alabama B01001_005 164578 5689
6 01 Alabama B01001_006 97834 3029
7 01 Alabama B01001_007 70450 2897
8 01 Alabama B01001_008 42597 4156
9 01 Alabama B01001_009 34623 3440
10 01 Alabama B01001_010 97373 4627
# ℹ 2,538 more rows
“Wide” data
- The argument
output = "wide"
spreads Census variables across the columns, returning one row per geographic unit and one column per variable
age_sex_table_wide <- get_acs(
geography = "state",
table = "B01001",
year = 2022,
survey = "acs1",
output = "wide"
)
Getting data from the 2022 1-year ACS
The 1-year ACS provides data for geographies with populations of 65,000 and greater.
Loading ACS1 variables for 2022 from table B01001. To cache this dataset for faster access to ACS tables in the future, run this function with `cache_table = TRUE`. You only need to do this once per ACS dataset.
age_sex_table_wide
# A tibble: 52 × 100
GEOID NAME B01001_001E B01001_001M B01001_002E B01001_002M B01001_003E
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 01 Alabama 5074296 NA 2461248 6178 146169
2 02 Alaska 733583 NA 385667 2351 23043
3 04 Arizona 7359197 NA 3678381 2695 201423
4 05 Arkansas 3045637 NA 1504488 4216 90239
5 06 California 39029342 NA 19536425 6410 1081904
6 08 Colorado 5839926 NA 2960896 4278 154565
7 09 Connecticut 3626205 NA 1776689 2237 91513
8 10 Delaware 1018396 NA 494657 1092 27456
9 11 District o… 671803 NA 319763 733 20038
10 12 Florida 22244823 NA 10953468 6169 563703
# ℹ 42 more rows
# ℹ 93 more variables: B01001_003M <dbl>, B01001_004E <dbl>, B01001_004M <dbl>,
# B01001_005E <dbl>, B01001_005M <dbl>, B01001_006E <dbl>, B01001_006M <dbl>,
# B01001_007E <dbl>, B01001_007M <dbl>, B01001_008E <dbl>, B01001_008M <dbl>,
# B01001_009E <dbl>, B01001_009M <dbl>, B01001_010E <dbl>, B01001_010M <dbl>,
# B01001_011E <dbl>, B01001_011M <dbl>, B01001_012E <dbl>, B01001_012M <dbl>,
# B01001_013E <dbl>, B01001_013M <dbl>, B01001_014E <dbl>, …
Using named vectors of variables
Census variables can be hard to remember; using a named vector to request variables will replace the Census IDs with a custom input
In long form, these custom inputs will populate the
variable
column; in wide form, they will replace the column names
Renaming variables easily
ca_education <- get_acs(
geography = "county",
state = "CA",
variables = c(percent_high_school = "DP02_0062P",
percent_bachelors = "DP02_0065P",
percent_graduate = "DP02_0066P"),
year = 2021
)
Getting data from the 2017-2021 5-year ACS
Using the ACS Data Profile
ca_education
# A tibble: 174 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 06001 Alameda County, California percent_high_school 16.7 0.4
2 06001 Alameda County, California percent_bachelors 28.3 0.3
3 06001 Alameda County, California percent_graduate 21.3 0.3
4 06003 Alpine County, California percent_high_school 25.7 7.5
5 06003 Alpine County, California percent_bachelors 20.6 7.5
6 06003 Alpine County, California percent_graduate 18.7 8.5
7 06005 Amador County, California percent_high_school 30.7 2.2
8 06005 Amador County, California percent_bachelors 13.6 1.8
9 06005 Amador County, California percent_graduate 5.9 1.1
10 06007 Butte County, California percent_high_school 22.3 0.9
# ℹ 164 more rows
ACS data warnings
Understanding limitations of the 1-year ACS
The 1-year American Community Survey is only available for geographies with population 65,000 and greater. This means:
Only 848 of 3,221 counties are available
Only 646 of 31,908 cities / Census-designated places are available
No data for Census tracts, block groups, ZCTAs, or any other geographies that typically have populations below 65,000
Data sparsity and margins of error
You may encounter data issues in the 1-year ACS data that are less pronounced in the 5-year ACS. For example:
Values available in the 5-year ACS may not be available in the corresponding 1-year ACS tables
If available, they will likely have larger margins of error
Your job as an data journalist: balance need for certainty vs. need for recency in estimates
Tagalog speakers by state (1-year ACS)
get_acs(
geography = "state",
variables = "B16001_099",
year = 2022,
survey = "acs1"
)
Getting data from the 2022 1-year ACS
The 1-year ACS provides data for geographies with populations of 65,000 and greater.
# A tibble: 52 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01 Alabama B16001_099 5222 1487
2 02 Alaska B16001_099 NA NA
3 04 Arizona B16001_099 28522 3746
4 05 Arkansas B16001_099 NA NA
5 06 California B16001_099 760215 22953
6 08 Colorado B16001_099 9417 2178
7 09 Connecticut B16001_099 8568 2657
8 10 Delaware B16001_099 1112 601
9 11 District of Columbia B16001_099 NA NA
10 12 Florida B16001_099 80209 7025
# ℹ 42 more rows
Tagalog speakers by state (5-year ACS)
get_acs(
geography = "state",
variables = "B16001_099",
year = 2022,
survey = "acs5"
)
Getting data from the 2018-2022 5-year ACS
# A tibble: 52 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01 Alabama B16001_099 3854 553
2 02 Alaska B16001_099 18520 1414
3 04 Arizona B16001_099 25913 1774
4 05 Arkansas B16001_099 3154 515
5 06 California B16001_099 772833 10558
6 08 Colorado B16001_099 8724 834
7 09 Connecticut B16001_099 8353 969
8 10 Delaware B16001_099 2837 668
9 11 District of Columbia B16001_099 1250 306
10 12 Florida B16001_099 70430 2877
# ℹ 42 more rows
Other warnings
Variables in the Data Profile and Subject Tables can change names over time
The 2022 ACS is the first to include the new Connecticut Planning Regions in the “county” geography
The 2020 Decennial US Census data and R
What is the decennial US Census?
Complete count of the US population mandated by Article 1, Sections 2 and 9 in the US Constitution
Directed by the US Census Bureau (US Department of Commerce); conducted every 10 years since 1790
Used for proportional representation / congressional redistricting
Limited set of questions asked about race, ethnicity, age, sex, and housing tenure
2020 US Census datasets
- The PL 94-171 Redistricting Data
- The Demographic and Housing Characteristics (DHC) file
- The Demographic Profile (for pre-tabulated variables)
- Tabulations for the 118th Congress & for Island Areas
- The Detailed DHC-A file (with very detailed racial & ethnic categories)
2020 US Census in Tidycensus
The
get_decennial()
function is used to acquire data from the decennial US CensusThe two required arguments are
geography
andvariables
for the functions to work; for 2020 Census data, useyear = 2020
.
<- get_decennial(
pop20 geography = "state",
variables = "P1_001N",
year = 2020
)
Getting data from the 2020 decennial Census
Using the PL 94-171 Redistricting Data Summary File
Note: 2020 decennial Census data use differential privacy, a technique that
introduces errors into data to preserve respondent confidentiality.
ℹ Small counts should be interpreted with caution.
ℹ See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.
This message is displayed once per session.
- Decennial Census data are returned with four columns: GEOID, NAME, variable, and value
pop20
# A tibble: 52 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 42 Pennsylvania P1_001N 13002700
2 06 California P1_001N 39538223
3 54 West Virginia P1_001N 1793716
4 49 Utah P1_001N 3271616
5 36 New York P1_001N 20201249
6 11 District of Columbia P1_001N 689545
7 02 Alaska P1_001N 733391
8 12 Florida P1_001N 21538187
9 45 South Carolina P1_001N 5118425
10 38 North Dakota P1_001N 779094
# ℹ 42 more rows
Scavenger hunt
Can you look through the vars
table you loaded earlier and import the table that can answer this?
- How many 18 to 24 year old Korean people are there in the US (2021)?
- What percent of females in 2017 were below poverty level in the US (5 year)?
<- load_variables(2022, "acs5")
vars
get_acs(replace_this_with_the_right_arguments)
_How do you find the “right” variables or Census table ID? I do a couple things: Use CensusReporter.org or I ask the oldest data reporter in the newsroom.