This R package provides a function to easily build panel data from PSID raw data.
Warning: the wealth-supplement setup has changed on the PSID system. Wealth variables are now part of the family files for waves 1999 onwards. The wealth=TRUE option has therefore been removed from the package. See this issue for more details.
The package is on CRAN, so just type
install.packages('psidR')
Alternatively, to get the up-to-date version from this repository,
install.packages('devtools')
devtools::install_github("floswald/psidR")
The Panel Study of Income Dynamics is a publicly available dataset.
- you can use the data center to build simple datasets
- not workable for larger datasets
- some variables don't show up (although you know they exist)
- the ftp interface gets slower the more periods you are looking at
- the click and scroll exercise of selecting the right variables in each period is extremely error prone.
- merging the data manually is non-trivial.
This package attempts to help with the task of building a panel dataset. The user downloads ASCII data directly from the PSID server into R, without the need for any other software like Stata or SAS. To build the panel, the user must then specify the variable names in each wave of the questionnaire in a data.frame fam.vars, as well as the variables from the individual index in ind.vars. The helper function getNamesPSID helps find the corresponding variable names across waves; see the examples below.
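The idea of a cross-wave lookup can be sketched with a toy table in base R. This is not how getNamesPSID is implemented: the table layout, the variable names (A2001, B2003, ...) and the helper toy_lookup are all invented for illustration.

```r
# Toy crosswalk: one row per concept, one column per wave.
# Layout and variable names are invented for illustration only.
toy_cwf <- data.frame(
  concept = c("age of head", "family income"),
  Y2001   = c("A2001", "B2001"),
  Y2003   = c("A2003", "B2003"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: find the row that contains a known variable name,
# then return the names on that row for the requested years.
toy_lookup <- function(name, cw, years) {
  year_cols <- grep("^Y[0-9]{4}$", names(cw), value = TRUE)
  hit <- which(apply(cw[year_cols] == name, 1, any, na.rm = TRUE))
  stopifnot(length(hit) == 1L)
  wanted <- paste0("Y", years)
  data.frame(
    year     = years,
    variable = unlist(cw[hit, wanted], use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

res <- toy_lookup("A2001", toy_cwf, years = c(2001, 2003))
print(res)
```

The result has one row per requested year, which is the long shape that can later be reshaped into the fam.vars or ind.vars format.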
- You must supply at least one data.frame with variables to read from the family file. Most of the time you will also supply a data.frame with variables from the individual files to read.
- Those data.frames must be in the following format: column year is an integer and indicates the calendar year; the other columns are the variable names which will appear in your panel.
> head(i) # individual file example
year age educ empstat weight
1: 1968 ER30004 ER30010 <NA> ER30019
2: 1969 ER30023 <NA> <NA> ER30042 # NOTICE THE NA for educ HERE!!
3: 1970 ER30046 ER30052 <NA> ER30066
4: 1971 ER30070 ER30076 <NA> ER30090
5: 1972 ER30094 ER30100 <NA> ER30116
6: 1973 ER30120 ER30126 <NA> ER30137
> head(f) # family file example
year age_youngest_child debt empstat_ faminc hours hvalue ...
1: 1968 V120 <NA> V196 V81 V47 V5 ...
2: 1969 V1013 <NA> V639 V529 V465 V449 ...
3: 1970 V1243 <NA> V1278 V1514 V1138 V1122 ...
4: 1971 V1946 <NA> V1983 V2226 V1839 V1823 ...
5: 1972 V2546 <NA> V2581 V2852 V2439 V2423 ...
6: 1973 V3099 <NA> V3114 V3256 V3027 V3021 ...
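If you prefer to write such a table by hand for a few waves, a minimal sketch in base R could look as follows. The variable names are copied from the first two rows of the family file example above; the object name famvars is only a placeholder.

```r
# Family-file variable names for two waves, copied from the example table.
# `debt` is NA in both rows of that table, so it is NA here as well.
famvars <- data.frame(
  year   = c(1968L, 1969L),                  # integer calendar year
  faminc = c("V81", "V529"),
  hours  = c("V47", "V465"),
  debt   = c(NA_character_, NA_character_),  # no variable in these waves
  stringsAsFactors = FALSE
)

str(famvars)
```

Using NA_character_ rather than a bare NA keeps the column a character column even when every entry is missing.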
Example usage:
> library(psidR)
> build.psid(datadir = "~/data/PSID", small = TRUE) # directory `datadir` must exist!
INFO [2021-07-13 10:34:26] Will download missing datasets now
INFO [2021-07-13 10:34:26] will download family files: 2013, 2015
INFO [2021-07-13 10:34:26] will download latest individual index: IND2019ER
This can take several hours/days to download.
want to go ahead? give me 'yes' or 'no'.yes
please enter your PSID username: *****
please enter your PSID password: *****
INFO [2021-07-13 10:34:41] downloading file ~/data/PSID/FAM2013ER
INFO [2021-07-13 10:34:56] now reading and processing SAS file ~/data/PSID/FAM2013ER into R
INFO [2021-07-13 10:40:06] downloading file ~/data/PSID/FAM2015ER
INFO [2021-07-13 10:40:22] now reading and processing SAS file ~/data/PSID/FAM2015ER into R
INFO [2021-07-13 10:45:34] downloading file ~/data/PSID/IND2019ER
INFO [2021-07-13 10:46:39] now reading and processing SAS file ~/data/PSID/IND2019ER into R
INFO [2021-07-13 11:15:04] finished downloading files to ~/data/PSID/
INFO [2021-07-13 11:15:04] continuing now to build the dataset
INFO [2021-07-13 11:15:04] psidR: Loading Family data from .rda files
INFO [2021-07-13 11:15:12] psidR: loaded individual file: ~/data/PSID/IND2019ER.rda
INFO [2021-07-13 11:15:12] psidR: total memory load in MB: 1538
INFO [2021-07-13 11:15:12] psidR: currently working on data for year 2013
INFO [2021-07-13 11:15:12] full 2013 sample has 82573 obs
INFO [2021-07-13 11:15:12] you selected 34856 obs belonging to SRC
INFO [2021-07-13 11:15:12] dropping non-heads leaves 5450 obs
INFO [2021-07-13 11:15:14] psidR: currently working on data for year 2015
INFO [2021-07-13 11:15:14] full 2015 sample has 82573 obs
INFO [2021-07-13 11:15:14] you selected 34856 obs belonging to SRC
INFO [2021-07-13 11:15:14] dropping non-heads leaves 5318 obs
INFO [2021-07-13 11:15:16] End of build.panel
We first present a real-world example that builds a full 1968-2017 panel. Then we show some tests.
- You want a data.table with the following columns: PID, year, income, wage, age, educ and some more variables.
- You went to the PSID variable search to look up the relevant variable names in each year in either the individual-level or family-level datasets.
- You created a list of those variables as I did in inst/psid-lists of this package.
- You noted that there is NO EDUCATION variable in the individual index file in 1968 and 1969.
  - Instead of the variable name for EDUC in 1968 and 1969 you want to put NA.
- You noted that there is NO HOURLY WAGE variable in the family index file in 1993.
  - Instead of the variable name for HOURLY WAGE in 1993 you want to put NA.
# Build panel with income, wage, age, education and several other variables
# [this is the body of the function build.psid()]
library(psidR)
library(data.table)
r = system.file(package="psidR")
f = fread(file.path(r,"psid-lists","famvars.txt"))
i = fread(file.path(r,"psid-lists","indvars.txt"))
> i
dataset year variable label name
1: PSID Individual Data by Years 1968 ER30019 INDIVIDUAL WEIGHT 68 weight
2: PSID Individual Data by Years 1969 ER30042 INDIVIDUAL WEIGHT 69 weight
3: PSID Individual Data by Years 1970 ER30066 INDIVIDUAL WEIGHT 70 weight
4: PSID Individual Data by Years 1971 ER30090 INDIVIDUAL WEIGHT 71 weight
5: PSID Individual Data by Years 1972 ER30116 INDIVIDUAL WEIGHT 72 weight
---
143: PSID Individual Data Index 2009 ER34020 HIGHEST GRADE FINISHED educ
144: PSID Individual Data Index 2011 ER34119 HIGHEST GRADE FINISHED educ
145: PSID Individual Data Index 2013 ER34230 HIGHEST GRADE FINISHED educ
146: PSID Individual Data Index 2015 ER34349 HIGHEST GRADE FINISHED educ
147: PSID Individual Data Index 2017 ER34548 HIGHEST GRADE FINISHED educ
> f
dataset year variable label name
1: PSID Main Family Data 1968 V47 HD ANN HRS WORKED LAST YR hours
2: PSID Main Family Data 1969 V465 HD ANN HRS WORKED LAST YR hours
3: PSID Main Family Data 1970 V1138 HD ANN HRS WORKED LAST YR hours
4: PSID Main Family Data 1971 V1839 HD ANN HRS WORKED LAST YR hours
5: PSID Main Family Data 1972 V2439 HD ANN HRS WORKED LAST YR hours
---
609: PSID Family-level 2009 ER42139 A52 LIKELIHOOD OF MOVING likelihood_move
610: PSID Family-level 2011 ER47447 A52 LIKELIHOOD OF MOVING likelihood_move
611: PSID Family-level 2013 ER53147 A52 LIKELIHOOD OF MOVING likelihood_move
612: PSID Family-level 2015 ER60162 A52 LIKELIHOOD OF MOVING likelihood_move
613: PSID Family-level 2017 ER66163 A52 LIKELIHOOD OF MOVING likelihood_move
# alternatively, use `getNamesPSID`:
# cwf <- openxlsx::read.xlsx("http://psidonline.isr.umich.edu/help/xyr/psid.xlsx")
# Suppose you know the name of the variable in a certain year, and it is
# "ER17013". Then get the corresponding name in another year with
# getNamesPSID("ER17013", cwf, years = 2001) # 2001 only
# getNamesPSID("ER17013", cwf, years = 2003) # 2003
# getNamesPSID("ER17013", cwf, years = NULL) # all years
# getNamesPSID("ER17013", cwf, years = c(2005, 2007, 2009)) # some years
# next, bring into required shape:
i = dcast(i[,list(year,name,variable)],year~name, value.var = "variable")
f = dcast(f[,list(year,name,variable)],year~name, value.var = "variable")
> head(i)
year age educ empstat weight
1: 1968 ER30004 ER30010 <NA> ER30019
2: 1969 ER30023 <NA> <NA> ER30042 # NOTICE THE NA for educ HERE!!
3: 1970 ER30046 ER30052 <NA> ER30066
4: 1971 ER30070 ER30076 <NA> ER30090
5: 1972 ER30094 ER30100 <NA> ER30116
6: 1973 ER30120 ER30126 <NA> ER30137
> head(f)
year age_youngest_child debt empstat_ faminc hours hvalue ...
1: 1968 V120 <NA> V196 V81 V47 V5 ...
2: 1969 V1013 <NA> V639 V529 V465 V449 ...
3: 1970 V1243 <NA> V1278 V1514 V1138 V1122 ...
4: 1971 V1946 <NA> V1983 V2226 V1839 V1823 ...
5: 1972 V2546 <NA> V2581 V2852 V2439 V2423 ...
6: 1973 V3099 <NA> V3114 V3256 V3027 V3021 ...
# call the builder function
d = build.panel(datadir=datadr,fam.vars=f,ind.vars=i, heads.only = TRUE,sample="SRC",design="all")
# d contains your panel
saveRDS(d, file="~/psid.Rds")  # read back later with readRDS("~/psid.Rds")
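The reshaping step in the example above (from one row per year and name to one row per year) can be tried in isolation on a toy table. This is a small sketch that assumes the data.table package is installed; the three rows are taken from the individual file listing above.

```r
library(data.table)

# Long format: one row per (year, name) pair, as in the variable lists.
long <- data.table(
  year     = c(1968L, 1968L, 1969L),
  name     = c("age", "educ", "age"),
  variable = c("ER30004", "ER30010", "ER30023")
)

# Wide format: one row per year, one column per name.
# (year, name) combinations that are absent from `long` become NA.
wide <- dcast(long, year ~ name, value.var = "variable")
print(wide)
```

Because there is no educ row for 1969 in the long table, the wide table gets NA in that cell, which is the format build.panel expects for a variable that is missing in a given wave.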
Here are some tests:
# one year test, no ind file
# call function `small.test.noind()`
# get var names from cross walk
cwf = openxlsx::read.xlsx(system.file(package="psidR","psid-lists","psid.xlsx"))
head_age_var_name <- getNamesPSID("ER17013", cwf, years=c(2003))
# create family vars data.frame
famvars = data.frame(year=c(2003),variable=head_age_var_name$variable)
# call function
build.panel(fam.vars=famvars,datadir=dd)
# one year test, ind file
# call function `small.test.ind()`
cwf = openxlsx::read.xlsx(system.file(package="psidR","psid-lists","psid.xlsx"))
head_age_var_name <- getNamesPSID("ER17013", cwf, years=c(2003))
educ = getNamesPSID("ER30323",cwf,years=2003)
famvars = data.frame(year=c(2003),variable=head_age_var_name$variable)
indvars = data.frame(year=c(2003),variable=educ$variable)
build.panel(fam.vars=famvars,ind.vars=indvars,datadir=dd)
# three year test, ind file
# call function `medium.test.ind()`
cwf = openxlsx::read.xlsx(system.file(package="psidR","psid-lists","psid.xlsx"))
head_age_var_name <- getNamesPSID("ER17013", cwf, years=c(2003,2005,2007))
educ = getNamesPSID("ER30323",cwf,years=c(2003,2005,2007))
famvars = data.frame(year=c(2003,2005,2007),variable=head_age_var_name$variable)
indvars = data.frame(year=c(2003,2005,2007),variable=educ$variable)
build.panel(fam.vars=famvars,ind.vars=indvars,datadir=dd)
# etc for
medium.test.noind()
# example output:
INFO [2018-10-10 10:58:23] Will download missing datasets now
INFO [2018-10-10 10:58:23] will download family files: 2003, 2005, 2007
This can take several hours/days to download.
want to go ahead? give me 'yes' or 'no'.yes
please enter your PSID username: *******
please enter your PSID password: *******
INFO [2018-10-10 10:58:46] downloading file ~/psid/FAM2003ER
INFO [2018-10-10 10:58:50] now reading and processing SAS file ~/psid/FAM2003ER into R
INFO [2018-10-10 11:07:02] downloading file ~/psid/FAM2005ER
INFO [2018-10-10 11:07:05] now reading and processing SAS file ~/psid/FAM2005ER into R
INFO [2018-10-10 11:14:44] downloading file ~/psid/FAM2007ER
INFO [2018-10-10 11:14:48] now reading and processing SAS file ~/psid/FAM2007ER into R
INFO [2018-10-10 11:28:25] finished downloading files to ~/psid/
INFO [2018-10-10 11:28:25] continuing now to build the dataset
INFO [2018-10-10 11:28:25] psidR: Loading Family data from .rda files
INFO [2018-10-10 11:28:34] psidR: loaded individual file: ~/psid/IND2015ER.rda
INFO [2018-10-10 11:28:34] psidR: total memory load in MB: 1252
INFO [2018-10-10 11:28:34]
INFO [2018-10-10 11:28:34] psidR: currently working on data for year 2003
INFO [2018-10-10 11:28:36]
INFO [2018-10-10 11:28:36] psidR: currently working on data for year 2005
INFO [2018-10-10 11:28:37]
INFO [2018-10-10 11:28:37] psidR: currently working on data for year 2007
INFO [2018-10-10 11:28:39] balanced design reduces sample from 97377 to 89571
INFO [2018-10-10 11:28:39] End of build.panel
> x
age interview ID1968 pernum sequence relation.head pid year
1: 92 1 848 2 1 10 848002 2003
2: 64 2 1173 1 1 10 1173001 2003
3: 48 3 1866 32 2 30 1866032 2003
4: 48 3 1866 171 1 10 1866171 2003
5: 48 3 1866 175 0 0 1866175 2003
---
89567: 49 8332 6069 4 2 20 6069004 2007
89568: 49 8332 6069 30 0 0 6069030 2007
89569: 49 8332 6069 171 3 33 6069171 2007
89570: 49 8332 6069 173 1 10 6069173 2007
89571: 49 8332 6069 174 0 0 6069174 2007
# etc for
medium.test.ind.NA()
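The log above reports that a balanced design reduces the sample. As a rough illustration of what balancing means, the sketch below assumes it keeps only individuals who are observed in every wave. It is written in base R on made-up data and is not the code psidR runs internally.

```r
# Toy panel: pid 2 is missing the 2007 wave (all values are made up).
panel <- data.frame(
  pid  = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L),
  year = c(2003L, 2005L, 2007L, 2003L, 2005L, 2003L, 2005L, 2007L)
)

# Count observations per individual and compare with the number of waves.
n_waves     <- length(unique(panel$year))
obs_per_pid <- table(panel$pid)
keep        <- as.integer(names(obs_per_pid)[obs_per_pid == n_waves])

# Keep only individuals present in all waves.
balanced <- panel[panel$pid %in% keep, ]
print(balanced)
```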
The main function in the package is build.panel, and it has a reproducible example which you can look at by typing
require(psidR)
example(build.panel)
The PSID has a wealth of add-on datasets. Once you have a panel, those are easy to merge on. The panel will have a variable interview, which is the identifier in the supplemental dataset.
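A merge of this kind could look like the following sketch in base R. Both tables are made up: supp2003 and its column supp_value stand in for whatever supplement you have downloaded, and the example assumes the supplement covers a single wave.

```r
# Toy one-wave panel with the `interview` identifier (values are made up).
panel <- data.frame(
  pid       = c(1001L, 1002L, 1003L),
  year      = c(2003L, 2003L, 2003L),
  interview = c(1L, 2L, 3L),
  age       = c(45L, 60L, 33L)
)

# Hypothetical 2003 supplement keyed by the same interview number.
supp2003 <- data.frame(
  interview  = c(1L, 3L),
  supp_value = c(10.5, 7.25)
)

# Left join: keep every panel row; rows without a match get NA.
merged <- merge(panel, supp2003, by = "interview", all.x = TRUE)
print(merged)
```

With several waves in the panel you would restrict the panel to the supplement's year first, or carry a year column in the supplement and merge on both interview and year.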
If you use psidR in your work, please consider citing it. You could just do
> citation(package="psidR")
To cite the 'psidR' package in publications use:
Florian Oswald (2021). psidR: Build Panel Data Sets from PSID Raw Data. R package version 2.1.
A BibTeX entry for LaTeX users is
@Manual{,
title = {psidR: Build Panel Data Sets from PSID Raw Data},
author = {Florian Oswald},
year = {2021},
note = {R package version 2.1},
url = {https://github.com/floswald/psidR},
}
Thanks!