查看原文
其他

哈佛教授因果推断经典之作推荐!通过数据,代码和示例手把手教你!

凡是搞计量经济的,都关注这个号了

箱:econometrics666@126.com

所有计量经济圈方法论丛的code程序, 宏微观数据库和各种软件都放在社群里.欢迎到计量经济圈社群交流访问.

关于因果推断书籍,参看:1.一本最新因果推断书籍, 包括了机器学习因果推断方法, 学习主流和前沿方法,2.社会经济政策的评估计量经济学, 提供书籍和数据和程序文件,3.诺奖得主Angrist的因果推断课程文献读物单子再次更新了, 还提供了其他三门课程,4.全面且前沿的因果推断课程, 提供视频, 课件, 书籍和经典文献,5.从网页上直接复制代码的因果推断书籍出现了, 学会主流方法成效极快,6.推荐书籍"用R软件做应用因果分析", 有需要的学者可以自行下载!7.哪本因果推断书籍最好?我们给你整理好了这个书单!8.“不一样”的因果推断书籍, 很多观点让我们能恍然大悟, 涵盖了不少其他书里没有的因果推断方法!9.搞懂因果推断中内生性问题解决方法必读的书籍和文献已搜集好!10.一位“诗人”教授写了本因果推断书籍, 现在可以直接下载PDF参看!11.使用R软件学习计量经济学方法三本书籍推荐,12.机器学习与Econometrics的书籍推荐, 值得拥有的经典,13.史上最全的因果识别经典前沿书籍, 仅此一份,14.用R语言做Econometrics的书籍推荐, 值得拥有的经典,15.Stata学习的书籍和材料大放送, 以火力全开的势头,16.USA经管商博士最狂热崇拜的计量书籍震撼出炉,17.推荐使用Python语言做因果推断前沿方法的书籍

计量经济圈公众号搜索功能及操作流程演示

正文


今天的主题,是引荐一本由哈佛大学教授新修订完成的因果推断经典大作“Causal Inference: What If”。除了这本之外,还有图灵奖得主Judea Pearl新书《The book of Why: the new science of cause and effect》也值得一读。

《因果推段》是一本书公认的自命不凡的书名。因果推段是一项复杂的科学任务,它需要研究者从多个来源挖掘有效证据,并依赖于各种方法论的恰当应用。任何一本书都不可能全面地描述跨学科因果推断的方法论。任何因果推断书的作者都必须选择他们想要强调的因果推断方法论方面。


本导言的标题反映了我们自己的选择:这本书帮助科学家——特别是健康和社会科学家——生成和分析数据,从而对因果问题和数据分析的假设做出明确的因果推论。不幸的是,科学文献中充斥着一些研究,其中没有明确说明因果问题,也没有列出研究人员无法证实的假设。这种对因果推断的随意态度,产生了大量结论混乱的研究文章。例如,由于数据分析方法不能在研究者的假设下恰当地回答因果问题,因此发现效应估计难以解释的研究并不少见。


在这本书中,我们强调需要认真对待因果问题,因此足以清楚地表达它,并描述数据和假设在因果推断中的不同作用。一旦掌握了这些基础知识,因果推断就变得不那么随意了,这有助于防止学者陷入混淆。这本书描述了各种数据分析方法,而这些方法可用于在一组特定假设下估计让我们感兴趣的因果效应。这本书的一个关键信息是,因果推断不能简化为数据分析方法的集合。


这本书分为三部分,难度越来越大:第一部分是关于没有模型的因果推断(即因果效应的非参数识别),第二部分是关于有模型的因果推断(即用参数模型估计因果效应),第三部分是关于复杂纵向数据(longitudinal data)的因果推断(即,估计时变处理的因果效应)。在整篇文章中,我们穿插了一些细枝末节和技术要点,详细阐述了正文中提到的那些主题。细致讲解部分是为所有读者设计的,而技术部分是为受过中级统计培训的读者设计的。这本书提供了一个全面的关于因果推断的概念和方法,尽管这些方法目前都还分散存在于几个学科的期刊中。我们期望这本书会引起那些对因果推断感兴趣的人的兴趣,例如流行病学家、统计学家、心理学家、经济学家、社会学家、政治科学家、计算机科学家。当然,作为因果推断研究小组的必读书籍,希望各位学者也能够下载这本书籍参阅。


重要的是,这不是一本哲学书。我们对形而上学的概念,如因果关系和原因,仍然持不可知论的态度。相反,我们侧重于确定和估计人群中的因果效应,即测量不同干预下结果分布变化的数值。例如,我们讨论在严重心力衰竭患者中,若他们接受了心脏移植,或他们没有接受心脏移植,如何估计两种情况下的死亡风险。我们的主要目标是帮助决策者做出更好的决策——可操作的因果推理。

作者还附上了书中的数据及R,Stata, Python等运行程序。

这里只展示一下Stata的一章code,R, SAS, Python可以自行下载。

/***************************************************************
PROGRAM 17.1
生存曲线的非参数估计
Section 17.1
***************************************************************/
clear
use "nhefs.dta"

/*Some preprocessing of the data*/
gen survtime = .
replace survtime = 120 if death == 0
replace survtime = (yrdth - 83)*12 + modth if death ==1
* yrdth ranges from 83 to 92*

tab death qsmk

/*Kaplan-Meier graph of observed survival over time, by quitting smoking*/
*For now, we use the stset function in Stata*
stset survtime, failure(death=1)
sts graph, by(qsmk) xlabel(0(12)120)


/***************************************************************
PROGRAM 17.2
通过风险模型对生存曲线进行参数估计
Section 17.1
Generates Figure 17.4
***************************************************************/

/**Create person-month dataset for survival analyses**/

/*We want our new dataset to include 1 observation per person per month alive, starting at time = 0*/
*Individuals who survive to the end of follow-up will have 119 time points*
*Individuals who die will have survtime - 1 time points*
clear
use "nhefs.dta"

gen survtime = .
replace survtime = 120 if death == 0
replace survtime = (yrdth - 83)*12 + modth if death ==1

*expand data to person-time*
gen time = 0
expand survtime if time == 0
bysort seqn: replace time = _n - 1

*Create event variable*
gen event = 0
replace event = 1 if time == survtime - 1 & death == 1
tab event

*Create time-squared variable for analyses*
gen timesq = time*time

*Save the dataset to your working directory for future use*
save nhefs_surv, replace

/**Hazard ratios**/
clear

use "nhefs_surv.dta"

*Fit a pooled logistic hazards model *
logistic event qsmk qsmk#c.time qsmk#c.time#c.time c.time c.time#c.time


/**Survival curves: run regression then do:**/

*Create a dataset with all time points under each treatment level*
*Re-expand data with rows for all timepoints*
drop if time != 0
expand 120 if time ==0
bysort seqn: replace time = _n - 1

*Create 2 copies of each subject, and set outcome to missing and treatment -- use only the newobs*
expand 2 , generate(interv)
replace qsmk = interv

*Generate predicted event and survival probabilities for each person each month in copies*
predict pevent_k, pr
gen psurv_k = 1-pevent_k
keep seqn time qsmk interv psurv_k

*Within copies, generate predicted survival over time*
*Remember, survival is the product of conditional survival probabilities in each interval*
sort seqn interv time
gen _t = time + 1
gen psurv = psurv_k if _t ==1
bysort seqn interv: replace psurv = psurv_k*psurv[_t-1] if _t >1

*Display 10-year standardized survival, under interventions*
*Note: since time starts at 0, month 119 is 10-year survival*
by interv, sort: summarize psurv if time == 119

*Graph of standardized survival over time, under interventions*
*Note, we want our graph to start at 100% survival, so add an extra time point with P(surv) = 1*
expand 2 if time ==0, generate(newtime)
replace psurv = 1 if newtime == 1
gen time2 = 0 if newtime ==1
replace time2 = time + 1 if newtime == 0
*Separate the survival probabilities to allow plotting by intervention on qsmk*
separate psurv, by(interv)
*Plot the curves*
twoway (line psurv0 time2, sort) (line psurv1 time2, sort) if interv > -1, ylabel(0.5(0.1)1.0) xlabel(0(12)120) ytitle("Survival probability") xtitle("Months of follow-up") legend(label(1 "A=0") label(2 "A=1"))


/***************************************************************
PROGRAM 17.3
Estimation of survival curves via IP weighted hazards model
Data from NHEFS
Section 17.4
Generates Figure 17.6
***************************************************************/

clear
use "nhefs_surv.dta"

keep seqn event qsmk time sex race age education smokeintensity smkintensity82_71 smokeyrs exercise active wt71
preserve

*Estimate weights*
logit qsmk sex race c.age##c.age ib(last).education c.smokeintensity##c.smokeintensity ///
c.smokeyrs##c.smokeyrs ib(last).exercise ib(last).active c.wt71##c.wt71 if time == 0
predict p_qsmk, pr
logit qsmk if time ==0
predict num, pr
gen sw=num/p_qsmk if qsmk==1
replace sw=(1-num)/(1-p_qsmk) if qsmk==0
summarize sw

*IP weighted survival by smoking cessation*
logit event qsmk qsmk#c.time qsmk#c.time#c.time c.time c.time#c.time [pweight=sw] , cluster(seqn)

*Create a dataset with all time points under each treatment level*
*Re-expand data with rows for all timepoints*
drop if time != 0
expand 120 if time ==0
bysort seqn: replace time = _n - 1

*Create 2 copies of each subject, and set outcome to missing and treatment -- use only the newobs*
expand 2 , generate(interv)
replace qsmk = interv

*Generate predicted event and survival probabilities for each person each month in copies*
predict pevent_k, pr
gen psurv_k = 1-pevent_k
keep seqn time qsmk interv psurv_k

*Within copies, generate predicted survival over time*
*Remember, survival is the product of conditional survival probabilities in each interval*
sort seqn interv time
gen _t = time + 1
gen psurv = psurv_k if _t ==1
bysort seqn interv: replace psurv = psurv_k*psurv[_t-1] if _t >1

*Display 10-year standardized survival, under interventions*
*Note: since time starts at 0, month 119 is 10-year survival*
by interv, sort: summarize psurv if time == 119

quietly summarize psurv if(interv==0 & time ==119)
matrix input observe = (0,`r(mean)')
quietly summarize psurv if(interv==1 & time ==119)
matrix observe = (observe \1,`r(mean)')
matrix observe = (observe \3, observe[2,2]-observe[1,2])
matrix list observe


*Graph of standardized survival over time, under interventions*
*Note: since our outcome model has no covariates, we can plot psurv directly. If we had covariates we would need to stratify or average across the values*
expand 2 if time ==0, generate(newtime)
replace psurv = 1 if newtime == 1
gen time2 = 0 if newtime ==1
replace time2 = time + 1 if newtime == 0
separate psurv, by(interv)
twoway (line psurv0 time2, sort) (line psurv1 time2, sort) if interv > -1, ylabel(0.5(0.1)1.0) xlabel(0(12)120) ytitle("Survival probability") xtitle("Months of follow-up") legend(label(1 "A=0") label(2 "A=1"))
*remove extra timepoint*
drop if newtime == 1
drop time2

restore

**Bootstraps**
save nhefs_std1 , replace

capture program drop bootipw_surv
program define bootipw_surv , rclass
u nhefs_std1 , clear
preserve
bsample, cluster(seqn) idcluster(newseqn)

logit qsmk sex race c.age##c.age ib(last).education c.smokeintensity##c.smokeintensity ///
c.smokeyrs##c.smokeyrs ib(last).exercise ib(last).active c.wt71##c.wt71 if time == 0
predict p_qsmk, pr
logit qsmk if time ==0
predict num, pr
gen sw=num/p_qsmk if qsmk==1
replace sw=(1-num)/(1-p_qsmk) if qsmk==0

logit event qsmk qsmk#c.time qsmk#c.time#c.time c.time c.time#c.time [pweight=sw] , cluster(newseqn)

drop if time != 0
expand 120 if time ==0
bysort newseqn: replace time = _n - 1
expand 2 , generate(interv_b)
replace qsmk = interv_b

predict pevent_k, pr
gen psurv_k = 1-pevent_k
keep newseqn time qsmk interv_b psurv_k

sort newseqn interv_b time
gen _t = time + 1
gen psurv = psurv_k if _t ==1
bysort newseqn interv_b: replace psurv = psurv_k*psurv[_t-1] if _t >1
drop if time != 119
bysort interv_b: egen meanS_b = mean(psurv)
keep newseqn qsmk meanS_b
drop if newseqn != 1 /* only need one pair */

drop newseqn

return scalar boot_0 = meanS_b[1]
return scalar boot_1 = meanS_b[2]
return scalar boot_diff = return(boot_1) - return(boot_0)
restore
end
set rmsg on
simulate PrY_a0 = r(boot_0) PrY_a1 = r(boot_1) difference=r(boot_diff) , reps(10) seed(1) : bootipw_surv
set rmsg off

matrix pe = observe[1..3, 2]'
bstat, stat(pe) n(1629)


/***************************************************************
PROGRAM 17.4
通过g公式估算生存曲线
Section 17.5
Generates Figure 17.7
***************************************************************/
clear
use "nhefs_surv.dta"

keep seqn event qsmk time sex race age education smokeintensity smkintensity82_71 smokeyrs exercise active wt71
preserve

quietly logistic event qsmk qsmk#c.time qsmk#c.time#c.time time c.time#c.time ///
sex race c.age##c.age ib(last).education c.smokeintensity##c.smokeintensity ///
c.smokeyrs##c.smokeyrs ib(last).exercise ib(last).active c.wt71##c.wt71 , cluster(seqn)

drop if time != 0
expand 120 if time ==0
bysort seqn: replace time = _n - 1
expand 2 , generate(interv)
replace qsmk = interv
predict pevent_k, pr
gen psurv_k = 1-pevent_k
keep seqn time qsmk interv psurv_k
sort seqn interv time
gen _t = time + 1
gen psurv = psurv_k if _t ==1
bysort seqn interv: replace psurv = psurv_k*psurv[_t-1] if _t >1
by interv, sort: summarize psurv if time == 119

keep qsmk interv psurv time

bysort interv : egen meanS = mean(psurv) if time == 119
by interv: summarize meanS

quietly summarize meanS if(qsmk==0 & time ==119)
matrix input observe = ( 0,`r(mean)')
quietly summarize meanS if(qsmk==1 & time ==119)
matrix observe = (observe \1,`r(mean)')
matrix observe = (observe \2, observe[2,2]-observe[1,2])
*Add some row/column descriptions and print results to screen*
matrix rownames observe = P(Y(a=0)=1) P(Y(a=1)=1) difference
matrix colnames observe = interv survival


*Graph standardized survival over time, under interventions*
*Note: unlike in PROGRAM 17.3, we now have covariates so we first need to average survival across strata*
bysort interv time : egen meanS_t = mean(psurv)
*Now we can continue with the graph*
expand 2 if time ==0, generate(newtime)
replace meanS_t = 1 if newtime == 1
gen time2 = 0 if newtime ==1
replace time2 = time + 1 if newtime == 0
separate meanS_t, by(interv)
twoway (line meanS_t0 time2, sort) (line meanS_t1 time2, sort), ylabel(0.5(0.1)1.0) xlabel(0(12)120) ytitle("Survival probability") xtitle("Months of follow-up") legend(label(1 "A=0") label(2 "A=1"))
*remove extra timepoint*
drop if newtime == 1

restore

*Bootstraps*
save nhefs_std2 , replace

capture program drop bootstdz_surv
program define bootstdz_surv , rclass
u nhefs_std2 , clear
preserve
bsample, cluster(seqn) idcluster(newseqn)
logistic event qsmk qsmk#c.time qsmk#c.time#c.time time c.time#c.time ///
sex race c.age##c.age ib(last).education ///
c.smokeintensity##c.smokeintensity c.smkintensity82_71 ///
c.smokeyrs##c.smokeyrs ib(last).exercise ib(last).active c.wt71##c.wt71
drop if time != 0
/*only predict on new version of data */
expand 120 if time ==0
bysort newseqn: replace time = _n - 1
expand 2 , generate(interv_b)
replace qsmk = interv_b
predict pevent_k, pr
gen psurv_k = 1-pevent_k
keep newseqn time qsmk psurv_k
sort newseqn qsmk time
gen _t = time + 1
gen psurv = psurv_k if _t ==1
bysort newseqn qsmk: replace psurv = psurv_k*psurv[_t-1] if _t >1
drop if time != 119 /* keep only last observation */
keep newseqn qsmk psurv
/* if time is in data for complete graph add time to bysort */
bysort qsmk : egen meanS_b = mean(psurv)
keep newseqn qsmk meanS_b
drop if newseqn != 1 /* only need one pair */

drop newseqn

return scalar boot_0 = meanS_b[1]
return scalar boot_1 = meanS_b[2]
return scalar boot_diff = return(boot_1) - return(boot_0)
restore
end
set rmsg on
simulate PrY_a0 = r(boot_0) PrY_a1 = r(boot_1) difference=r(boot_diff) , reps(10) seed(1) : bootstdz_surv
set rmsg off

matrix pe = observe[1..3, 2]'
bstat, stat(pe) n(1629)

Source:

https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

长按上方二维码可以读该书


关于因果推断,可参看关于各种因果识别方法的120份经典实证文献汇总”,21年诺奖得主那些年关于教育的研究, 学习其中的因果推断方法!因果推断的统计方法总结, 177份文献政策评估的计量方法综述, 包括最新因果推断方法在教育领域使用IV, RDD, DID, PSM多吗? 使用具体文献,看完顶级期刊文章后, 整理了内生性处理小册子工具变量精辟解释, 保证你一辈子都忘不了DID, 合成控制, 匹配, RDD四种方法比较, 适用范围和特征关于双重差分法DID的32篇精选Articles专辑!关于(模糊)断点回归设计的100篇精选Articles专辑!匹配方法(matching)操作指南, 值得收藏的16篇文章等,MIT广为流传的政策"处理效应"读本DID的研究动态和政策评估中应用的文献综述最新政策效应评估的四种方法政策效应评估的基本问题等。

1.用"因果关系图"来进行因果推断的新技能2.因果推断专题:因果图3.因果推断专题:有向无环图DAG4.confounder与collider啥区别? 混淆 vs 对撞5.三张图秒懂, 混淆, 中介, 调节, 对撞, 暴露, 结果和协变量的复杂关系6.中介效应检验流程, 示意图公布, 不再畏惧中介分析7.图灵奖得主Pearl的因果推断新科学,Book of Why?  8.前沿: nature刊掀起DAG热, 不掌握就遭淘汰无疑!因果关系研究的图形工具!9.前沿: 卫星数据在实证研究中的应用, 用其开展因果推断的好处!10.7大因果推断大法精选实证论文, 可用于中国本土博士课堂教学!11.随机分配是什么, 为什么重要, 对因果关系影响几何?12.应用计量经济学现状: 因果推断与政策评估最全综述13.疫情期计量课程免费开放!面板数据, 因果推断, 时间序列分析与Stata应用14.Python做因果推断的方法示例, 解读与code15.内生转换模型vs内生处理模型vs样本选择模型vs工具变量2SLS16.不用IV, 基于异方差识别方法解决内生性, 赐一篇文献等等。


关于因果推断书籍:哈佛大学新修订完成的因果推断经典大作免费下载!附数据和code!图灵奖得主Pearl的因果推断新科学, Why?计量课程免费开放!面板数据, 因果推断, 时间序列分析与Stata应用(慕课上有不少免费课程,建议年轻学者好好使用),④你应该阅读哪本因果推断书籍: 一份进阶流程图和简短书评列表

下这些短链接文章属于合集,可以收藏起来阅读,不然以后都找不到了。

3.5年,计量经济圈近1000篇不重类计量文章,

可直接在公众号菜单栏搜索任何计量相关问题,

Econometrics Circle




数据系列空间矩阵 | 工企数据 | PM2.5 | 市场化指数 | CO2数据 |  夜间灯光 | 官员方言  | 微观数据 | 内部数据计量系列匹配方法 | 内生性 | 工具变量 | DID | 面板数据 | 常用TOOL | 中介调节 | 时间序列 | RDD断点 | 合成控制 | 200篇合辑 | 因果识别 | 社会网络 | 空间DID数据处理Stata | R | Python | 缺失值 | CHIP/ CHNS/CHARLS/CFPS/CGSS等 |干货系列能源环境 | 效率研究 | 空间计量 | 国际经贸 | 计量软件 | 商科研究 | 机器学习 | SSCI | CSSCI | SSCI查询 | 名家经验计量经济圈组织了一个计量社群,有如下特征:热情互助最多前沿趋势最多、社科资料最多、社科数据最多、科研牛人最多、海外名校最多。因此,建议积极进取和有强烈研习激情的中青年学者到社群交流探讨,始终坚信优秀是通过感染优秀而互相成就彼此的。

您可能也对以下帖子感兴趣

文章有问题?点此查看未经处理的缓存