
Wrapping a Practical Web Scraper into Functions

Data Analysis Team | Stata and Python Data Analysis | 2022-03-15

Author: 陈志玲

Copy editor: 余术玲

Technical editor: 张邯


In the earlier post 《爬虫实战——聚募网股权众筹信息爬取》 we showed how to scrape the site's information with the requests and time libraries. We reached that goal, but the long, sprawling program was hard to read. To make the program clean, readable, and easy to follow, today we show how to wrap it into functions, so that in the end we only need to call those functions to collect the information.
In Python, a function is defined with the def statement: write the function name, parentheses enclosing the parameter list, and a colon; then write the function body in the indented block, and hand back the function's result with a return statement. That is:

def <function name>(parameter list):

    <function body>

    return <return value list>
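For instance, a minimal function following this pattern takes two numbers and returns their sum:

def add(a, b):          # function name and parameter list
    return a + b        # return the result to the caller

print(add(1, 2))        # prints 3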

First, as before, we need to import the three libraries:
import requests
import time
import json
Next we wrap the code that fetches a single page into a function:

# First define a function name, GetPageProjectInfo (letters and underscores only),
# and its parameter, i.e. the information the function needs to do its job,
# which here is the page number, page
def GetPageProjectInfo(page):
    # the headers dictionary is shown in the full program at the end
    timestamp = int(round(time.time()*1000))
    url = f"https://www.dreammove.cn/list/get_list.html?type=8&industry=0&city=0&offset={page}&keyword=&_={timestamp}"
    raw_html = requests.get(url, headers=headers)  # requests.get(url, params=None, **kwargs)
    html_text = raw_html.text
    return json.loads(html_text)["data"]["list"]  # the return statement hands back the value we need
We have now defined the GetPageProjectInfo() function, but so far this only declares a function in memory. To call it and fetch the information on the first page, we pass 1 to GetPageProjectInfo() and print its return value, as shown below:
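print(GetPageProjectInfo(1))    # print the list of projects on the first page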

Next we encapsulate the second part, gathering the project information from every page. Again we start by defining a function:

def GetAllProjectInfo():  # this function can be called without any arguments
    ProjectInfo = []
    for i in range(1, 23):
        # splice the values returned by GetPageProjectInfo(i) onto the empty list ProjectInfo
        ProjectInfo.extend(GetPageProjectInfo(i))
    return ProjectInfo
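Before writing anything to disk, a quick sanity check of our own (not part of the original program) is to count how many projects were collected:

AllInfo = GetAllProjectInfo()
print(len(AllInfo))    # total number of projects gathered from the 22 pages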

Following the order of 《爬虫实战——聚募网股权众筹信息爬取》, the next step is to encapsulate the code that writes the content returned by GetAllProjectInfo() to a csv file. We define a function Json2csv_Project(Info, VarName, FileName) and pass it three arguments: Info is the ProjectInfo returned by the function above (the information on all projects), VarName is the list of variable names used as the csv header, and FileName is the name of the file we obtain at the end. The details are as follows:

def Json2csv_Project(Info, VarName, FileName):
    with open(FileName, "w", encoding="gb18030") as f:
        f.write("\t".join(VarName) + "\n")
        for EachInfo in Info:
            tempInfo = []
            for key in VarName:
                if key in EachInfo:
                    # convert the value to a string and strip newlines, tabs and carriage returns
                    tempInfo.append(str(EachInfo[key]).replace("\n", "").replace("\t", "").replace("\r", ""))
                else:
                    tempInfo.append("")
            f.write("\t".join(tempInfo) + "\n")
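As a quick illustration, Json2csv_Project could be called with a short, hypothetical header list (the real call, with the full variable list, appears in the main program below):

SampleVars = ["id", "project_name"]    # hypothetical two-column header
Json2csv_Project(GetAllProjectInfo(), SampleVars, "sample.csv")    # "sample.csv" is a hypothetical file name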
Then we need to extract the ids from the csv file, and we wrap that step into a function as well:

def GetId(FileName):  # pass the csv file in as the argument
    with open(FileName, "r", encoding="gb18030") as f:
        final_Info = f.readlines()
        ProjectId = []
        for i in range(1, len(final_Info)):  # start at 1 to skip the header row
            ProjectId.append(final_Info[i].split("\t")[0])
        return ProjectId  # return the ids of all projects
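For instance, applied to the hypothetical file from the example above:

ids = GetId("sample.csv")    # "sample.csv" is the illustrative file name used earlier
print(ids[:5])               # inspect the first five project ids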
With the ids in hand, we continue as before and scrape the second-level pages, defining a function GetTeamInfo(); the only change is that the code now lives inside a function. We can pass an id obtained from the function above straight in as the argument, with no need to write out the loop over the 198 ids by hand as in the earlier post. The details are as follows:

def GetTeamInfo(ProjectId):
    timestamp = int(round(time.time()*1000))
    url = f"https://www.dreammove.cn/project/project_team/id/{ProjectId}?_={timestamp}"
    # the headers dictionary is shown in the full program at the end
    raw_html = requests.get(url, headers=headers)
    html_text = raw_html.text
    return json.loads(html_text)["data"]["team_list"]
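To try the function on a single project before looping, we can pass one id straight in; 97 here is only an illustrative id (it is the project that the Referer header in the full program points to):

print(GetTeamInfo("97"))    # team list for one project; id 97 is only an example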
Now we call the function above to obtain each project's team information and splice the results together into one list. So that we can inspect the scraped information at any time, today we go one step further and save it to a csv file, much as we did when writing the project information to csv above. The program is as follows:

# Team_Id is the list of all ids, VarName the variable names that form the final
# header, and FileName the final csv file
def Json2csv_Team(Team_Id, VarName, FileName):
    with open(FileName, "w", encoding="gb18030") as f:  # write to FileName with gb18030 encoding
        f.write("id\t" + "\t".join(VarName) + "\n")
        for Eachid in Team_Id:
            TeamInfo = GetTeamInfo(Eachid)  # use GetTeamInfo(id) to fetch this project's team information
            if TeamInfo.__class__ == list:  # check whether the TeamInfo obtained above is a list
                for Eachperson in TeamInfo:  # if so, traverse the list and handle each element
                    print(Eachperson)
                    tempInfo = [Eachid]  # start the row with the id from the outer loop
                    for key in VarName:  # traverse the variable names
                        if key in Eachperson:  # if the key appears in this team member's dictionary
                            # convert the value to a string and strip newlines, tabs and carriage returns
                            tempInfo.append(str(Eachperson[key]).replace("\n", "").replace("\t", "").replace("\r", ""))
                        else:
                            tempInfo.append("")
                    f.write("\t".join(tempInfo) + "\n")
At this point all of the code has been wrapped into functions; what remains is to call them, in what is usually called the main program.

# This test lets the script tell whether it is being imported as a module or run
# directly; when it is imported, the code under the if statement is not executed
if __name__ == "__main__":
    # call GetAllProjectInfo() to collect the project information, then call
    # Json2csv_Project to write it to the csv file Project_FileName
    VarName = ['id','update_time','province_name','subsite_id','is_open','industry','type','open_flag','project_name','step','seo_string','abstract','cover','project_phase','member_count','province','city','address','company_name','project_url','uid','over_time','vote_leader_step','stage','is_agree','is_del','agreement_id','barcode','sort','display_subsite_id','need_fund','real_fund','project_valuation','final_valuation','min_lead_fund','min_follow_fund','total_fund','agree_total_fund','leader_flag','leader_id','read_cnt','follow_cnt','inverstor_cnt','comment_cnt','nickname','short_name','site_url','site_logo','storelevel','industry_name']
    Project_FileName = "C:\\CrowdFunding\\dreammove\\ProjectInfo.csv"
    Json2csv_Project(GetAllProjectInfo(), VarName, Project_FileName)
    # fetch the second-level team information and store it in TeamInfo.csv:
    # GetId extracts the project ids from Project_FileName, and Json2csv_Team
    # then fetches the team information and writes it out
    VarName = ['name','duty','src','intro','is_fulltime','relationship','short_intro','shared_rate','amount','member_type']
    Team_FileName = "C:\\CrowdFunding\\dreammove\\TeamInfo.csv"
    Json2csv_Team(GetId(Project_FileName), VarName, Team_FileName)
To sum up: when defining a function, we need to settle on its name and its parameters. A function is simply an organized, reusable block of code that carries out a single task or a set of related tasks. Functions raise code reuse, protect consistency, and make programs easier to maintain and extend, which is exactly why we wrap a long, messy stretch of code into them.
Finally, here is the complete program:
import requests
import time
import json

def GetPageProjectInfo(page):
    headers = {"Accept": "application/json, text/javascript, */*; q=0.01",
               "Referer": "https://www.dreammove.cn/list/index.html?industry=0&type=8&city=0",
               "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Mobile Safari/537.36",
               "X-Requested-With": "XMLHttpRequest"}
    timestamp = int(round(time.time()*1000))
    url = f"https://www.dreammove.cn/list/get_list.html?type=8&industry=0&city=0&offset={page}&keyword=&_={timestamp}"
    raw_html = requests.get(url, headers=headers)
    html_text = raw_html.text
    return json.loads(html_text)["data"]["list"]

#print(GetPageProjectInfo(1))

def GetAllProjectInfo():
    ProjectInfo = []
    for i in range(1, 23):
        ProjectInfo.extend(GetPageProjectInfo(i))
    return ProjectInfo

#print(GetAllProjectInfo())

def Json2csv_Project(Info, VarName, FileName):
    with open(FileName, "w", encoding="gb18030") as f:
        f.write("\t".join(VarName) + "\n")
        for EachInfo in Info:
            tempInfo = []
            for key in VarName:
                if key in EachInfo:
                    tempInfo.append(str(EachInfo[key]).replace("\n", "").replace("\t", "").replace("\r", ""))
                else:
                    tempInfo.append("")
            f.write("\t".join(tempInfo) + "\n")

def GetId(FileName):
    with open(FileName, "r", encoding="gb18030") as f:
        final_Info = f.readlines()
        ProjectId = []
        for i in range(1, len(final_Info)):
            ProjectId.append(final_Info[i].split("\t")[0])
        return ProjectId

def GetTeamInfo(ProjectId):
    timestamp = int(round(time.time()*1000))
    url = f"https://www.dreammove.cn/project/project_team/id/{ProjectId}?_={timestamp}"
    headers = {"Accept": "application/json, text/javascript, */*; q=0.01",
               "Accept-Encoding": "gzip, deflate, br",
               "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
               "Connection": "keep-alive",
               "Cookie": "PHPSESSID=bk25qacnlg8g68d205h4pqeq56; Hm_lvt_c18b08cac9b94bf4628c0277d3a4d7de=1562549437; jumu_web_idu=MDAwMDAwMDAwMLGGhpiGr36zsa96r7WEvXE; jumu_web_idp=MDAwMDAwMDAwMMafpd-afJ2NtZ9-r7OXoXE; Hm_lpvt_c18b08cac9b94bf4628c0277d3a4d7de=1562561558",
               "Host": "www.dreammove.cn",
               "Referer": "https://www.dreammove.cn/project/detail/id/97",
               "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Mobile Safari/537.36",
               "X-Requested-With": "XMLHttpRequest"}
    raw_html = requests.get(url, headers=headers)
    html_text = raw_html.text
    return json.loads(html_text)["data"]["team_list"]

def Json2csv_Team(Team_Id, VarName, FileName):
    with open(FileName, "w", encoding="gb18030") as f:
        f.write("id\t" + "\t".join(VarName) + "\n")
        for Eachid in Team_Id:
            #print(eachid)
            TeamInfo = GetTeamInfo(Eachid)
            if TeamInfo.__class__ == list:
                for Eachperson in TeamInfo:
                    print(Eachperson)
                    tempInfo = [Eachid]
                    for key in VarName:
                        if key in Eachperson:
                            tempInfo.append(str(Eachperson[key]).replace("\n", "").replace("\t", "").replace("\r", ""))
                        else:
                            tempInfo.append("")
                    f.write("\t".join(tempInfo) + "\n")

if __name__ == "__main__":
    VarName = ['id','update_time','province_name','subsite_id','is_open','industry','type','open_flag','project_name','step','seo_string','abstract','cover','project_phase','member_count','province','city','address','company_name','project_url','uid','over_time','vote_leader_step','stage','is_agree','is_del','agreement_id','barcode','sort','display_subsite_id','need_fund','real_fund','project_valuation','final_valuation','min_lead_fund','min_follow_fund','total_fund','agree_total_fund','leader_flag','leader_id','read_cnt','follow_cnt','inverstor_cnt','comment_cnt','nickname','short_name','site_url','site_logo','storelevel','industry_name']
    Project_FileName = "C:\\CrowdFunding\\dreammove\\ProjectInfo.csv"
    Json2csv_Project(GetAllProjectInfo(), VarName, Project_FileName)
    VarName = ['name','duty','src','intro','is_fulltime','relationship','short_intro','shared_rate','amount','member_type']
    Team_FileName = "C:\\CrowdFunding\\dreammove\\TeamInfo.csv"
    Json2csv_Team(GetId(Project_FileName), VarName, Team_FileName)



