This post walks through a small project: scraping the comments from a Baidu Tieba post with a crawler and saving them to a local file.
Import the required libraries

```python
import csv
import requests
import re
import time
```
Open any post on Baidu Tieba and inspect the page source. The comments are embedded directly in the HTML, so we can get everything we need by parsing the page source itself.
Send a request with requests

```python
url = 'https://tieba.baidu.com/p/9285638137?pn=1'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36'
}
resp = requests.get(url, headers=headers)
```
Here requests impersonates a browser (via the User-Agent header), sends a single request to the target URL, and receives the server's response resp. The next step is to parse the contents of resp and pull out the data we want.
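Before parsing, it is worth a quick sanity check that the request succeeded and the text decodes cleanly. A minimal sketch using standard requests calls (the explicit UTF-8 override is my addition, not part of the original code):

```python
resp.raise_for_status()   # raise requests.HTTPError on a 4xx/5xx response
resp.encoding = 'utf-8'   # Tieba serves UTF-8; avoids requests mis-guessing the charset
print(resp.status_code)   # expect 200
```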
Parse the data

```python
html = resp.text
comments = re.findall('style="display:;"> (.*?)</div>', html)
users = re.findall('class="p_author_name j_user_card" href=".*?" target="_blank">(.*?)</a>', html)
comment_times = re.findall('楼</span><span class="tail-info">(.*?)</span><div', html)
```
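Each re.findall call returns the text captured by the non-greedy group (.*?), one item per match. A quick self-contained demo of how the comment pattern behaves (the sample string is a made-up fragment that mimics the real markup, not copied from Tieba):

```python
import re

sample = ('<div style="display:;"> first comment</div>'
          '<div style="display:;"> second comment</div>')
print(re.findall('style="display:;"> (.*?)</div>', sample))
# ['first comment', 'second comment']
```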
Save to a local file

```python
# newline='' stops the csv module from inserting blank lines on Windows
with open('01.csv', 'a', encoding='utf-8', newline='') as f:
    csvwriter = csv.writer(f)
    # write the header row before the data rows
    csvwriter.writerow(('评论用户', '评论时间', '评论内容'))
    for u, c, t in zip(users, comments, comment_times):
        # skip rows where the regex captured leftover HTML or a garbled user name
        if 'img' in c or 'div' in c or len(u) > 50:
            continue
        csvwriter.writerow((u, t, c))
        print(u, t, c)
```
The complete code is below:
```python
import csv
import requests
import re
import time

url = 'https://tieba.baidu.com/p/9285638137?pn=1'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36'
}
resp = requests.get(url, headers=headers)
html = resp.text

comments = re.findall('style="display:;"> (.*?)</div>', html)
users = re.findall('class="p_author_name j_user_card" href=".*?" target="_blank">(.*?)</a>', html)
comment_times = re.findall('楼</span><span class="tail-info">(.*?)</span><div', html)

with open('01.csv', 'a', encoding='utf-8', newline='') as f:
    csvwriter = csv.writer(f)
    csvwriter.writerow(('评论用户', '评论时间', '评论内容'))
    for u, c, t in zip(users, comments, comment_times):
        if 'img' in c or 'div' in c or len(u) > 50:
            continue
        csvwriter.writerow((u, t, c))
        print(u, t, c)
```
This only captures the first page of comments. Compare the URLs: page one is https://tieba.baidu.com/p/9285638137?pn=1, and after paging it becomes https://tieba.baidu.com/p/9285638137?pn=2. The only difference is the pn parameter, so we can loop over it to crawl every page:
```python
import csv
import requests
import re
import time


def main(page):
    url = f'https://tieba.baidu.com/p/9285638137?pn={page}'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
    }
    resp = requests.get(url, headers=headers)
    html = resp.text
    comments = re.findall('style="display:;"> (.*?)</div>', html)
    users = re.findall('class="p_author_name j_user_card" href=".*?" target="_blank">(.*?)</a>', html)
    comment_times = re.findall('楼</span><span class="tail-info">(.*?)</span><div', html)
    for u, c, t in zip(users, comments, comment_times):
        if 'img' in c or 'div' in c or len(u) > 50:
            continue
        csvwriter.writerow((u, t, c))
        print(u, t, c)
    print(f'第{page}页爬取完毕')


if __name__ == '__main__':
    with open('01.csv', 'a', encoding='utf-8', newline='') as f:
        csvwriter = csv.writer(f)
        csvwriter.writerow(('评论用户', '评论时间', '评论内容'))
        for page in range(1, 8):
            main(page)
            time.sleep(2)  # pause between pages to avoid hammering the server
```
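One possible refinement (my suggestion, not part of the original post): instead of hardcoding range(1, 8), keep fetching pages until a page yields no comment matches. The imports, headers, and regexes are reused from the script above. Note the assumption: if Tieba serves the last page again for an out-of-range pn rather than an empty one, this loop would never terminate, and you would instead need to read the total page count out of the HTML.

```python
import itertools

if __name__ == '__main__':
    with open('01.csv', 'a', encoding='utf-8', newline='') as f:
        csvwriter = csv.writer(f)
        csvwriter.writerow(('评论用户', '评论时间', '评论内容'))
        for page in itertools.count(1):
            url = f'https://tieba.baidu.com/p/9285638137?pn={page}'
            html = requests.get(url, headers=headers).text
            comments = re.findall('style="display:;"> (.*?)</div>', html)
            if not comments:  # no matches: assume we ran past the last page
                break
            users = re.findall('class="p_author_name j_user_card" href=".*?" target="_blank">(.*?)</a>', html)
            comment_times = re.findall('楼</span><span class="tail-info">(.*?)</span><div', html)
            for u, c, t in zip(users, comments, comment_times):
                if 'img' in c or 'div' in c or len(u) > 50:
                    continue
                csvwriter.writerow((u, t, c))
            print(f'第{page}页爬取完毕')
            time.sleep(2)
```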
One small issue remains: the regex-based extraction drops comments that contain emoticons, because those comments embed <img> tags inside the comment div and the filter above skips them.
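A possible fix (an untested sketch, assuming emoticons appear as <img> tags inside the comment div): keep those rows, replace each image with a placeholder, and strip any remaining inline tags instead of discarding the comment. The clean_comment helper is hypothetical, not part of the original code.

```python
import re

def clean_comment(fragment):
    # replace emoticon images with a visible placeholder...
    fragment = re.sub(r'<img[^>]*>', '[表情]', fragment)
    # ...then strip any remaining inline tags and trim whitespace
    return re.sub(r'<[^>]+>', '', fragment).strip()

# demo on a made-up fragment
print(clean_comment('好活 <img class="BDE_Smiley" src="x.png">'))
# -> '好活 [表情]'
```

With this helper in place, the `if 'img' in c` filter could be relaxed so emoticon comments are written out as `clean_comment(c)` rather than skipped.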