Scraping Qiushibaike with requests + XPath + json
(1) requests: fetches the pages, import requests
(2) xpath from lxml: parses the data, from lxml import etree
(3) json: stores the data, import json
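To see how the parsing and storage steps fit together without touching the network, here is a minimal sketch. The HTML fragment is made up for illustration and only mimics the class names used in this post; `first_text` and `parse_posts` are hypothetical helper names, not part of lxml.

```python
import json
from lxml import etree

# Hypothetical HTML fragment standing in for a downloaded page (an assumption,
# not real site markup beyond the class names used in this post).
SAMPLE_HTML = """
<div class="col1 old-style-col1">
  <div>
    <h2> alice </h2>
    <div class="content"><span> first line </span></div>
    <span class="stats-vote"><i>12</i></span>
  </div>
  <div>
    <div class="content"><span>no author here</span></div>
  </div>
</div>
"""


def first_text(node, path):
    """Return the first stripped text match for path, or None if nothing matches."""
    matches = node.xpath(path)
    return matches[0].strip() if matches else None


def parse_posts(html_str):
    """Group the page by post div, then pull each field out with a relative XPath."""
    html = etree.HTML(html_str)
    posts = []
    for div in html.xpath("//div[@class='col1 old-style-col1']/div"):
        texts = div.xpath(".//div[@class='content']/span/text()")
        posts.append({
            "name": first_text(div, ".//h2/text()"),
            "content": "".join(t.strip() for t in texts),
            "votes": first_text(div, ".//span[@class='stats-vote']/i/text()"),
        })
    return posts


posts = parse_posts(SAMPLE_HTML)
# One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
lines = [json.dumps(p, ensure_ascii=False) for p in posts]
```

Checking whether the XPath result is empty before taking `[0]` keeps a post with a missing field from raising `IndexError`; the field is simply stored as `None`.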
Here is the full code:
# json + lxml + xpath + requests: scrape Qiushibaike
from lxml import etree
import requests
import json


class QiuShiBK(object):
    def __init__(self):
        self.init_url = "https://www.qiushibaike.com/text/page/{}/"
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.13 Safari/537.36'}

    def get_url_list(self):
        # The page URLs follow a fixed pattern, so build url_list with a list comprehension
        url_list = [self.init_url.format(i) for i in range(1, 14)]
        return url_list

    def parse_url(self, url):
        # Send the request and return the decoded HTML
        response = requests.get(url, headers=self.headers)
        response.raise_for_status()
        return response.content.decode('UTF-8')

    def get_content_list(self, html_str):
        content_list = []
        html = etree.HTML(html_str)
        # 1. Group by div: each child div is one post
        div_list = html.xpath("//div[@class='col1 old-style-col1']/div")
        for div in div_list:
            item = {}
            # 2. Extract each field; fall back to None when the XPath matches nothing
            name = div.xpath(".//h2/text()")
            item["girl_name"] = name[0].strip() if name else None
            # The content may span several text nodes, so unlike the other
            # fields we join the whole list instead of taking [0]
            content = div.xpath(".//div[@class='content']/span/text()")
            item["girl_content"] = "".join(i.strip() for i in content)
            laugh = div.xpath(".//span[@class='stats-vote']/i/text()")
            item["girl_laugh"] = laugh[0].strip() if laugh else None
            comment = div.xpath(".//span[@class='stats-comments']//i/text()")
            item["girl_comment"] = comment[0] if comment else None
            content_list.append(item)
        return content_list

    def save_content_list(self, content_list):
        # Append one JSON object per line; ensure_ascii=False keeps Chinese text readable
        with open('qiushi.txt', 'a', encoding='UTF-8') as f:
            for content in content_list:
                f.write(json.dumps(content, ensure_ascii=False))
                f.write("\n")
        print("Saved successfully")

    def run(self):
        # 1. Build the URLs from the URL pattern
        # 2. Send the requests
        # 3. Extract the data
        # 4. Save the data
        for url in self.get_url_list():  # 1. build the URLs
            html_str = self.parse_url(url)  # 2. send the request
            content_list = self.get_content_list(html_str)  # 3. extract the data
            self.save_content_list(content_list)  # 4. save the data


if __name__ == '__main__':
    qiushi = QiuShiBK()
    qiushi.run()
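Because the script writes one JSON object per line to qiushi.txt, the file can be read back line by line. The sketch below shows one way to do that; `load_items` is a hypothetical helper, and the demo uses a temporary file with made-up records in the same shape rather than real scraped data.

```python
import json
import os
import tempfile


def load_items(path):
    """Read a file holding one JSON object per line and return the objects as a list."""
    items = []
    with open(path, encoding="UTF-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                items.append(json.loads(line))
    return items


# Made-up records (an assumption) with the same keys the scraper writes.
sample = [
    {"girl_name": "alice", "girl_content": "糗事一", "girl_laugh": "12", "girl_comment": "3"},
    {"girl_name": None, "girl_content": "糗事二", "girl_laugh": None, "girl_comment": None},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "qiushi_sample.txt")
    # Write the records the same way save_content_list does.
    with open(path, "a", encoding="UTF-8") as f:
        for item in sample:
            f.write(json.dumps(item, ensure_ascii=False))
            f.write("\n")
    loaded = load_items(path)
```

Note that the scraper opens the file in append mode, so running it twice leaves duplicate records in qiushi.txt; delete the file first if a clean run is wanted.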