Getting response.body from chromedriver via the network log

qq_31474513 · published 2020-11-24 16:49:54

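Crawler WebDriver tech notes

Overview: this post shares how to obtain response.body from the chromedriver log, along with some basic Selenium usage. In the past, setting up chromedriver and chrome-browser on Linux was complicated and error-prone, and sites' anti-crawling measures were fairly simple, so the comparatively slow WebDriver was rarely used in crawler projects. Today the WebDriver development environment is much easier to set up, while anti-crawling measures and JavaScript complexity keep increasing. WebDriver is therefore worth studying closely and using in production projects.

Environment: python3, selenium, chromedriver

Official documentation: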

https://selenium-python.readthedocs.io/

https://sites.google.com/a/chromium.org/chromedriver/

https://chromedevtools.github.io/devtools-protocol/tot/Network/

Chapter 1: Installing the environment

Windows and Mac are relatively simple, so this covers installation on CentOS 7.

Install selenium

pip3 install selenium

Install chrome-browser

wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
yum install ./google-chrome-stable_current_x86_64.rpm

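After installation the current version is displayed.

The version I installed is 87.0.4280.66-1

Install chromedriver (it must match the chrome-browser version)

Find the download URL for the matching version at http://chromedriver.storage.googleapis.com/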

wget http://chromedriver.storage.googleapis.com/87.0.4280.20/chromedriver_linux64.zip

Unzip the archive and move the binary into /usr/bin:

unzip chromedriver_linux64.zip
mv chromedriver /usr/bin
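
The version match can also be checked mechanically before starting a crawl. The following is a minimal sketch, assuming that "matching" means the first three version components (major.minor.build) agree, which holds for the 87.0.4280.66-1 / 87.0.4280.20 pair used above. `same_build` is a hypothetical helper, not part of selenium or chromedriver.

```python
def same_build(chrome_version: str, driver_version: str) -> bool:
    """Return True when both versions share major.minor.build (assumed rule)."""
    def build_prefix(version: str):
        # Drop a packaging suffix such as "-1", then keep the first three parts.
        core = version.strip().split('-')[0]
        return core.split('.')[:3]
    return build_prefix(chrome_version) == build_prefix(driver_version)


# The pair installed in this chapter agrees on 87.0.4280.
print(same_build('87.0.4280.66-1', '87.0.4280.20'))
# A driver from a different build does not.
print(same_build('87.0.4280.66-1', '86.0.4240.22'))
```

The version strings themselves would come from the output of the installed binaries; how they are collected is left out of this sketch.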

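Chapter 2: Basic crawler usage

Common crawler operations:

1. Set the User-Agent

2. Change IP frequently. The documentation and many blog posts show ways to set a proxy when launching chromedriver, but changing the proxy then requires quitting and relaunching, which wastes a great deal of time and system resources and cannot keep up with a crawler's proxy-rotation needs. So we need a way to change the chromedriver proxy frequently.

Demo: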

# Uses the Selenium 3 style API (desired_capabilities, start_session)
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.proxy import Proxy, ProxyType

option = webdriver.ChromeOptions()
option.add_argument('--no-sandbox')
option.add_argument('--headless')
# Set the user-agent
option.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:55.0) Gecko/20100101 Firefox/55.0')
# Copy so the shared DesiredCapabilities.CHROME dict is not mutated
caps = DesiredCapabilities.CHROME.copy()
driver = webdriver.Chrome(options=option, desired_capabilities=caps)

p = '222.185.28.38:32374'
proxy = Proxy({
    'proxyType': ProxyType.MANUAL,
    'httpProxy': p,
})
# Write the proxy into the capabilities and start a new session with them
proxy.add_to_capabilities(caps)
driver.start_session(caps)
driver.get('http://httpbin.org/get')
print(driver.page_source)
# To change the proxy, run proxy.add_to_capabilities(caps) and driver.start_session(caps) again with the new proxy. The proxy is updated without restarting the driver.

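Because this runs a real browser, headers and similar details need little intervention. Changing the proxy and the UA are the core crawler requirements. For other basic operations see the official documentation https://python-selenium-zh.readthedocs.io/zh_CN/ or the blog post https://cloud.tencent.com/developer/article/1067129.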
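
For frequent proxy changes, something has to decide which address comes next. The following is a minimal round-robin sketch using only the standard library. `make_proxy_rotation` is a hypothetical helper and the second address is a placeholder, neither comes from the original post.

```python
from itertools import cycle


def make_proxy_rotation(addresses):
    """Return a function that yields the next 'ip:port' on each call, round-robin."""
    addresses = list(addresses)
    if not addresses:
        raise ValueError('at least one proxy address is required')
    pool = cycle(addresses)
    return lambda: next(pool)


# '10.0.0.2:8080' is a placeholder address used only for illustration.
next_proxy = make_proxy_rotation(['222.185.28.38:32374', '10.0.0.2:8080'])
print(next_proxy())
print(next_proxy())
print(next_proxy())
```

Each returned value would take the place of `p` in the demo above, followed by the same proxy.add_to_capabilities(caps) and driver.start_session(caps) calls.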

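Chapter 3: Advanced WebDriver crawling

There was another reason I used to avoid WebDriver. For ajax requests, and for requests where the developer tools reveal an API returning JSON, parsing page_source is inefficient and yields fewer fields than the API itself. So I started thinking: since the browser's developer tools can analyze network traffic, WebDriver should have a corresponding capability.

If chromedriver could monitor the network and intercept responses directly to get response.body, wouldn't that be great?

The first idea was Plan A: monitor and intercept requests with a man-in-the-middle tool such as mitmproxy.

But Plan A has several drawbacks that make it impractical.

1. WebDriver has to use the mitmproxy service as its proxy, so the crawler's normal proxy logic needs a second, chained proxy layer. That adds a lot of cost and failure modes and makes smooth proxy switching hard.

2. The intercepted requests and responses are hard to correlate with the browser's URL and the corresponding actions, so the data flow becomes confusing.

3. WebDriver has to install the mitm third-party certificate. But many sites now use anti-crawling measures such as SSL pinning, which man-in-the-middle approaches struggle to get past, so WebDriver requests fail.

With so many problems, Plan A was dropped.

The ideal approach is to analyze logs directly through WebDriver and obtain the response.

So I started researching WebDriver's performance logging. After a series of investigations and experiments, I could finally get WebDriver's network log. The result was a little disappointing, though, because the log has no response.body. The log structure is as follows: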

{
    "message":{
        "method":"Network.responseReceived",
        "params":{
            "frameId":"35489C704F5D95B9D01BD99C00BA3AFA",
            "loaderId":"BA3FE8EAD2BB30562343F3940AEB8CBC",
            "requestId":"BA3FE8EAD2BB30562343F3940AEB8CBC",
            "response":{
                "connectionId":12,
                "connectionReused":false,
                "encodedDataLength":230,
                "fromDiskCache":false,
                "fromPrefetchCache":false,
                "fromServiceWorker":false,
                "headers":{
                    "Access-Control-Allow-Credentials":"true",
                    "Access-Control-Allow-Origin":"*",
                    "Connection":"keep-alive",
                    "Content-Length":"571",
                    "Content-Type":"application/json",
                    "Date":"Tue, 24 Nov 2020 01:59:03 GMT",
                    "Server":"gunicorn/19.9.0"
                },
                "headersText":"HTTP/1.1 200 OK\r\nDate: Tue, 24 Nov 2020 01:59:03 GMT\r\nContent-Type: application/json\r\nContent-Length: 571\r\nConnection: keep-alive\r\nServer: gunicorn/19.9.0\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\n\r\n",
                "mimeType":"application/json",
                "protocol":"http/1.1",
                "remoteIPAddress":"3.211.1.78",
                "remotePort":80,
                "requestHeaders":{
                    "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
                    "Accept-Encoding":"gzip, deflate",
                    "Accept-Language":"en-US",
                    "Connection":"keep-alive",
                    "Host":"httpbin.org",
                    "Upgrade-Insecure-Requests":"1",
                    "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:55.0) Gecko/20100101 Firefox/55.0"
                },
                "requestHeadersText":"GET /get HTTP/1.1\r\nHost: httpbin.org\r\nConnection: keep-alive\r\nUpgrade-Insecure-Requests: 1\r\nUser-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:55.0) Gecko/20100101 Firefox/55.0\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\r\nAccept-Encoding: gzip, deflate\r\nAccept-Language: en-US\r\n",
                "responseTime":1606183143651.414,
                "securityState":"insecure",
                "status":200,
                "statusText":"OK",
                "timing":{
                    "connectEnd":282.386,
                    "connectStart":47.972,
                    "dnsEnd":47.972,
                    "dnsStart":46.748,
                    "proxyEnd":-1,
                    "proxyStart":-1,
                    "pushEnd":0,
                    "pushStart":0,
                    "receiveHeadersEnd":519.16,
                    "requestTime":43961290.624483,
                    "sendEnd":282.616,
                    "sendStart":282.547,
                    "sslEnd":-1,
                    "sslStart":-1,
                    "workerFetchStart":-1,
                    "workerReady":-1,
                    "workerRespondWithSettled":-1,
                    "workerStart":-1
                },
                "url":"http://httpbin.org/get"
            },
            "timestamp":43961291.145069,
            "type":"Document"
        }
    },
    "webview":"35489C704F5D95B9D01BD99C00BA3AFA"
}

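The content is rich, but it lacks the one thing we want most.
Now for the key part. An approach that even a beginner like me could think of had, unsurprisingly, already been worked out by others, and I was glad to find a clean solution.
The Chrome DevTools Network.getResponseBody method is exactly what we want.
https://chromedevtools.github.io/devtools-protocol/tot/Network/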


Network.getResponseBody

Returns content served for the given request.
parameters

requestId
    RequestId

    Identifier of the network request to get content for.

Return Object

body
    string

    Response body.
base64Encoded
    boolean

    True, if content was sent as base64.
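
The two return fields above imply one small step before the body can be used: when base64Encoded is true, the body string has to be base64-decoded. The following is a minimal sketch of that step. `decode_response_body` is a hypothetical helper, and the sample dicts only mimic the documented return shape.

```python
import base64


def decode_response_body(result: dict) -> bytes:
    """Turn a Network.getResponseBody-shaped dict into raw bytes."""
    body = result.get('body', '')
    if result.get('base64Encoded'):
        # Binary content (images, fonts, ...) arrives base64-encoded.
        return base64.b64decode(body)
    # Text content arrives as a plain string.
    return body.encode('utf-8')


# Hand-written samples in the documented shape, not captured from a browser.
print(decode_response_body({'body': '{"origin": "1.2.3.4"}', 'base64Encoded': False}))
print(decode_response_body({'body': 'aGVsbG8=', 'base64Encoded': True}))
```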

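The approach: get the network log through chromedriver. Response entries in the log have the method Network.responseReceived and carry a requestId parameter, and the Chrome DevTools method Network.getResponseBody can fetch the response by that requestId. Enough talk, let's try the code!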

# Uses the Selenium 3 style API (desired_capabilities)
import json
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Copy so the shared DesiredCapabilities.CHROME dict is not mutated
caps = DesiredCapabilities.CHROME.copy()
# Enable the browser and performance logs
caps['loggingPrefs'] = {
    'browser': 'ALL',
    'performance': 'ALL',
}
caps['perfLoggingPrefs'] = {
    'enableNetwork': True,
    'enablePage': False,
    'enableTimeline': False,
}

option = webdriver.ChromeOptions()
option.add_argument('--no-sandbox')
option.add_argument('--headless')
option.add_argument("--disable-extensions")
option.add_argument("--allow-running-insecure-content")
option.add_argument("--ignore-certificate-errors")
option.add_argument("--disable-single-click-autofill")
option.add_argument("--disable-autofill-keyboard-accessory-view")
option.add_argument("--disable-full-form-autofill-ios")
option.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:55.0) Gecko/20100101 Firefox/55.0')
# Without this, fetching the log raises an error
option.add_experimental_option('w3c', False)
option.add_experimental_option('perfLoggingPrefs', {
    'enableNetwork': True,
    'enablePage': False,
})
driver = webdriver.Chrome(options=option, desired_capabilities=caps)
driver.get('http://httpbin.org/get')
# Only the performance log carries JSON-encoded DevTools events;
# browser log messages are plain text and would break json.loads
for row in driver.get_log('performance'):
    log = json.loads(row['message'])['message']
    if log['method'] == 'Network.responseReceived':
        request_id = log['params']['requestId']
        try:
            response_body = driver.execute_cdp_cmd('Network.getResponseBody', {'requestId': request_id})
            print(response_body['body'])
        except Exception:
            # Raised when no body is available for this requestId
            print('response.body is null')
driver.quit()
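
Calling Network.getResponseBody for every response is often unnecessary when only API responses matter. The following is a minimal sketch that narrows the performance-log rows down to JSON responses first, using the mimeType field visible in the log structure shown earlier. `json_response_ids` is a hypothetical helper, and the sample rows are hand-built to mimic what driver.get_log('performance') returns.

```python
import json


def json_response_ids(rows):
    """Yield (requestId, url) for Network.responseReceived events with a JSON MIME type."""
    for row in rows:
        try:
            event = json.loads(row['message'])['message']
        except (KeyError, TypeError, ValueError):
            continue  # not a JSON-encoded DevTools event
        if not isinstance(event, dict):
            continue
        if event.get('method') != 'Network.responseReceived':
            continue
        response = event['params']['response']
        if response.get('mimeType') == 'application/json':
            yield event['params']['requestId'], response['url']


# Hand-built rows: one JSON response, one HTML response, one non-JSON message.
sample_rows = [
    {'message': json.dumps({'message': {
        'method': 'Network.responseReceived',
        'params': {'requestId': 'BA3FE8EAD2BB30562343F3940AEB8CBC',
                   'response': {'mimeType': 'application/json',
                                'url': 'http://httpbin.org/get'}}}})},
    {'message': json.dumps({'message': {
        'method': 'Network.responseReceived',
        'params': {'requestId': 'OTHER',
                   'response': {'mimeType': 'text/html',
                                'url': 'http://httpbin.org/'}}}})},
    {'message': 'plain text, not JSON'},
]
print(list(json_response_ids(sample_rows)))
```

Each yielded requestId could then be passed to driver.execute_cdp_cmd('Network.getResponseBody', ...) exactly as in the loop above.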

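With that, monitoring chromedriver's network traffic is solved! You can switch chromedriver proxies freely and set the User-Agent, the official documentation covers many everyday browser operations, and now you can also read WebDriver's network responses directly. I believe this makes WebDriver-based crawler development far more capable.

Chapter 4: Switching flexibly between WebDriver and requests

With the techniques above WebDriver is usable, but efficiency remains a problem. WebDriver can meet the requirements, yet the workflow can be very cumbersome. Take one scenario: we have a long list page and need to scrape each detail page in turn, and each detail page also requires simulated clicks elsewhere. For WebDriver this is a nightmare because the chain of operations is too long. In that case, whatever can be wrapped as a plain request is best issued from code. We can carry the cookies over and use requests, or use the requestium module's transfer_driver_cookies_to_session() and transfer_session_cookies_to_driver() to improve efficiency. Enough talk, here is the code: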

# coding=utf-8
import time
from requestium import Session, Keys

s = Session(webdriver_path='/Users/lib/Documents/chromedriver',
            browser='chrome',
            default_timeout=15,
            webdriver_options={'arguments': ['disable-gpu', "--user-agent=Mozilla/5.0 (iPod; U; CPU iPhone OS 2_1 like Mac OS X; ja-jp) AppleWebKit/525.18.1 (KHTML, like Gecko) Version/3.1.1 Mobile/5F137 Safari/525.20"]})
            # webdriver_options={'arguments': ['headless','--proxy-server=http://user:pwd@ip2.hahado.cn:40275']})
            # webdriver_options={'arguments': ['headless','--proxy-server=http://116.11.254.37:80']})
# Other URLs tried during testing:
# url = 'https://icanhazip.com/'
# url = 'https://www.che300.com/pinggu/v9c9m30614r2016-06g2.8'
# url = 'https://magi.com/search?q=%E6%96%B0%E5%86%A0'
url = 'http://m.baidu.com/s?word=%E6%88%BF%E5%B1%B1%E6%96%B0%E7%9B%98'
s.driver.get(url)
html = s.driver.page_source

print(html)
# The page shows this message ("too many visits, enter the captcha") when it blocks the crawler
if '访问太过于频繁,请输入验证码后再次访问' in html:
    code = input('Enter the captcha: ')
    print('You entered: %s' % code)
    s.driver.find_element_by_xpath("//input[@name='code']").send_keys(code, Keys.ENTER)
    time.sleep(1)
    s.driver.ensure_element_by_xpath("//input[@type='submit']").click()
    # Uncomment to copy the browser's cookies into the requests session
    # s.transfer_driver_cookies_to_session()

# Uncomment to copy the requests session's cookies into the browser
# s.transfer_session_cookies_to_driver()
time.sleep(3)
print(s.driver.page_source)
s.driver.quit()
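
The other option mentioned above, carrying the cookies over and using requests directly, comes down to reshaping the cookie list. Selenium's driver.get_cookies() returns a list of dicts with 'name' and 'value' keys, and requests accepts a plain {name: value} dict through its cookies= argument. The following is a minimal sketch of that conversion. `cookies_for_requests` is a hypothetical helper, the sample cookies are invented, and domain and path scoping is deliberately dropped.

```python
def cookies_for_requests(driver_cookies):
    """Map a list of cookie dicts with 'name' and 'value' keys to {name: value}."""
    return {cookie['name']: cookie['value'] for cookie in driver_cookies}


# Invented sample in the shape of driver.get_cookies() output.
sample_cookies = [
    {'name': 'sessionid', 'value': 'abc123', 'domain': 'example.com', 'path': '/'},
    {'name': 'lang', 'value': 'zh-CN', 'domain': 'example.com', 'path': '/'},
]
print(cookies_for_requests(sample_cookies))
```

The resulting dict could be passed as requests.get(url, cookies=...) once the browser has cleared the login or captcha step.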

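Pitfalls encountered:

1. When fetching logs, omitting option.add_experimental_option('w3c', False) causes an error. (In W3C mode ChromeDriver expects the vendor-prefixed capability goog:loggingPrefs rather than loggingPrefs.)
2. When Network.getResponseBody fetches a request's body, the program raises an error if the body is empty:

No resource with given identifier found

Add exception handling.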

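3. After installing chrome and chromedriver on a CentOS server, screenshots cannot display Chinese. Cause: the system defaults to an English character set and has no Chinese language pack or fonts. The fix is as follows.
Check the server's character set
[root@192-168-17-194 htms-op-ui-automation-test]# echo $LANG
en_US.UTF-8
Confirm whether the system supports the Chinese character set
locale -a |grep CN
Change the default character set on CentOS 7
vim /etc/locale.conf
# LANG="en_US.UTF-8"
LANG="zh_CN.UTF-8"
Install Chinese character set support
yum -y groupinstall "X Window System"
yum -y groupinstall chinese-support # on CentOS 7, an error that chinese-support cannot be found can be ignored
yum -y groupinstall Fonts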

Contact the author:

QQ: 739669518
Crawler tech discussion group: 259788518
