Debugging requests in the terminal with scrapy shell

We use a cnblogs (博客园) news detail page as the example.

# Activate the virtual environment first
workon py3scrapy
# scrapy shell <the URL you want to debug>
scrapy shell https://news.cnblogs.com/n/647662/

You should then see output like the following. The 200 status code means the HTTP request succeeded.

2019-11-17 16:22:42 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-11-17 16:22:42 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-11-17 16:22:42 [scrapy.core.engine] INFO: Spider opened
2019-11-17 16:22:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://news.cnblogs.com/n/647662/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x109b7cf98>
[s]   item       {}
[s]   request    <GET https://news.cnblogs.com/n/647662/>
[s]   response   <200 https://news.cnblogs.com/n/647662/>
[s]   settings   <scrapy.settings.Settings object at 0x10aaa16d8>
[s]   spider     <DefaultSpider 'default' at 0x10af16438>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> response.css("#news_title>a::text").extract_first("")
'今日头条不急IPO'
>>>

The response object above is the same as the response we have been using so far.

# Enter this in the shell to get the page title
response.css("#news_title>a::text").extract_first("")
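To make explicit what the selector `#news_title>a::text` matches, here is a minimal standard-library sketch. It is not how Scrapy implements selectors; it only mimics this one selector with `html.parser`. The HTML snippet, the `NewsTitleParser` class, and the `first_title` helper are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class NewsTitleParser(HTMLParser):
    """Collect text nodes directly inside <a> elements whose parent
    element has id="news_title" (the meaning of `#news_title>a::text`)."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements as (tag, attrs) pairs
        self.texts = []  # matched text nodes, in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if len(self.stack) >= 2:
            tag = self.stack[-1][0]
            parent_attrs = self.stack[-2][1]
            if tag == "a" and parent_attrs.get("id") == "news_title":
                self.texts.append(data)


def first_title(html, default=""):
    """Return the first matched text node, or `default` if nothing matches."""
    parser = NewsTitleParser()
    parser.feed(html)
    parser.close()
    return parser.texts[0] if parser.texts else default


# Hypothetical markup, shaped like what the selector expects.
sample = '<div id="news_title"><a href="/n/647662/">今日头条不急IPO</a></div>'
print(first_title(sample))
print(repr(first_title("<p>no title here</p>")))  # ''
```

The empty-string default plays the same role as the `""` argument to `extract_first("")`: a missing title yields an empty string instead of `None`.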

Installing requests

# Upgrade pip
pip3 install --upgrade pip
# Check whether requests is already among the installed packages
pip freeze | grep requests
# Install requests
pip3 install requests

The installation succeeded, yet entering the following in the Python interpreter

import requests

raises an error:

>>> import requests
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'requests'
>>>

A detailed fix is described here: https://blog.csdn.net/jusang486/article/details/82662423

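Put the module path in an environment variable so that it applies globally and ends up on sys.path.

Set PYTHONPATH to the directory that contains the module, e.g. export PYTHONPATH=/path/to/your/module

For the exact path, see "Finding the Python site-packages path on macOS" in the Common Tips section.

Here it is: /Users/scottxiong/.virtualenvs/py3scrapy/lib/python3.6/site-packages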
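To check that a PYTHONPATH entry really reaches sys.path, the following sketch starts a fresh interpreter with PYTHONPATH set and inspects the child's sys.path. The temporary directory and the `child_sys_path` helper are assumptions for illustration; in practice you would use your own site-packages path.

```python
import os
import subprocess
import sys
import tempfile


def child_sys_path(extra_dir):
    """Start a fresh interpreter with PYTHONPATH=extra_dir and return its sys.path."""
    env = dict(os.environ, PYTHONPATH=extra_dir)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print('\\n'.join(sys.path))"],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.splitlines()


with tempfile.TemporaryDirectory() as demo_dir:
    paths = child_sys_path(demo_dir)
    target = os.path.realpath(demo_dir)
    # Compare resolved paths so symlinked temp directories still match.
    found = any(os.path.realpath(p) == target for p in paths if p)
    print("PYTHONPATH entry visible in sys.path:", found)
```

Note that PYTHONPATH is read when the interpreter starts, so a change only affects Python processes launched afterwards.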

Alternatively, add the module path explicitly in the script you run (recommended, since it is more flexible):

import sys

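# 1. Add the parent directory to the search path
#    ('..' is resolved against the current working directory, not this file)
sys.path.append('..')

# 2. Add an absolute path
sys.path.append('/project/model')

# XX and XXX are placeholders for your own module and name
from XX import XXX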

I solved the problem with the first approach.
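For the script-based approach, a relative entry such as '..' depends on where the script is launched from. The sketch below instead appends a resolved absolute path and imports a module from it. The `add_to_sys_path` helper and the `demo_helper` module are hypothetical and exist only for this demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def add_to_sys_path(directory):
    """Append an absolute directory to sys.path once and return it as a string."""
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:
        sys.path.append(resolved)
    return resolved


# In a real script you might pass Path(__file__).resolve().parent.parent
# instead of '..', so the result does not depend on the working directory.
with tempfile.TemporaryDirectory() as tmp:
    # demo_helper is a hypothetical module created only for this demonstration.
    Path(tmp, "demo_helper.py").write_text("VALUE = 42\n")
    added = add_to_sys_path(tmp)
    importlib.invalidate_caches()
    demo_helper = importlib.import_module("demo_helper")
    print(demo_helper.VALUE)  # 42
    sys.path.remove(added)  # clean up so the temporary path does not linger
```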

Extracting API data with requests and the built-in json package

Use the following code to debug directly in the terminal:

import requests
response = requests.get("API endpoint URL")  # replace with the real API URL
response.text

For example:

>>> import requests
>>> response = requests.get("https://news.cnblogs.com/NewsAjax/GetAjaxNewsInfo?contentId=647662")
>>> response
<Response [200]>
>>> response.text
'{"ContentID":647662,"CommentCount":0,"TotalView":264,"DiggCount":0,"BuryCount":0}'
>>> json.loads(response.text)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'json' is not defined
>>> import json
>>> json.loads(response.text)
{'ContentID': 647662, 'CommentCount': 0, 'TotalView': 264, 'DiggCount': 0, 'BuryCount': 0}
>>> json_data = json.loads(response.text)
>>> json_data['TotalView']
264
>>>

The json package's loads function parses a JSON string into a Python object (a dict in this case).
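As a small offline sketch of the parsing step, the code below reuses the response text from the session above and guards against a body that is not valid JSON. The `parse_news_info` helper is an assumption for illustration, not part of requests or Scrapy.

```python
import json

# Sample payload copied from the response.text shown in the session above.
raw = '{"ContentID":647662,"CommentCount":0,"TotalView":264,"DiggCount":0,"BuryCount":0}'


def parse_news_info(text):
    """Parse the API payload; return None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


info = parse_news_info(raw)
print(type(info).__name__)        # dict
print(info["TotalView"])          # 264
print(info.get("Missing", 0))     # 0, .get avoids a KeyError for absent keys
print(parse_news_info("<html>"))  # None, e.g. when an error page comes back
```

With requests, calling response.json() performs the same parsing without importing json yourself.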
