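Debugging requests in the terminal with scrapy shell
Using a cnblogs news detail page as an example
# First, activate the virtual environment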
workon py3scrapy
# scrapy shell <the URL you want to debug>
scrapy shell https://news.cnblogs.com/n/647662/
You should then see output like the following; the 200 status code means the HTTP request succeeded.
2019-11-17 16:22:42 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-11-17 16:22:42 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-11-17 16:22:42 [scrapy.core.engine] INFO: Spider opened
2019-11-17 16:22:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://news.cnblogs.com/n/647662/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x109b7cf98>
[s] item {}
[s] request <GET https://news.cnblogs.com/n/647662/>
[s] response <200 https://news.cnblogs.com/n/647662/>
[s] settings <scrapy.settings.Settings object at 0x10aaa16d8>
[s] spider <DefaultSpider 'default' at 0x10af16438>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>> response.css("#news_title>a::text").extract_first("")
'今日头条不急IPO'
>>>
The response object above is the same kind of response we used earlier.
# Enter this in the shell to get the page title
response.css("#news_title>a::text").extract_first("")
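To make it clearer what the selector `#news_title>a::text` picks out, here is a rough standard-library sketch that mimics it with `html.parser`. This is a stand-in for illustration only, not how Scrapy implements selectors; the class `NewsTitleText`, the helper `first_title`, and the `SAMPLE` fragment are all hypothetical and do not reproduce the real cnblogs markup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class NewsTitleText(HTMLParser):
    """Collect text inside <a> tags that are direct children of #news_title."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements as (tag, id) pairs
        self.texts = []   # matched text nodes, in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("id")))

    def handle_endtag(self, tag):
        # Pop up to and including the most recent matching open tag.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # "#news_title > a": the innermost element is <a> and its parent has the id.
        if (len(self.stack) >= 2
                and self.stack[-1][0] == "a"
                and self.stack[-2][1] == "news_title"):
            self.texts.append(data)


def first_title(html, default=""):
    """Rough equivalent of .extract_first(default): first match, else the default."""
    parser = NewsTitleText()
    parser.feed(html)
    parser.close()
    return parser.texts[0] if parser.texts else default


# Hypothetical page fragment, not the real cnblogs markup.
SAMPLE = '<div id="news_title"><a href="/n/647662/">Example title</a></div>'
print(first_title(SAMPLE))
```

As with `extract_first("")` in the shell, the second argument is what comes back when nothing matches, so a missing title yields an empty string rather than an error.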
Installing requests
# Upgrade pip
pip3 install --upgrade pip
# Check whether requests is among the installed packages
pip freeze | grep requests
# Install requests
pip3 install requests
The package is clearly installed, yet entering
import requests
in the terminal raises an error:
>>> import requests
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'requests'
>>>
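The fix is described here: https://blog.csdn.net/jusang486/article/details/82662423
Put the module's directory in an environment variable so that it applies globally (Python adds PYTHONPATH entries to sys.path at startup).
Add PYTHONPATH = /path/to/your/module (in a shell: export PYTHONPATH=/path/to/your/module)
For the actual path, see the tip "Finding the Python site-packages path on macOS" under common tips.
Here it is: /Users/scottxiong/.virtualenvs/py3scrapy/lib/python3.6/site-packages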
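One possible cause of this kind of error (an assumption; the source does not confirm it) is that pip3 and the interactive interpreter belong to different Python installations. The sketch below prints which interpreter is running and where it installs pure-Python packages, using only the standard library. The helper name `describe_interpreter` is made up for illustration.

```python
import sys
import sysconfig


def describe_interpreter():
    """Return where this interpreter lives and where it installs pure-Python packages."""
    # Older virtualenv releases set sys.real_prefix; venv and newer virtualenv
    # make sys.prefix differ from sys.base_prefix instead.
    in_virtualenv = (
        hasattr(sys, "real_prefix")
        or sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    )
    return {
        "executable": sys.executable,
        "site_packages": sysconfig.get_paths()["purelib"],
        "in_virtualenv": in_virtualenv,
    }


for key, value in describe_interpreter().items():
    print(f"{key}: {value}")
```

If the printed site-packages directory differs from the one pip reports, that directory is the value to compare against PYTHONPATH.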
Alternatively, add the module path explicitly in the script you run (recommended, more flexible):
import sys
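# 1. Add the parent directory to the search path (a relative path is resolved against the current working directory, not the script's location)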
sys.path.append('..')
# 2. Absolute path
sys.path.append('/project/model')
from XX import XXX
I solved it with the first approach (setting PYTHONPATH).
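For the second approach, a slightly more defensive variant resolves the directory to an absolute path and avoids adding the same entry twice. This is a minimal sketch; the helper `add_to_sys_path` is a hypothetical name, and `".."` is resolved against the current working directory.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory):
    """Append the absolute form of `directory` to sys.path once; return the path used."""
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:
        sys.path.append(resolved)
    return resolved


# Example: the parent of the current working directory, as an absolute path.
added = add_to_sys_path("..")
print(added in sys.path)
```

Inside a script file (not the interactive prompt), `Path(__file__).resolve().parent.parent` can be passed instead to make the path relative to the script rather than to wherever it was launched from.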
Extracting API data with requests and the built-in json package
Use the following code to debug directly in the terminal
import requests
response = requests.get("API endpoint URL")
response.text
For example:
>>> import requests
>>> response = requests.get("https://news.cnblogs.com/NewsAjax/GetAjaxNewsInfo?contentId=647662")
>>> response
<Response [200]>
>>> response.text
'{"ContentID":647662,"CommentCount":0,"TotalView":264,"DiggCount":0,"BuryCount":0}'
>>> json.loads(response.text)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'json' is not defined
>>> import json
>>> json.loads(response.text)
{'ContentID': 647662, 'CommentCount': 0, 'TotalView': 264, 'DiggCount': 0, 'BuryCount': 0}
>>> json_data = json.loads(response.text)
>>> json_data['TotalView']
264
>>>
The json package's loads method parses a JSON string into a Python object (here, a dict).
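The same parsing step can be tried offline on the response body shown above. The sketch below uses only the standard library; the helper `parse_or_none` is a hypothetical name added to show what happens when the body is not valid JSON.

```python
import json

# Response body copied from the session above.
text = '{"ContentID":647662,"CommentCount":0,"TotalView":264,"DiggCount":0,"BuryCount":0}'

data = json.loads(text)
print(type(data).__name__)   # dict
print(data["TotalView"])     # 264


def parse_or_none(body):
    """Return the parsed JSON value, or None if `body` is not valid JSON."""
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return None


print(parse_or_none("<html>not json</html>"))  # None
```

With requests, calling `response.json()` performs the same parse in one step, so `import json` is only needed when working from `response.text` directly.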