Requirements analysis

We will not crawl every page on the site, only the news listings.

https://news.cnblogs.com/ : serves static HTML (static pages like this are becoming rare)

URL analysis: https://news.cnblogs.com/n/page/2/ — following the "Next" button is a general-purpose pagination strategy
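Since the listing pages follow the /n/page/N/ pattern, the page URLs can also be generated directly. A minimal sketch (the page count passed in is arbitrary; the real spider would follow the "Next" link instead of hard-coding a limit):

```python
# Build cnblogs news listing URLs following the /n/page/N/ pattern.
BASE = "https://news.cnblogs.com/n/page/{}/"

def page_urls(last_page):
    """Return listing-page URLs from page 1 up to last_page."""
    return [BASE.format(i) for i in range(1, last_page + 1)]

print(page_urls(2))
# ['https://news.cnblogs.com/n/page/1/', 'https://news.cnblogs.com/n/page/2/']
```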

Quick-starting Scrapy from the command line

# scrapy crawl <SpiderName>
scrapy crawl cnblogs

To launch Scrapy from a script instead, create main.py in the project root:

.
├── Article
│   ├── __init__.py
│   ├── __pycache__
│   │   ├── __init__.cpython-38.pyc
│   │   └── settings.cpython-38.pyc
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── __pycache__
│       │   ├── __init__.cpython-38.pyc
│       │   └── cnblogs.cpython-38.pyc
│       └── cnblogs.py
├── main.py
└── scrapy.cfg
from scrapy.cmdline import execute

import sys
import os

# Add the project root to Python's module search path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

execute(["scrapy","crawl","cnblogs"])

Set a breakpoint in cnblogs.py, then right-click main.py in the IDE and run it with Debug — the spider runs successfully under the debugger.
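Inside the spider's parse method, following the "Next" button comes down to extracting its href and requesting it. That extraction step can be sketched with the standard library alone — the HTML snippet and the "Next &gt;" link text below are hypothetical markup for illustration, not copied from the real cnblogs page:

```python
import re

def next_page_href(html):
    """Extract the href of the 'Next' pagination link, or None.

    Assumes the next button is an <a> tag whose text is 'Next &gt;'
    (hypothetical markup for illustration).
    """
    m = re.search(r'<a[^>]+href="([^"]+)"[^>]*>\s*Next\s*&gt;\s*</a>', html)
    return m.group(1) if m else None

sample = '<div class="pager"><a href="/n/page/3/">Next &gt;</a></div>'
print(next_page_href(sample))  # /n/page/3/
```

In the actual spider you would do the equivalent with a CSS or XPath selector on the response, then yield a scrapy.Request for the extracted URL.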
