Python Environment

Configure the pip package index

Edit the file C:\Users\${user}\pip\pip.ini

[global]
index-url = http://mirrors.aliyun.com/pypi/simple/
[install]
trusted-host=mirrors.aliyun.com
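
The same file can also be generated from a script. The following is a minimal sketch using only the standard library; the helper names `write_pip_config` and `default_pip_ini` are hypothetical, and the default path simply mirrors the location given above.

```python
import configparser
from pathlib import Path


def write_pip_config(path: Path) -> None:
    """Write the mirror settings from these notes to a pip config file."""
    config = configparser.ConfigParser()
    config["global"] = {"index-url": "http://mirrors.aliyun.com/pypi/simple/"}
    config["install"] = {"trusted-host": "mirrors.aliyun.com"}
    # Create the pip directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        config.write(handle)


def default_pip_ini() -> Path:
    # Location used in these notes: <user home>\pip\pip.ini
    return Path.home() / "pip" / "pip.ini"
```

Calling `write_pip_config(default_pip_ini())` overwrites any existing pip.ini, so it is worth backing up the file first.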

Install virtualenv

It is best to add the Douban or Aliyun mirror to each command: -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com

pip install virtualenv 
pip install virtualenvwrapper 
pip install virtualenvwrapper-win
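
When the mirror options are added by hand, the `--trusted-host` value has to match the host in the `-i` URL. As a rough sketch (the function name `pip_install_command` is hypothetical), the argument list can be built so that the host is derived from the URL:

```python
import sys
from urllib.parse import urlparse


def pip_install_command(package, index_url=None):
    """Build a `pip install` argument list, optionally using a mirror."""
    # `python -m pip` runs pip for the interpreter executing this script.
    command = [sys.executable, "-m", "pip", "install", package]
    if index_url:
        host = urlparse(index_url).hostname
        command += ["-i", index_url, "--trusted-host", host]
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)` to perform the installation.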

Create clean Python environments

# Target location: C:\Users\Burna\Envs\python27
mkvirtualenv --python=C:\Software\Python\Python27\python.exe python27
# Target location: C:\Users\Burna\Envs\python39
mkvirtualenv --python=C:\Software\Python\Python39\python.exe python39
# Switch to an environment
workon python27
# Delete an environment
rmvirtualenv py3.6
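
After `workon`, it can be useful to confirm which interpreter is actually active. A small sketch, assuming the usual behavior that virtual environments change `sys.prefix` (the function name `in_virtualenv` is hypothetical):

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # Python 3 environments: sys.base_prefix points at the base installation.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    # Older virtualenv releases (for example on Python 2.7) set sys.real_prefix.
    return hasattr(sys, "real_prefix") or base_prefix != sys.prefix


print("virtualenv active:", in_virtualenv())
print("interpreter:", sys.executable)
```

Inside an environment created as above, the interpreter path should point into the Envs directory rather than the base Python installation.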

Install Scrapy

workon python39
pip install scrapy
# List installed packages
pip list
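
The same check can be done from Python, which helps when a script needs to verify that a package is present in the active environment. This is a sketch based on the standard-library `importlib.metadata` module (Python 3.8 or later); the helper names are hypothetical.

```python
from importlib.metadata import distributions


def installed_packages():
    """Return a sorted list of (name, version) for installed distributions."""
    found = {}
    for dist in distributions():
        metadata = dist.metadata
        name = metadata.get("Name") if metadata is not None else None
        if name:  # skip entries with missing or broken metadata
            found[name.lower()] = (name, dist.version)
    return [found[key] for key in sorted(found)]


def is_installed(package):
    """Case-insensitive check for a distribution name."""
    wanted = package.lower()
    return any(name.lower() == wanted for name, _ in installed_packages())
```

For example, `is_installed("scrapy")` should return True in the python39 environment once the install step above has succeeded.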

Getting started with Scrapy

Scrapy 2.6 documentation

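Create a project

# Working directory: C:\Burna\Workspace\Study
# Create the project; the name must not start with a digit
scrapy startproject ArticleSpider
# Rename the directory ArticleSpider => 20220418_ArticleSpider
cd 20220418_ArticleSpider
# Create a new spider
scrapy genspider cnblogs news.cnblogs.com
# Run the spider
scrapy crawl cnblogs
# Start the Scrapy shell (for quickly debugging spider code)
scrapy shell https://news.cnblogs.com/n/719025/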
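
The project name is used as a Python package name, which is why it cannot start with a digit, while the enclosing directory can be renamed freely afterwards. The sketch below illustrates that rule with an assumed identifier-style pattern; Scrapy's own check may differ in detail, and `is_valid_project_name` is a hypothetical helper.

```python
import re

# Assumed rule: a letter or underscore first, then letters, digits, or underscores.
PROJECT_NAME = re.compile(r"[_a-zA-Z]\w*")


def is_valid_project_name(name):
    """Return True if `name` looks usable as a Python package name."""
    return PROJECT_NAME.fullmatch(name) is not None


for candidate in ("ArticleSpider", "20220418_ArticleSpider"):
    print(candidate, is_valid_project_name(candidate))
```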

Project directory
[Image: project directory structure]

Generated code

import scrapy


class CnblogsSpider(scrapy.Spider):
  name = 'cnblogs'
  allowed_domains = ['news.cnblogs.com']
  start_urls = ['http://news.cnblogs.com/']

  def parse(self, response):
    pass

Write the launch script main.py:

import os
import sys

from scrapy.cmdline import execute

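# Directory that contains this script
workdir = os.path.dirname(os.path.abspath(__file__))
print("Working directory => " + workdir)
# Make the project package importable
sys.path.append(workdir)
# Equivalent to running `scrapy crawl cnblogs` on the command line
execute(["scrapy", "crawl", "cnblogs"])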