URL

Python Cookbook: https://python3-cookbook.readthedocs.io/zh_CN/latest/

Libraries

The html2text library converts HTML to Markdown. Install it first:

pipenv install html2text

BeautifulSoup4 parses the pages:

pipenv install beautifulsoup4

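Crawling

Press F12 to open the browser's developer tools, inspect the page structure, and locate the top-level table-of-contents entries:

trees = bs.find_all("li", class_="toctree-l1")

Each entry (tree) in trees contains a link to one article:

a = tree.find("a")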
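As a minimal sketch of the lookup above, the snippet below runs the same two calls against a small inline HTML fragment. The fragment and the `toc_links` helper name are made up for illustration, and the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating a Sphinx table of contents (an assumption,
# not copied from the real page).
SAMPLE_HTML = """
<ul>
  <li class="toctree-l1"><a href="preface.html">Preface</a></li>
  <li class="toctree-l1"><a href="c01/p01.html">Chapter 1</a>
    <ul><li class="toctree-l2"><a href="c01/p01.html#s1">1.1</a></li></ul>
  </li>
</ul>
"""


def toc_links(html):
    """Return (title, href) pairs for every top-level TOC entry."""
    bs = BeautifulSoup(html, "html.parser")
    links = []
    for tree in bs.find_all("li", class_="toctree-l1"):
        a = tree.find("a")  # first <a> inside the entry
        if a is not None and a.get("href"):
            links.append((a.get_text(strip=True), a["href"]))
    return links


print(toc_links(SAMPLE_HTML))
```

Nested `toctree-l2` entries are skipped because only `li` elements with the `toctree-l1` class are selected.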

Build the article's full URL from the link's href:

url = base + "/" + href
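Plain string concatenation yields a doubled slash when `base` already ends in `/`. One alternative, shown here as a sketch rather than the method the original uses, is `urllib.parse.urljoin` from the standard library. The `full_url` name and the sample `href` value are made up for illustration.

```python
from urllib.parse import urljoin

BASE = "https://python3-cookbook.readthedocs.io/zh_CN/latest/"


def full_url(base, href):
    """Resolve a (possibly relative) href against the index page URL."""
    return urljoin(base, href)


# "c01/p01.html" is an illustrative href, not taken from the real page.
print(full_url(BASE, "c01/p01.html"))
```

Note that `urljoin` treats the last path segment of `base` as a file name unless `base` ends with `/`, so the trailing slash on `BASE` matters here.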

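Fetch the page and parse it with bs4:

import requests  # assumption: res.content suggests requests, but the original does not show the request
from bs4 import BeautifulSoup

res = requests.get(url)  # assumed; not part of the original snippet
content = res.content
bs = BeautifulSoup(content, 'html.parser')
body = bs.find(class_="document")  # element holding the article body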

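Inside the loop, save each document in Markdown format:

# toMarkDown is a helper module whose source is not shown here
m = toMarkDown.ToMD(url, new_save_path, "/{0}.md".format(title.replace("/", "")))
markdown = m.to_md(body.prettify())
m.save_file(markdown)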
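The `toMarkDown` module itself is not shown. As a rough guess at what its saving step might involve, the sketch below writes Markdown text to a file named after the article title, dropping `/` from the title as the call above does. `save_markdown` and its parameters are hypothetical names, not the module's actual API.

```python
from pathlib import Path
import tempfile


def save_markdown(save_dir, title, markdown):
    """Write markdown text to <save_dir>/<title>.md and return the path.

    Hypothetical helper: slashes are removed from the title so that it
    cannot be mistaken for a directory separator.
    """
    directory = Path(save_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / "{0}.md".format(title.replace("/", ""))
    path.write_text(markdown, encoding="utf-8")
    return path


# Usage with a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_markdown(tmp, "01-I/O basics", "# I/O basics\n")
    print(saved.name)
```

Only `/` is stripped here. Other characters that are invalid in file names on some systems (such as `:` or `?` on Windows) would need similar handling.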

Using a Scrapy callback

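import scrapy


class CookbookSpider(scrapy.Spider):
    name = "cookbook"  # placeholder spider name, not from the original
    start_urls = ['xxx']

    def parse(self, response):
        # Each top-level table-of-contents entry is an <li class="toctree-l1">
        lis = response.xpath('//li[@class="toctree-l1"]')
        for index, li in enumerate(lis):
            # Zero-padded sequence number: 01, 02, ...
            t = str(index + 1).rjust(2, '0')
            yield {
                "title": t + '-' + li.css("a::text").get(),
                "url": li.css("a::attr(href)").get(),
            }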
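The `rjust(2, '0')` call pads the index so that the resulting titles sort in chapter order. Here is a standalone sketch of that numbering, outside Scrapy and with made-up titles. `numbered_titles` is a hypothetical name.

```python
def numbered_titles(titles):
    """Prefix each title with a 1-based, zero-padded sequence number."""
    result = []
    for index, title in enumerate(titles):
        t = str(index + 1).rjust(2, "0")  # 1 -> "01", 10 -> "10"
        result.append(t + "-" + title)
    return result


print(numbered_titles(["Preface", "Data Structures"]))
```

A width of 2 keeps lexical order only up to 99 entries. A longer table of contents would need a wider pad.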