Scraping technical documentation and saving it as Markdown
Whisper Lv4

Source site

Python Cookbook: https://python3-cookbook.readthedocs.io/zh_CN/latest/

The html2text library is needed to convert HTML to Markdown, so install it first

pipenv install html2text

BeautifulSoup4 is used to parse the pages

pipenv install beautifulsoup4

Scraping

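Open the browser developer tools (F12) to inspect the page structure and locate the top-level table-of-contents entries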

trees = bs.find_all("li", class_="toctree-l1")

Iterate over the top-level entries and take the link from each one

a = tree.find("a")
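The two fragments above can be combined into one loop. The following is a minimal sketch that runs against a small inline HTML sample instead of the live site; `SAMPLE_HTML` and `top_level_links` are hypothetical names introduced here, and the sample only imitates the structure of a Sphinx table of contents.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that imitates a Sphinx-generated table of contents.
SAMPLE_HTML = """
<ul>
  <li class="toctree-l1"><a href="preface.html">Preface</a>
    <ul><li class="toctree-l2"><a href="preface.html#intro">Intro</a></li></ul>
  </li>
  <li class="toctree-l1"><a href="chapters/p01.html">Chapter 1</a></li>
  <li class="toctree-l1">No link here</li>
</ul>
"""


def top_level_links(html):
    """Return (title, href) pairs for every top-level entry that has a link."""
    bs = BeautifulSoup(html, "html.parser")
    links = []
    for tree in bs.find_all("li", class_="toctree-l1"):
        a = tree.find("a")  # first <a> inside the entry, or None
        if a is None or not a.get("href"):
            continue  # skip entries without a usable link
        links.append((a.get_text(strip=True), a["href"]))
    return links


print(top_level_links(SAMPLE_HTML))
```

Since `find("a")` returns `None` when an entry has no link, the guard keeps the loop from failing on such entries.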

Build the full URL of each article

url = base + "/" + href
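Plain string concatenation produces a double slash when `base` already ends with `/`. As an alternative, the standard library's `urljoin` can resolve the relative `href` against the base. This is a sketch that assumes the base is the site URL listed at the top of the post; `BASE` and `full_url` are names introduced here for illustration.

```python
from urllib.parse import urljoin

# Assumption: the base is the documentation root, with a trailing slash.
BASE = "https://python3-cookbook.readthedocs.io/zh_CN/latest/"


def full_url(base, href):
    """Resolve a relative href against the base URL."""
    return urljoin(base, href)


print(full_url(BASE, "preface.html"))
# → https://python3-cookbook.readthedocs.io/zh_CN/latest/preface.html
```

The trailing slash on the base matters: without it, `urljoin` treats the last path segment as a file name and replaces it.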

Fetch the page and parse it with bs4

content = res.content
bs = BeautifulSoup(content, 'html.parser')
body = bs.find(class_="document")
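In the fragment above, `res` is presumably a response from an HTTP client such as requests, although the post does not show that call. A minimal sketch of the parsing step with a guard for pages that lack the expected container follows; `extract_document` and the sample HTML are hypothetical and only illustrate the idea.

```python
from bs4 import BeautifulSoup


def extract_document(content):
    """Parse raw HTML (bytes or str) and return the element with class "document"."""
    bs = BeautifulSoup(content, "html.parser")
    body = bs.find(class_="document")
    if body is None:
        # find() returns None when nothing matches, so fail early and clearly.
        raise ValueError('no element with class "document" found')
    return body


# Hypothetical sample page; with requests this would be something like
# extract_document(requests.get(url, timeout=10).content).
sample = b'<html><body><div class="document"><h1>Title</h1><p>Text</p></div></body></html>'
body = extract_document(sample)
print(body.h1.get_text())
```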

Inside the loop, save each page as a Markdown document

m = toMarkDown.ToMD(url, new_save_path, "/{0}.md".format(title.replace("/", "")))
markdown = m.to_md(body.prettify())
m.save_file(markdown)

Using Scrapy callbacks

start_urls = ['xxx']  

def parse(self, response):
    lis = response.xpath('//li[@class="toctree-l1"]')
    for index, li in enumerate(lis):
        t = str(index+1).rjust(2, '0')
        yield {
            "title": t + '-' + li.css("a::text").get(),
            "url": li.css("a::attr(href)").get(),
        }