Scraping technical documentation and saving it as Markdown
URL
Python Cookbook: https://python3-cookbook.readthedocs.io/zh_CN/latest/
Libraries
An HTML-to-Markdown converter is needed; html2text is used here. Install it first:
pipenv install html2text
BeautifulSoup4 is used to parse the pages:
pipenv install beautifulsoup4
Scraping
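Press F12 to open the browser developer tools, inspect the page structure, and locate the top-level table-of-contents entries:
trees = bs.find_all("li", class_="toctree-l1")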
Iterate over the top-level entries and get the links under each one:
a = tree.find("a")
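The two steps above can be sketched together as follows. The inline HTML is a made-up stand-in for the real table of contents, and the `links` list is a hypothetical name introduced for this example.

```python
from bs4 import BeautifulSoup

# Made-up markup imitating a Sphinx table of contents.
html = """
<ul>
  <li class="toctree-l1"><a href="preface.html">Preface</a></li>
  <li class="toctree-l1"><a href="chapters/p01.html">Chapter 1</a>
    <ul><li class="toctree-l2"><a href="c01/p01.html">1.1</a></li></ul>
  </li>
</ul>
"""

bs = BeautifulSoup(html, "html.parser")
trees = bs.find_all("li", class_="toctree-l1")

links = []
for tree in trees:
    a = tree.find("a")  # the first <a> inside the <li> is the top-level entry itself
    if a is not None and a.get("href"):
        links.append((a.get_text(strip=True), a["href"]))

print(links)
```

Note that `find` returns only the first matching link in each entry; `tree.find_all("a")` would also pick up the nested second-level links.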
Build the full URL of each article:
url = base + "/" + href
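Plain string concatenation works as long as `base` has no trailing slash and `href` is a simple relative path. A possible alternative, shown here as a sketch rather than what the original script does, is `urllib.parse.urljoin` from the standard library. The `build_url` helper is a hypothetical name.

```python
from urllib.parse import urljoin

# Assumption: the documentation root from the URL section, with a trailing slash.
base = "https://python3-cookbook.readthedocs.io/zh_CN/latest/"


def build_url(base, href):
    """Resolve href against base the way a browser would."""
    return urljoin(base, href)


url = build_url(base, "chapters/p01.html")
print(url)
```

With `urljoin` the trailing slash on `base` matters: without it, the last path segment of `base` is replaced instead of extended. Absolute links in `href` are returned unchanged.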
Fetch the page and parse it with bs4:
content = res.content
Loop over the articles and save each one as a Markdown file:
m = toMarkDown.ToMD(url, new_save_path, "/{0}.md".format(title.replace("/", "")))
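The `toMarkDown` module is the author's own and its code is not shown here. The file-writing part of this step might look roughly like the sketch below, where `save_markdown` is a hypothetical helper and the slash-stripping mirrors the `title.replace("/", "")` call above.

```python
from pathlib import Path


def save_markdown(markdown_text, save_dir, title):
    """Write markdown_text to <save_dir>/<title>.md and return the file path."""
    # Slashes would be read as directory separators, so drop them from the title.
    safe_title = title.replace("/", "").strip() or "untitled"
    directory = Path(save_dir)
    directory.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    path = directory / f"{safe_title}.md"
    path.write_text(markdown_text, encoding="utf-8")
    return path
```

Writing with an explicit UTF-8 encoding avoids platform-dependent defaults, which matters for Chinese article titles and text.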
Using Scrapy callbacks
start_urls = ['xxx']