Scraping Bing with Python (Wallpapers + Search Terms)

青旅半醒 2023-02-19 12:29

Scraping Bing Wallpapers

If you use Bing regularly, you have probably noticed that its homepage features a new image every day. These images are beautiful, and it would be nice to download and keep every one of them. They are all collected on this site: Bing daily HD wallpapers (https://bing.ioliu.cn/).

The result looks like this:

[screenshot of the downloaded wallpapers]

The code:

    import requests
    import re
    import os

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}

    def get_page(num):
        # Build the URL of every listing page on bing.ioliu.cn.
        page_list = []
        for i in range(1, num + 1):
            page_list.append(f'https://bing.ioliu.cn/?p={i}')
        return page_list

    def get_html(url):
        r = requests.get(url, headers=headers)
        return r.text

    def parse_html(html):
        # Thumbnail URLs and the matching titles.
        pattern1 = re.compile(r'data-progressive.*?src="(.*?)"')
        pattern2 = re.compile(r'<h3>(.*?)</h3>')
        img_list = re.findall(pattern1, html)
        title_list = re.findall(pattern2, html)
        return img_list, title_list

    def download(path, img_list, title_list):
        for i in range(len(img_list)):
            img_url = img_list[i]
            title = title_list[i]
            # Swap the thumbnail resolution for the full 1920x1080 version.
            img_url = img_url.replace('640', '1920').replace('480', '1080')
            # Strip characters that are awkward in file names.
            pattern3 = re.compile(r'[()\-/_]')
            title = re.sub(pattern3, '', title)
            print(f'Downloading: {img_url}')
            img_folder = path + keyword
            if not os.path.exists(img_folder):
                os.makedirs(img_folder)
            img_path = f'{img_folder}/{title}.jpg'
            with open(img_path, 'wb') as f:
                f.write(requests.get(img_url).content)
            # Remove files left behind by failed downloads.
            if os.path.getsize(img_path) < 50:
                os.remove(img_path)

    if __name__ == '__main__':
        num = 20
        keyword = '必应壁纸'  # subfolder name ("Bing wallpapers")
        path = 'D:/图片/'     # base folder ("D:/Pictures/")
        page_list = get_page(num)
        for page in page_list:
            html = get_html(page)
            img_list, title_list = parse_html(html)
            download(path, img_list, title_list)
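The resolution swap works because the thumbnail URLs embed the image size; a minimal illustration (the URL below is made up just to show the pattern; only the replace() calls come from the code above):

    thumb = 'http://h1.ioliu.cn/bing/Example_ZH-CN_640x480.jpg'  # hypothetical thumbnail URL
    full = thumb.replace('640', '1920').replace('480', '1080')
    print(full)  # http://h1.ioliu.cn/bing/Example_ZH-CN_1920x1080.jpg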

Scraping Bing Images by Search Term

One thing to watch out for: the text returned by requests.get(url, headers=headers).text contains many HTML-escaped characters (for example, a double quote becomes &quot;), which breaks regex matching.

Two workarounds (see the sketch after this list):

1. Include &quot; in the regex itself.
2. Re-parse the page with etree.HTML, then use XPath to locate the attribute.
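A minimal sketch of both workarounds, using a made-up fragment shaped like the escaped markup (the murl regex matches the real code below; the example URL is illustrative):

    import re
    from lxml import etree

    # Illustrative fragment: the page escapes the JSON inside the m attribute.
    raw = '<a class="iusc" m="{&quot;murl&quot;:&quot;http://example.com/a.jpg&quot;}"></a>'

    # Workaround 1: put &quot; into the regex and match the escaped text directly.
    url1 = re.search(r'&quot;murl&quot;:&quot;(.*?)&quot;', raw).group(1)

    # Workaround 2: re-parse with etree.HTML; lxml unescapes attribute values,
    # so a plain regex works on the XPath result.
    m_attr = etree.HTML(raw).xpath('//a[@class="iusc"]/@m')[0]
    url2 = re.search(r'"murl":"(.*?)"', m_attr).group(1)

    print(url1 == url2)  # True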

Problems encountered (a combined sketch of both fixes follows this list):

1. Request timeouts

    Set a timeout so the crawler does not sit on one request forever. socket.setdefaulttimeout(10) sets it globally; passing timeout=10 to requests.get is the more direct option.

        requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.iutour.cn', port=80):
        Max retries exceeded with url: /uploadfile/bjzb/20141126124539763.jpg
        (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001A46192EC50>:
        Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.',))

2. Certificate verification failures

    requests.get(img_url, verify=False) skips certificate verification for hosts with broken TLS.

        requests.exceptions.SSLError: HTTPSConnectionPool(host='bbp.jp', port=443):
        Max retries exceeded with url: /wp-content/uploads/2016/05/2-20.jpg
        (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')],)",),))
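A sketch combining both fixes on a single request (the URL is the one from the first traceback above; note that urllib3 emits an InsecureRequestWarning when verify=False):

    import requests

    headers = {'User-Agent': 'Mozilla/5.0'}  # trimmed; use the full UA string from the code above
    img_url = 'http://www.iutour.cn/uploadfile/bjzb/20141126124539763.jpg'

    try:
        # timeout=10 bounds this one request; verify=False skips certificate
        # checks. Both failure modes raise requests.RequestException subclasses.
        img_content = requests.get(img_url, headers=headers, timeout=10, verify=False).content
    except requests.RequestException as e:
        print(f'Skipped {img_url}: {e}')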

If you would rather not handle each case individually, just wrap the request in a try/except:

    import requests
    import re
    import os
    from lxml import etree

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}

    def get_page(num):
        img_list = []
        # The async endpoint returns 35 results per request; page with `first`.
        for i in range((num // 35) + 1):
            url = f'https://cn.bing.com/images/async?q={keyword}&first={i*35}&count=35&relp=35&scenario=ImageBasicHover&datsrc=I&layout=RowBased_Landscape&mmasync=1'
            r = requests.get(url, headers=headers)
            html = etree.HTML(r.text)
            # The m attribute holds JSON containing the original image URL (murl).
            conda_list = html.xpath('//a[@class="iusc"]/@m')
            pattern = re.compile(r'"murl":"(.*?)"')
            for j in conda_list:
                img_url = re.findall(pattern, j)[0]
                img_list.append(img_url)
        return img_list

    def download(path, img_list):
        img_folder = path + keyword
        if not os.path.exists(img_folder):
            os.makedirs(img_folder)
        for i in range(len(img_list)):
            img_url = img_list[i]
            print(f'Downloading: {img_url}')
            try:
                # Fetch before opening the file so a failed request does not
                # leave an empty .jpg behind.
                img_content = requests.get(img_url, timeout=10).content
            except requests.RequestException:
                continue
            with open(f'{img_folder}/{i}.jpg', 'wb') as f:
                f.write(img_content)

    if __name__ == '__main__':
        num = 100
        keyword = '食品街'   # search term ("food street")
        path = 'D:/图片/'    # base folder ("D:/Pictures/")
        img_list = get_page(num)
        download(path, img_list)
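Side note on the parsing step: the m attribute is itself a JSON string, so json.loads is sturdier than the regex if the field order ever changes; a sketch with an illustrative attribute value:

    import json

    j = '{"murl":"http://example.com/a.jpg"}'  # illustrative m-attribute value
    img_url = json.loads(j)['murl']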
