Python: scraping rental housing data

谁借莪1个温暖的怀抱¢ · 2022-05-14 07:42 · 340 reads · 0 likes

Scrape target: [https://sh.lianjia.com/zufang/][https_sh.lianjia.com_zufang]

The code is as follows:

```python
import requests
# BeautifulSoup parses the HTML responses
from bs4 import BeautifulSoup
# xlwt writes the results into an Excel workbook
from xlwt import *
import json

# Create a workbook
book = Workbook(encoding='utf-8')
# Add a sheet named 'sheet1'; allow cells to be overwritten
sheet = book.add_sheet('sheet1', cell_overwrite_ok=True)

# Cell style (solid fill pattern)
style = XFStyle()
pattern = Pattern()
pattern.pattern = Pattern.SOLID_PATTERN
pattern.pattern_fore_colour = 0x00
style.pattern = pattern

# Column headers
sheet.write(0, 0, "标题")
sheet.write(0, 1, "地址")
sheet.write(0, 2, "价格")
sheet.write(0, 3, "建筑年代")
sheet.write(0, 4, "满年限")
sheet.write(0, 5, "离地铁")

# Column widths
sheet.col(0).width = 0x0d00 + 200 * 50
sheet.col(1).width = 0x0d00 + 20 * 50
sheet.col(2).width = 0x0d00 + 10 * 50
sheet.col(3).width = 0x0d00 + 120 * 50
sheet.col(4).width = 0x0d00 + 1 * 50
sheet.col(5).width = 0x0d00 + 50 * 50

# URL slugs of the Shanghai districts to crawl
citys = ['pudong', 'minhang', 'baoshan', 'xuhui', 'putuo', 'yangpu', 'changning',
         'songjiang', 'jiading', 'huangpu', 'jinan', 'zhabei', 'hongkou', 'qingpu',
         'fengxian', 'jinshan', 'chongming', 'shanghaizhoubian']


def getHtml(city):
    url = 'http://sh.lianjia.com/ershoufang/%s/' % city
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    request = requests.get(url=url, headers=headers)
    # request.content returns the raw bytes; it handles the encoding better than request.text here
    response = request.content
    # Parse with bs4; 'html.parser' is the built-in parser, 'lxml' also works if installed
    soup = BeautifulSoup(response, 'html.parser')
    # The pagination div (class page-box) carries a page-data attribute holding a JSON blob
    pageDiv = soup.select('div .page-box')[0]
    pageData = dict(pageDiv.contents[0].attrs)['page-data']
    pageDataObj = json.loads(pageData)
    totalPage = pageDataObj['totalPage']
    curPage = pageDataObj['curPage']
    print(pageData)
    # Walk every page from 1 to totalPage
    for i in range(totalPage):
        pageIndex = i + 1
        print(city + "=========================================第 " + str(pageIndex) + " 页")
        print("\n")
        saveData(city, url, pageIndex)


# Parse one page of listings and write the rows into the sheet
def saveData(city, url, pageIndex):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    urlStr = '%spg%s' % (url, pageIndex)
    print(urlStr)
    html = requests.get(urlStr, headers=headers).content
    soup = BeautifulSoup(html, 'lxml')
    liList = soup.findAll("li", {"class": "clear LOGCLICKDATA"})
    print(len(liList))
    for info in liList:
        title = info.find("div", class_="title").find("a").text
        address = info.find("div", class_="address").find("a").text
        flood = info.find("div", class_="flood").text
        subway = info.find("div", class_="tag").findAll("span", {"class", "subway"})
        subway_col = ""
        if len(subway) > 0:
            subway_col = subway[0].text
        taxfree = info.find("div", class_="tag").findAll("span", {"class", "taxfree"})
        taxfree_col = ""
        if len(taxfree) > 0:
            taxfree_col = taxfree[0].text
        priceInfo = info.find("div", class_="priceInfo").find("div", class_="totalPrice").text
        print(flood)
        global row
        sheet.write(row, 0, title)
        sheet.write(row, 1, address)
        sheet.write(row, 2, priceInfo)
        sheet.write(row, 3, flood)
        sheet.write(row, 4, taxfree_col)
        sheet.write(row, 5, subway_col)
        row += 1


# Only runs when this file is executed directly, not when it is imported by another script
if __name__ == '__main__':
    # getHtml('jinshan')
    row = 1
    for i in citys:
        getHtml(i)
    # Save the workbook; with no path given it is written to the current directory
    book.save('lianjia-shanghai.xls')
```

The result looks like this:

![70][]

The approach:

* First grab each district's URL slug and name, join it with the base URL to form a full URL, then loop over the URL list and crawl each district's rental listings.
* While crawling a district, find the largest page number, then iterate over the pages and crawl the listings on each page (the pagination step is sketched right after this list).
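Both scripts read the total page count from the `page-data` attribute that Lianjia puts on the pagination div: the Shanghai script above with `json.loads`, the Beijing script below with a regular expression. A minimal sketch of that step, using a made-up markup snippet in place of a live page:

```python
import json
from bs4 import BeautifulSoup

# Hypothetical snippet of the pagination markup; the real attribute value
# depends on the page being crawled.
html = '<div class="page-box house-lst-page-box" page-data=\'{"totalPage":100,"curPage":1}\'></div>'

page_div = BeautifulSoup(html, 'html.parser').find('div', class_='page-box')
page_data = json.loads(page_div['page-data'])
print(page_data['totalPage'])  # -> 100
```

With a live page you would feed `requests.get(url).content` to BeautifulSoup instead of the literal string.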
Before posting the code, a quick note on the crawler packages used here:

* requests: issues the HTTP requests to lianjia.com.
* lxml: parses the page; XPath expressions are combined with regular expressions to pull out the fields, and it is faster than bs4.

The code is as follows:

```python
import requests
import time
import re
from lxml import etree


# Grab the links of every district in the city
def get_areas(url):
    print('start grabing areas')
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'}
    response = requests.get(url, headers=headers)
    content = etree.HTML(response.text)
    areas = content.xpath("//dd[@data-index = '0']//div[@class='option-list']/a/text()")
    areas_link = content.xpath("//dd[@data-index = '0']//div[@class='option-list']/a/@href")
    for i in range(1, len(areas)):
        area = areas[i]
        area_link = areas_link[i]
        link = 'https://bj.lianjia.com' + area_link
        print("开始抓取页面")
        get_pages(area, link)


# Read the page count of a district, then build and crawl the per-page URLs
def get_pages(area, area_link):
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'}
    response = requests.get(area_link, headers=headers)
    pages = int(re.findall("page-data=\'{\"totalPage\":(\d+),\"curPage\"", response.text)[0])
    print("这个区域有" + str(pages) + "页")
    for page in range(1, pages + 1):
        # Append the page suffix to this district's own link
        url = area_link + 'pg' + str(page)
        print("开始抓取" + str(page) + "的信息")
        get_house_info(area, url)


# Grab the detailed rental info of one page in one district
def get_house_info(area, url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'}
    time.sleep(2)
    try:
        response = requests.get(url, headers=headers)
        content = etree.HTML(response.text)
        for i in range(30):
            title = content.xpath("//div[@class='where']/a/span/text()")[i]
            room_type = content.xpath("//div[@class='where']/span[1]/span/text()")[i]
            square = re.findall("(\d+)", content.xpath("//div[@class='where']/span[2]/text()")[i])[0]
            position = content.xpath("//div[@class='where']/span[3]/text()")[i].replace(" ", "")
            try:
                detail_place = re.findall("([\u4E00-\u9FA5]+)租房",
                                          content.xpath("//div[@class='other']/div/a/text()")[i])[0]
            except Exception as e:
                detail_place = ""
            floor = re.findall("([\u4E00-\u9FA5]+)\(",
                               content.xpath("//div[@class='other']/div/text()[1]")[i])[0]
            total_floor = re.findall("(\d+)", content.xpath("//div[@class='other']/div/text()[1]")[i])[0]
            try:
                house_year = re.findall("(\d+)", content.xpath("//div[@class='other']/div/text()[2]")[i])[0]
            except Exception as e:
                house_year = ""
            price = content.xpath("//div[@class='col-3']/div/span/text()")[i]
            with open('链家北京租房.txt', 'a', encoding='utf-8') as f:
                f.write(area + ',' + title + ',' + room_type + ',' + square + ',' + position + ',' +
                        detail_place + ',' + floor + ',' + total_floor + ',' + price + ',' + house_year + '\n')
            print('writing work has done!continue the next page')
    except Exception as e:
        print('ooops! connecting error, retrying.....')
        time.sleep(20)
        return get_house_info(area, url)


def main():
    print('start!')
    url = 'https://bj.lianjia.com/zufang'
    get_areas(url)


if __name__ == '__main__':
    main()
```

Since unit layouts differ a lot from complex to complex and locations are scattered, each complex still has to be analysed individually.
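The analysis below plots from a pandas DataFrame `df` that the post never shows being built. A minimal sketch of how it could be loaded, assuming the comma-separated file written by `get_house_info` above, with column names taken from the order of that `f.write` call; the charts below use the pyecharts 0.x API (`Line`, `Bar`, `Pie`, `Overlap`):

```python
import pandas as pd
# pyecharts 0.x style imports, matching the chart calls below
from pyecharts import Line, Bar, Pie, Overlap

# Column order follows the f.write(...) call in get_house_info above
columns = ['area', 'title', 'room_type', 'square', 'position',
           'detail_place', 'floor', 'total_floor', 'price', 'house_year']
df = pd.read_csv('链家北京租房.txt', header=None, names=columns, encoding='utf-8')

# square and price were written as text; make them numeric for the grouping below
df['square'] = pd.to_numeric(df['square'], errors='coerce')
df['price'] = pd.to_numeric(df['price'], errors='coerce')
```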
Code:

```python
# 北京路段_房屋均价分布图 — average rent by major Beijing road segment
detail_place = df.groupby(['detail_place'])
house_com = detail_place['price'].agg(['mean', 'count'])
house_com.reset_index(inplace=True)
# Keep the 20 road segments with the most listings
detail_place_main = house_com.sort_values('count', ascending=False)[0:20]
attr = detail_place_main['detail_place']
v1 = detail_place_main['count']
v2 = detail_place_main['mean']

line = Line("北京主要路段房租均价")
line.add("路段", attr, v2, is_stack=True, xaxis_rotate=30, yaxis_min=4.2,
         mark_point=['min', 'max'], xaxis_interval=0, line_color='lightblue',
         line_width=4, mark_point_textcolor='black', mark_point_color='lightblue',
         is_splitline_show=False)

bar = Bar("北京主要路段房屋数量")
bar.add("路段", attr, v1, is_stack=True, xaxis_rotate=30, yaxis_min=4.2,
        xaxis_interval=0, is_splitline_show=False)

overlap = Overlap()
overlap.add(bar)
overlap.add(line, yaxis_index=1, is_add_yaxis=True)
overlap.render('北京路段_房屋均价分布图.html')
```

![2e397c295f6581bb2c64545eab6e4922.jpg][]

**Area & rent follow a stepped distribution**

```python
# 房源价格区间分布图 — number of listings per price bracket
price_info = df[['area', 'price']]
# Bucket the prices
bins = [0, 1000, 1500, 2000, 2500, 3000, 4000, 5000, 6000, 8000, 10000]
level = ['0-1000', '1000-1500', '1500-2000', '2000-2500', '2500-3000',
         '3000-4000', '4000-5000', '5000-6000', '6000-8000', '8000-10000']
price_stage = pd.cut(price_info['price'], bins=bins, labels=level).value_counts().sort_index()

attr = price_stage.index
v1 = price_stage.values

bar = Bar("价格区间&房源数量分布")
bar.add("", attr, v1, is_stack=True, xaxis_rotate=30, yaxis_min=4.2,
        xaxis_interval=0, is_splitline_show=False)

overlap = Overlap()
overlap.add(bar)
overlap.render('价格区间&房源数量分布.html')
```

![75cc585b225f05b6175c07c4d341ea5f.jpg][]

```python
# 房屋面积分布 — distribution of floor area
bins = [0, 30, 60, 90, 120, 150, 200, 300, 400, 700]
level = ['0-30', '30-60', '60-90', '90-120', '120-150', '150-200', '200-300', '300-400', '400+']
df['square_level'] = pd.cut(df['square'], bins=bins, labels=level)

df_digit = df[['area', 'room_type', 'square', 'position', 'total_floor',
               'floor', 'house_year', 'price', 'square_level']]

s = df_digit['square_level'].value_counts()
attr = s.index
v1 = s.values

pie = Pie("房屋面积分布", title_pos='center')
pie.add(
    "",
    attr,
    v1,
    radius=[40, 75],
    label_text_color=None,
    is_label_show=True,
    legend_orient="vertical",
    legend_pos="left",
)

overlap = Overlap()
overlap.add(pie)
overlap.render('房屋面积分布.html')

# 房屋面积&价位分布 — average price per floor-area bracket
bins = [0, 30, 60, 90, 120, 150, 200, 300, 400, 700]
level = ['0-30', '30-60', '60-90', '90-120', '120-150', '150-200', '200-300', '300-400', '400+']
df['square_level'] = pd.cut(df['square'], bins=bins, labels=level)

df_digit = df[['area', 'room_type', 'square', 'position', 'total_floor',
               'floor', 'house_year', 'price', 'square_level']]

square = df_digit[['square_level', 'price']]
prices = square.groupby('square_level').mean().reset_index()
amount = square.groupby('square_level').count().reset_index()

attr = prices['square_level']
v1 = prices['price']

# Quick preview chart (writes the default render.html)
preview = Bar("房屋面积&价位分布")
preview.add("", attr, v1, is_label_show=True)
preview.render()

bar = Bar("房屋面积&价位分布")
bar.add("", attr, v1, is_stack=True, xaxis_rotate=30, yaxis_min=4.2,
        xaxis_interval=0, is_splitline_show=False)

overlap = Overlap()
overlap.add(bar)
overlap.render('房屋面积&价位分布.html')
```

![02d134583a5fabf7b28c3a18e709ae85.jpg][]

Excerpted from: [爬取了上万条租房数据,你还要不要北漂][Link 1]

[https_sh.lianjia.com_zufang]: https://sh.lianjia.com/zufang/
[70]: /images/20220514/0ecaecd9bd5d4a918d91d495ac896c9f.png
[2e397c295f6581bb2c64545eab6e4922.jpg]: /images/20220514/4d8f2c53fe81438b80db3ab61be5b607.png
[75cc585b225f05b6175c07c4d341ea5f.jpg]: /images/20220514/447fc0285f6c433aafb42a97d4b71733.png
[02d134583a5fabf7b28c3a18e709ae85.jpg]: /images/20220514/a3c4f67d639a4d659b25db833859b644.png
[Link 1]: http://bigdata.51cto.com/art/201808/582085.htm