Java Crawler - Scraping Douban Rental Listings

柔情只为你懂 2021-12-17 12:04

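I'm planning to move soon. The rental groups on Douban are relatively trustworthy, but the search feature isn't very friendly, so I decided to crawl the posts into a database and filter them with my own SQL. Let's get started!

Every step comes with complete code. If you're interested, follow the process; if you're short on time, scroll to the bottom and copy the final code to try it out.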

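# 1. Get the URL of each page #

First, work out the pattern in the URLs.

Link: [Longgang rental group][Link 1]

Page 1: ![20190708173236504.png][]

Page 2: ![2019070817333217.png][]

It's easy to see that the **start** parameter is the offset of the first post on the page, and that each page shows 25 posts. So we can loop over this parameter, adding 25 each time, to fetch every page.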

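Note: wrap the loop body in a try block. The connect and read timeouts are set to 10 seconds, so a network problem throws an exception, and without the try block that exception would break out of the loop. With it, the worst case is losing that one page. If you want the complete data set, subtract 25 from **pageStrat** in the catch block so that the next iteration requests the same page again.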
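The retry idea from the note can be sketched roughly as follows. This is only an illustration under stated assumptions: the `PageFetcher` interface, the `PageRetrySketch` class, and the `maxRetries` cap are hypothetical names that do not appear in the original code (the note itself does not limit retries; the cap is added here so that a page that always fails cannot stall the loop forever).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for "request one list page"; may throw on timeout.
interface PageFetcher {
    void fetch(int pageStart) throws Exception;
}

class PageRetrySketch {
    static final int PAGE_SIZE = 25;

    // Visits offsets 0, 25, 50, ... below `limit` and returns every offset attempted.
    // On failure the same offset is retried up to `maxRetries` times, then skipped.
    static List<Integer> crawl(int limit, int maxRetries, PageFetcher fetcher) {
        List<Integer> attempts = new ArrayList<>();
        int retries = 0;
        for (int pageStart = 0; pageStart < limit; pageStart += PAGE_SIZE) {
            attempts.add(pageStart);
            try {
                fetcher.fetch(pageStart);
                retries = 0;
            } catch (Exception e) {
                if (retries < maxRetries) {
                    retries++;
                    // Step back one page; the loop's += 25 brings us to the same offset again.
                    pageStart -= PAGE_SIZE;
                } else {
                    retries = 0; // give up on this page and move on
                }
            }
        }
        return attempts;
    }
}
```

In the article's own loop the equivalent change would be a single `pageStrat -= 25;` inside the outer catch block, ideally guarded by a retry counter like the one above.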

    package douban;

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class Main2 {
        public static String DOU_BAN_URL = "https://www.douban.com/group/longgangzufang/discussion?start={pageStart}";

        public static void main(String[] args) {
            int pageStrat = 0;
            while (true) {
                try {
                    String pageUrl = DOU_BAN_URL.replace("{pageStart}", pageStrat + "");
                    System.out.println("Current page: " + pageUrl);
                    HttpURLConnection connection = (HttpURLConnection) new URL(pageUrl).openConnection();
                    // Set the request method
                    connection.setRequestMethod("GET");
                    // 10-second connect and read timeouts
                    connection.setConnectTimeout(10000);
                    connection.setReadTimeout(10000);
                    // Connect
                    connection.connect();
                    // Get the response code
                    int responseCode = connection.getResponseCode();
                    if (responseCode == HttpURLConnection.HTTP_OK) {
                        // Read the response body as UTF-8 text
                        StringBuilder returnStr = new StringBuilder();
                        try (BufferedReader reader = new BufferedReader(
                                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
                            String line;
                            while ((line = reader.readLine()) != null) {
                                returnStr.append(line).append("\r\n");
                            }
                        }
                        System.out.println(returnStr);
                    }
                    // Everything is done for this page, so release the connection
                    connection.disconnect();
                } catch (Exception e) {
                    e.printStackTrace();
                }
                pageStrat += 25;
            }
        }
    }

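Run the program and you get the full HTML of each page, stored in the variable **returnStr**.

# 2. Extract the post detail URLs from each page's HTML #

Next, inspect the URL of each post's detail page and extract it with a regular expression.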

![20190708174532788.png][]

Two classes are used here:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

You'll have to learn regular expressions on your own. The code that extracts the matching URLs is posted below.
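As a rough sketch of the extraction step on its own, the snippet below applies the same link pattern used in the article's listings to an HTML string. The class name `TopicLinkExtractor` and the method name `extract` are assumptions made for illustration; only the regular expression comes from the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TopicLinkExtractor {
    // Same pattern as in the article's code: group 1 captures the href of a topic link.
    // [^"]* also matches line breaks, so the link text may span several lines.
    private static final Pattern TOPIC_LINK =
            Pattern.compile("<a href=\"([^\"]*)\" title=\"[^\"]*\" class=\"\">[^\"]*</a>");

    // Returns the href of every topic link found in the given list-page HTML.
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = TOPIC_LINK.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }
}
```

Links without a `title` attribute, such as author profile links, do not match, which is why only the post URLs come back. The pattern depends on the exact attribute order in Douban's markup at the time, so it may need adjusting if the page changes.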

Log output from a run:

![watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MjIxMTYwMQ_size_16_color_FFFFFF_t_70][]

# 3. Open each post's detail page and scrape the title and content #

This step works just like step one: loop over the detail pages, then analyze the content of each one.

    package douban;

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Main2 {
        public static String DOU_BAN_URL = "https://www.douban.com/group/longgangzufang/discussion?start={pageStart}";

        public static void main(String[] args) {
            int pageStrat = 0;
            while (true) {
                try {
                    String pageUrl = DOU_BAN_URL.replace("{pageStart}", pageStrat + "");
                    System.out.println("Current page: " + pageUrl);
                    String returnStr = fetch(pageUrl);
                    if (returnStr != null) {
                        // Group 1 captures the href of every topic link on the list page
                        Pattern p = Pattern.compile("<a href=\"([^\"]*)\" title=\"[^\"]*\" class=\"\">[^\"]*</a>");
                        Matcher m = p.matcher(returnStr);
                        while (m.find()) {
                            Thread.sleep(1000);
                            try {
                                String tempUrlStr = m.group(1);
                                System.out.println("\t\tCurrent link: " + tempUrlStr);
                                String tempReturnStr = fetch(tempUrlStr);
                                if (tempReturnStr != null) {
                                    // Matches the fields in the script block of the detail page
                                    Pattern p2 = Pattern.compile("\"text\": \"([^\"]*)\",\r\n" +
                                            "\t\"name\": \"([^\"]*)\",\r\n" +
                                            "\t\"url\": \"([^\"]*)\",\r\n" +
                                            "  \"commentCount\": \"[^\"]*\",\r\n" +
                                            "  \"dateCreated\": \"([^\"]*)\",");
                                    Matcher m2 = p2.matcher(tempReturnStr);
                                    while (m2.find()) {
                                        System.out.println(m2.group(1)); // content
                                        System.out.println(m2.group(2)); // title
                                        System.out.println(m2.group(3)); // post URL
                                        System.out.println(m2.group(4)); // creation time
                                    }
                                }
                            } catch (Exception e) {
                                e.printStackTrace();
                            }
                        }
                        System.out.println("Next page");
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
                pageStrat += 25;
            }
        }

        // Fetches a URL and returns the response body, or null unless the response is HTTP 200
        static String fetch(String urlStr) throws IOException {
            HttpURLConnection connection = (HttpURLConnection) new URL(urlStr).openConnection();
            try {
                // Set the request method
                connection.setRequestMethod("GET");
                // 10-second connect and read timeouts
                connection.setConnectTimeout(10000);
                connection.setReadTimeout(10000);
                // Connect and check the response code
                connection.connect();
                if (connection.getResponseCode() != HttpURLConnection.HTTP_OK) {
                    return null;
                }
                // Read the response body as UTF-8 text
                StringBuilder body = new StringBuilder();
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        body.append(line).append("\r\n");
                    }
                }
                return body.toString();
            } finally {
                // Everything is done for this request, so release the connection
                connection.disconnect();
            }
        }
    }

Looking at this part of the HTML, we can see:

![watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MjIxMTYwMQ_size_16_color_FFFFFF_t_70 1][]

Right, time to write a regex! But while reading the HTML, I found a block of JS that already contains all the information:

![watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MjIxMTYwMQ_size_16_color_FFFFFF_t_70 2][]

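Nice. In the markup above the fields are scattered far apart, whereas here the data sits in one place and the regex is easier to write, so I matched it with the regex shown in the code above.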
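The regex in the listing above matches exact line breaks and indentation, so it stops matching if the page's whitespace changes. A slightly more tolerant variant, using `\s*` between tokens, might look like the sketch below. The class name `TopicFieldExtractor` is hypothetical, and the field order (`text`, `name`, `url`, `commentCount`, `dateCreated`) is simply taken from the original pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TopicFieldExtractor {
    // Same fields and order as the article's pattern, but any whitespace is allowed between tokens.
    private static final Pattern FIELDS = Pattern.compile(
            "\"text\":\\s*\"([^\"]*)\",\\s*"
            + "\"name\":\\s*\"([^\"]*)\",\\s*"
            + "\"url\":\\s*\"([^\"]*)\",\\s*"
            + "\"commentCount\":\\s*\"[^\"]*\",\\s*"
            + "\"dateCreated\":\\s*\"([^\"]*)\"");

    // Returns {title, content, url, dateCreated}, or null if the block is not found.
    static String[] extract(String html) {
        Matcher m = FIELDS.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(2), m.group(1), m.group(3), m.group(4) };
    }
}
```

Like the original, this still assumes that no field value contains a double quote; a post body with an escaped quote would need a real JSON parser instead of a regex.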

Print the results:

![watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MjIxMTYwMQ_size_16_color_FFFFFF_t_70 3][]

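That's basically it. All that's left is to store the title, content, URL, and post time in the database.

# 4. Anti-crawling!!! #

Once it was running, I assumed there was no anti-crawling protection, but after a little over 700 posts my IP was banned. I stopped for two days and tried again once the ban was lifted.

Add the Cookie, Host, Referer, and User-Agent parameters to the **Request Headers**. Their values can be copied straight from a Douban page opened in the browser. If you still get banned, log in first and then copy the cookie from the page.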

![watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MjIxMTYwMQ_size_16_color_FFFFFF_t_70 4][]

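Some sites base their anti-crawling checks on exactly these request header fields, so it's worth getting to know them well; that helps in getting past a site's anti-crawling strategy.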
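A minimal sketch of setting these headers with `HttpURLConnection` is shown below. The class name `BrowserLikeRequest` and the method `open` are hypothetical, and the header values are whatever you copy from your own browser. `Host` is left out here because `HttpURLConnection` fills it in from the URL.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

class BrowserLikeRequest {
    // Prepares (but does not send) a GET request that carries browser-like headers.
    static HttpURLConnection open(String url, String cookie, String referer, String userAgent)
            throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        // 10-second connect and read timeouts, as in the article's code
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        conn.setRequestProperty("Cookie", cookie);
        conn.setRequestProperty("Referer", referer);
        conn.setRequestProperty("User-Agent", userAgent);
        return conn; // the caller calls connect() or getResponseCode() to actually send it
    }
}
```

Request properties have to be set before the connection is opened; calling `setRequestProperty` after `connect()` throws an `IllegalStateException`.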

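You can also pause inside the loop with **Thread.sleep(1000);**

Some anti-crawling strategies limit the number of requests within a time window, and pausing also keeps the request rate from putting pressure on the target site. After all, we're collecting data, not attacking the target. It's also best to run crawlers in the middle of the night so they don't compete with regular users for resources.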
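One possible refinement of the fixed pause is to add a small random extra delay so that requests are not perfectly evenly spaced. This is not something the original code does; the class `PoliteDelay` and both of its methods are hypothetical names used only for this sketch.

```java
import java.util.concurrent.ThreadLocalRandom;

class PoliteDelay {
    // Returns baseMillis plus a random extra between 0 and jitterMillis (inclusive).
    // jitterMillis must not be negative.
    static long nextDelayMillis(long baseMillis, long jitterMillis) {
        return baseMillis + ThreadLocalRandom.current().nextLong(jitterMillis + 1);
    }

    // Sleeps for a randomized interval between two requests.
    static void pause(long baseMillis, long jitterMillis) throws InterruptedException {
        Thread.sleep(nextDelayMillis(baseMillis, jitterMillis));
    }
}
```

Calling `PoliteDelay.pause(1000, 500)` in place of `Thread.sleep(1000)` would wait between 1 and 1.5 seconds before each detail-page request.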

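# 5. Complete code #

Here is the final version, with the request headers added to get past the anti-crawling checks and JDBC used to persist the data to the database.

I only stored four fields: title, body, original URL, and post time.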
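Before the four fields are written, the final code cleans two of them: it strips characters outside the Basic Multilingual Plane (emoji, for example) from the title and body, presumably because the MySQL column's character set cannot store them, and it replaces the `T` in the timestamp with a space. The sketch below isolates those two transformations; the class and method names are hypothetical.

```java
class HouseRowCleaner {
    // Removes supplementary-plane characters (code points U+10000 and above), e.g. emoji.
    static String stripSupplementary(String s) {
        return s.replaceAll("[\\x{10000}-\\x{10FFFF}]", "");
    }

    // Turns a timestamp like 2019-07-08T17:32:36 into 2019-07-08 17:32:36.
    static String toSqlDateTime(String isoTimestamp) {
        return isoTimestamp.replace("T", " ");
    }
}
```

Ordinary Chinese text is unaffected by `stripSupplementary`, since common CJK characters sit inside the Basic Multilingual Plane.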

    package douban;

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Main {
        public static String DOU_BAN_URL = "https://www.douban.com/group/longgangzufang/discussion?start={pageStart}";

        public static void main(String[] args) throws ClassNotFoundException, SQLException {
            Class.forName("com.mysql.jdbc.Driver");
            String sqlUrl = "jdbc:mysql://localhost:3306/douban?characterEncoding=UTF-8";
            Connection conn = DriverManager.getConnection(sqlUrl, "root", "VisionKi");
            // Parameterized insert, so a quote inside a title or body cannot break the SQL
            PreparedStatement stat = conn.prepareStatement(
                    "INSERT INTO house(title,content,time,house_url) VALUES (?,?,?,?)");

            // Group 1 captures the href of every topic link on a list page
            Pattern p = Pattern.compile("<a href=\"([^\"]*)\" title=\"[^\"]*\" class=\"\">[^\"]*</a>");
            // Matches the fields in the script block of a detail page
            Pattern p2 = Pattern.compile("\"text\": \"([^\"]*)\",\r\n" +
                    "\t\"name\": \"([^\"]*)\",\r\n" +
                    "\t\"url\": \"([^\"]*)\",\r\n" +
                    "  \"commentCount\": \"[^\"]*\",\r\n" +
                    "  \"dateCreated\": \"([^\"]*)\",");

            int pageStrat = 0;
            while (true) {
                try {
                    String pageUrl = DOU_BAN_URL.replace("{pageStart}", pageStrat + "");
                    System.out.println("Current page: " + pageUrl);
                    String returnStr = fetch(pageUrl);
                    if (returnStr != null) {
                        Matcher m = p.matcher(returnStr);
                        while (m.find()) {
                            Thread.sleep(500);
                            try {
                                String tempUrlStr = m.group(1);
                                System.out.println("\t\tCurrent link: " + tempUrlStr);
                                String tempReturnStr = fetch(tempUrlStr);
                                if (tempReturnStr != null) {
                                    Matcher m2 = p2.matcher(tempReturnStr);
                                    while (m2.find()) {
                                        // Strip characters outside the BMP (such as emoji) from title and content
                                        stat.setString(1, m2.group(2).replaceAll("[\\x{10000}-\\x{10FFFF}]", ""));
                                        stat.setString(2, m2.group(1).replaceAll("[\\x{10000}-\\x{10FFFF}]", ""));
                                        // 2019-07-08T17:32:36 -> 2019-07-08 17:32:36
                                        stat.setString(3, m2.group(4).replace("T", " "));
                                        stat.setString(4, m2.group(3));
                                        stat.executeUpdate();
                                    }
                                }
                            } catch (Exception e) {
                                e.printStackTrace();
                            }
                        }
                        System.out.println("Next page");
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
                pageStrat += 25;
            }
        }

        // Fetches a URL with browser-like headers; returns the body, or null unless the response is HTTP 200
        static String fetch(String urlStr) throws IOException {
            HttpURLConnection connection = (HttpURLConnection) new URL(urlStr).openConnection();
            try {
                // Set the request method
                connection.setRequestMethod("GET");
                // 10-second connect and read timeouts
                connection.setConnectTimeout(10000);
                connection.setReadTimeout(10000);
                connection.setRequestProperty("Cookie", "paste the cookie copied via F12 after logging in, as described above");
                connection.setRequestProperty("Host", "www.douban.com");
                connection.setRequestProperty("Referer", "https://www.douban.com/group/longgangzufang/discussion?start=25");
                connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36");
                // Connect and check the response code
                connection.connect();
                if (connection.getResponseCode() != HttpURLConnection.HTTP_OK) {
                    return null;
                }
                // Read the response body as UTF-8 text
                StringBuilder body = new StringBuilder();
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        body.append(line).append("\r\n");
                    }
                }
                return body.toString();
            } finally {
                // Everything is done for this request, so release the connection
                connection.disconnect();
            }
        }
    }

![2019070909115424.png][]

After running for a while it had crawled more than 4,000 posts, and the IP was not banned again.

[Link 1]: https://www.douban.com/group/longgangzufang/discussion?start=0
[20190708173236504.png]: /images/20211213/1d8d5b0d11b74b92aec9afc84be338b0.png
[2019070817333217.png]: /images/20211213/156f43b0fddf4f3da6a6bca44bd2f9ab.png
[20190708174532788.png]: /images/20211213/c0962cf37e06426b815b064c491b884e.png
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MjIxMTYwMQ_size_16_color_FFFFFF_t_70]: /images/20211213/6c15ead4f7184d6b81f3522d9a4e0781.png
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MjIxMTYwMQ_size_16_color_FFFFFF_t_70 1]: /images/20211213/6e08dde777434f8abf193de576e07bab.png
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MjIxMTYwMQ_size_16_color_FFFFFF_t_70 2]: /images/20211213/88ff743b1ed542619f72ff17454733ff.png
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MjIxMTYwMQ_size_16_color_FFFFFF_t_70 3]: /images/20211213/ea88133dd8ba4980b6c446f7f1a19f81.png
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MjIxMTYwMQ_size_16_color_FFFFFF_t_70 4]: /images/20211213/d90721ba5d9044378fc7a61563a1b65b.png
[2019070909115424.png]: /images/20211213/1fc0b5b6e8474813b6d7cdccc0f9d6b7.png