2025-02-20 16:49:53
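Dynamic delay means adjusting the interval between requests at run time, based on the crawler's runtime environment and the API's responses. Compared with a static delay (a fixed interval), it adapts more flexibly to an API's rate-limiting policy while maximizing the crawler's efficiency.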
In a Java crawler, dynamic delay can be implemented with the following strategies:
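The API's response time is a useful signal for tuning the delay. A short response time suggests that the current request rate is low, so the delay can be reduced; a long response time suggests that the crawler may be approaching the API's limit, so the delay should be increased.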
Many APIs return a specific status code, such as 429 Too Many Requests, once the rate limit is reached. The crawler can adjust its delay based on these codes.
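The sliding window is a common flow-control algorithm. It adjusts the request rate dynamically so that the number of requests within a given time window never exceeds the API's limit.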
The following example implements a dynamic delay based on API response time and sends its requests through a proxy server:
import java.io.IOException;
import java.net.Authenticator;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.PasswordAuthentication;
import java.net.Proxy;
import java.net.URL;
import java.util.concurrent.TimeUnit;

public class DynamicDelayCrawlerWithProxy {
    private static final String PROXY_HOST = ""; // proxy host (blank in the original; fill in your own)
    private static final int PROXY_PORT = 5445;
    private static final String PROXY_USER = "16QMSOML";
    private static final String PROXY_PASS = "280651";
    private static final int MIN_DELAY = 100;            // minimum delay (ms)
    private static final int MAX_DELAY = 5000;           // maximum delay (ms)
    private static final int TARGET_RESPONSE_TIME = 500; // target response time (ms)

    public static void main(String[] args) {
        String apiUrl = ""; // target API URL (blank in the original; fill in your own)
        int delay = MIN_DELAY;

        // Proxy credentials: HttpURLConnection obtains them from the default
        // Authenticator, not from http.proxyUser/http.proxyPassword properties.
        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                if (getRequestorType() == RequestorType.PROXY) {
                    return new PasswordAuthentication(PROXY_USER, PROXY_PASS.toCharArray());
                }
                return null;
            }
        });
        Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(PROXY_HOST, PROXY_PORT));

        while (true) {
            long startTime = System.currentTimeMillis();
            HttpURLConnection connection = null;
            try {
                // Send the request through the proxy
                URL url = new URL(apiUrl);
                connection = (HttpURLConnection) url.openConnection(proxy);
                connection.setRequestMethod("GET");
                connection.setConnectTimeout(5000);
                connection.setReadTimeout(5000);
                connection.connect();
                int responseCode = connection.getResponseCode();
                if (responseCode == 200) {
                    // Success: process the response data here
                    System.out.println("Data fetched successfully.");
                } else {
                    System.out.println("Failed to fetch data. Response Code: " + responseCode);
                }
            } catch (IOException e) {
                System.out.println("Error occurred: " + e.getMessage());
            } finally {
                if (connection != null) {
                    connection.disconnect();
                }
            }
            long responseTime = System.currentTimeMillis() - startTime;

            // Adjust the delay according to the measured response time
            if (responseTime < TARGET_RESPONSE_TIME) {
                delay = Math.max(MIN_DELAY, delay - 100); // fast response: shorten the delay
            } else {
                delay = Math.min(MAX_DELAY, delay + 100); // slow response: lengthen the delay
            }

            // Wait before the next request
            try {
                TimeUnit.MILLISECONDS.sleep(delay);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag and stop crawling
                return;
            }
        }
    }
}
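The adjustment rule in the example above can be separated from the networking code, which makes it easier to check in isolation. The following is a minimal sketch and not part of the original example: the class name `ResponseTimeDelay`, the method `nextDelay`, and the `STEP` constant are hypothetical, while the other constants mirror the values used above.

```java
/** Hypothetical helper: the response-time rule from the example above as a pure function. */
final class ResponseTimeDelay {
    static final int MIN_DELAY = 100;            // minimum delay (ms)
    static final int MAX_DELAY = 5000;           // maximum delay (ms)
    static final int TARGET_RESPONSE_TIME = 500; // target response time (ms)
    static final int STEP = 100;                 // adjustment step (ms), assumed from the example

    /** Returns the delay to use before the next request, clamped to [MIN_DELAY, MAX_DELAY]. */
    static int nextDelay(int currentDelay, long responseTimeMs) {
        if (responseTimeMs < TARGET_RESPONSE_TIME) {
            return Math.max(MIN_DELAY, currentDelay - STEP); // fast response: shorten
        }
        return Math.min(MAX_DELAY, currentDelay + STEP);     // slow response: lengthen
    }

    public static void main(String[] args) {
        int delay = MIN_DELAY;
        long[] observed = {120, 800, 900, 300}; // made-up response times (ms)
        for (long rt : observed) {
            delay = nextDelay(delay, rt);
            System.out.println("response=" + rt + " ms -> delay=" + delay + " ms");
        }
        // Prints delays of 100, 200, 300, 200 ms in that order.
    }
}
```

Because the step is additive in both directions, the delay recovers as slowly as it grows. Whether that is the right trade-off depends on the API being crawled.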
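When the API returns 429, the request rate is too high. The crawler can then increase the delay until the API responds normally again. The following example implements a dynamic delay driven by status codes, doubling the delay on 429 and halving it after a successful request: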
import java.io.IOException;
import java.net.Authenticator;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.PasswordAuthentication;
import java.net.Proxy;
import java.net.URL;
import java.util.concurrent.TimeUnit;

public class ErrorBasedDynamicDelayCrawlerWithProxy {
    private static final String PROXY_HOST = ""; // proxy host (blank in the original; fill in your own)
    private static final int PROXY_PORT = 5445;
    private static final String PROXY_USER = "16QMSOML";
    private static final String PROXY_PASS = "280651";
    private static final int MIN_DELAY = 100;     // minimum delay (ms)
    private static final int MAX_DELAY = 5000;    // maximum delay (ms)
    private static final int INITIAL_DELAY = 500; // initial delay (ms)

    public static void main(String[] args) {
        String apiUrl = ""; // target API URL (blank in the original; fill in your own)
        int delay = INITIAL_DELAY;

        // Proxy credentials: HttpURLConnection obtains them from the default
        // Authenticator, not from http.proxyUser/http.proxyPassword properties.
        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                if (getRequestorType() == RequestorType.PROXY) {
                    return new PasswordAuthentication(PROXY_USER, PROXY_PASS.toCharArray());
                }
                return null;
            }
        });
        Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(PROXY_HOST, PROXY_PORT));

        while (true) {
            HttpURLConnection connection = null;
            try {
                URL url = new URL(apiUrl);
                connection = (HttpURLConnection) url.openConnection(proxy);
                connection.setRequestMethod("GET");
                connection.setConnectTimeout(5000);
                connection.setReadTimeout(5000);
                connection.connect();
                int responseCode = connection.getResponseCode();
                if (responseCode == 200) {
                    // Success: process the response data, then halve the delay
                    System.out.println("Data fetched successfully.");
                    delay = Math.max(MIN_DELAY, delay / 2);
                } else if (responseCode == 429) {
                    // Rate limit hit: double the delay
                    System.out.println("Rate limit exceeded. Increasing delay.");
                    delay = Math.min(MAX_DELAY, delay * 2);
                } else {
                    System.out.println("Failed to fetch data. Response Code: " + responseCode);
                }
            } catch (IOException e) {
                System.out.println("Error occurred: " + e.getMessage());
            } finally {
                if (connection != null) {
                    connection.disconnect();
                }
            }

            // Wait before the next request
            try {
                TimeUnit.MILLISECONDS.sleep(delay);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag and stop crawling
                return;
            }
        }
    }
}
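A 429 response may also carry a `Retry-After` header that tells the client how long to wait. The sketch below shows one way the doubling rule from the example above could take that hint into account. It is an illustration under assumptions, not part of the original example: the class name `RetryAfterBackoff` and the method `delayAfter429` are hypothetical, only the delay-in-seconds form of the header is parsed, and a server hint longer than `MAX_DELAY` is allowed to exceed the cap.

```java
/** Hypothetical helper: picks the delay after a 429, preferring the server's Retry-After hint. */
final class RetryAfterBackoff {
    static final long MAX_DELAY = 5000; // cap for the doubling rule (ms), as in the example above

    /**
     * @param currentDelayMs the delay currently in use (ms)
     * @param retryAfter     raw Retry-After header value, or null if the header is absent
     * @return the delay to wait before the next request (ms)
     */
    static long delayAfter429(long currentDelayMs, String retryAfter) {
        long doubled = Math.min(MAX_DELAY, currentDelayMs * 2);
        if (retryAfter != null) {
            try {
                long seconds = Long.parseLong(retryAfter.trim());
                if (seconds >= 0) {
                    // Never wait less than the server asked for
                    return Math.max(doubled, seconds * 1000);
                }
            } catch (NumberFormatException ignored) {
                // The HTTP-date form of Retry-After is not handled in this sketch
            }
        }
        return doubled;
    }

    public static void main(String[] args) {
        System.out.println(delayAfter429(500, null)); // no header: doubling only
        System.out.println(delayAfter429(500, "3"));  // server asks for 3 seconds
    }
}
```

In the crawler loop, the header value could be read with `connection.getHeaderField("Retry-After")`, which returns `null` when the header is absent.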
The following example implements a dynamic delay based on the sliding-window algorithm, which keeps the number of requests within a time window from exceeding the API's limit:
import java.io.IOException;
import java.net.Authenticator;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.PasswordAuthentication;
import java.net.Proxy;
import java.net.URL;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.TimeUnit;

public class SlidingWindowCrawlerWithProxy {
    private static final String PROXY_HOST = ""; // proxy host (blank in the original; fill in your own)
    private static final int PROXY_PORT = 5445;
    private static final String PROXY_USER = "16QMSOML";
    private static final String PROXY_PASS = "280651";
    private static final int WINDOW_SIZE = 60000;           // time window size (ms)
    private static final int MAX_REQUESTS_PER_WINDOW = 100; // maximum requests per window
    private static final ConcurrentLinkedQueue<Long> requestTimes = new ConcurrentLinkedQueue<>();

    public static void main(String[] args) {
        String apiUrl = ""; // target API URL (blank in the original; fill in your own)

        // Proxy credentials: HttpURLConnection obtains them from the default
        // Authenticator, not from http.proxyUser/http.proxyPassword properties.
        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                if (getRequestorType() == RequestorType.PROXY) {
                    return new PasswordAuthentication(PROXY_USER, PROXY_PASS.toCharArray());
                }
                return null;
            }
        });
        Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(PROXY_HOST, PROXY_PORT));

        while (true) {
            // Drop request records that have fallen out of the time window
            long currentTime = System.currentTimeMillis();
            while (!requestTimes.isEmpty() && currentTime - requestTimes.peek() > WINDOW_SIZE) {
                requestTimes.poll();
            }

            // If the window is full, wait until the oldest record expires, then re-check
            if (requestTimes.size() >= MAX_REQUESTS_PER_WINDOW) {
                long delay = WINDOW_SIZE - (currentTime - requestTimes.peek());
                System.out.println("Rate limit exceeded. Waiting for " + delay + " ms.");
                try {
                    TimeUnit.MILLISECONDS.sleep(delay);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // restore the flag and stop crawling
                    return;
                }
                continue;
            }

            // Send the request through the proxy
            HttpURLConnection connection = null;
            try {
                URL url = new URL(apiUrl);
                connection = (HttpURLConnection) url.openConnection(proxy);
                connection.setRequestMethod("GET");
                connection.setConnectTimeout(5000);
                connection.setReadTimeout(5000);
                connection.connect();
                int responseCode = connection.getResponseCode();
                if (responseCode == 200) {
                    // Success: process the response data here
                    System.out.println("Data fetched successfully.");
                } else {
                    System.out.println("Failed to fetch data. Response Code: " + responseCode);
                }
            } catch (IOException e) {
                System.out.println("Error occurred: " + e.getMessage());
            } finally {
                if (connection != null) {
                    connection.disconnect();
                }
            }

            // Record the time of this request
            requestTimes.add(System.currentTimeMillis());
        }
    }
}
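The window bookkeeping in the example above can also be kept in a small class that takes the current time as a parameter, so the logic can be exercised without sleeping or sending requests. This is a minimal sketch and not part of the original example: the class name `SlidingWindowLimiter` and the method `tryAcquire` are hypothetical, and the class is not thread-safe as written.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical single-threaded sliding-window limiter with an injected clock value. */
final class SlidingWindowLimiter {
    private final long windowMs;
    private final int maxRequests;
    private final Deque<Long> times = new ArrayDeque<>(); // timestamps of admitted requests

    SlidingWindowLimiter(long windowMs, int maxRequests) {
        this.windowMs = windowMs;
        this.maxRequests = maxRequests;
    }

    /**
     * Returns 0 and records the request if it is allowed at nowMs;
     * otherwise returns how many milliseconds to wait before retrying.
     */
    long tryAcquire(long nowMs) {
        // Evict timestamps that are at least one full window old
        while (!times.isEmpty() && nowMs - times.peekFirst() >= windowMs) {
            times.pollFirst();
        }
        if (times.size() < maxRequests) {
            times.addLast(nowMs);
            return 0;
        }
        // Window is full: wait until the oldest timestamp expires
        return windowMs - (nowMs - times.peekFirst());
    }

    public static void main(String[] args) {
        SlidingWindowLimiter limiter = new SlidingWindowLimiter(1000, 2);
        System.out.println(limiter.tryAcquire(0));    // 0   (allowed)
        System.out.println(limiter.tryAcquire(100));  // 0   (allowed)
        System.out.println(limiter.tryAcquire(200));  // 800 (window full, wait 800 ms)
        System.out.println(limiter.tryAcquire(1000)); // 0   (request at t=0 has expired)
    }
}
```

In a crawler loop, a caller would pass `System.currentTimeMillis()`, sleep for the returned number of milliseconds when it is nonzero, and then call `tryAcquire` again.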
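In Java crawler development, dynamic delay is a key technique for staying clear of API limits, and a proxy server further improves the crawler's stability and security. With a delay strategy based on API response time, status codes, or a sliding window, a crawler can fetch data efficiently without triggering the API's rate limits.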