Scraping Free Proxy IPs
Mr.Seaning 博主

A small case study from when I was learning Python web scraping — I hope it helps.

import requests
import time
from bs4 import BeautifulSoup


# Save the validated proxy IPs to a file
def write_proxy(proxies):
    print(proxies)
    with open("ip_proxy.txt", 'a+') as f:
        for proxy in proxies:
            print("Writing:", proxy)
            f.write(proxy + '\n')
    print("All entries saved!")


# Parse the page and extract the proxy IPs listed in its table
def get_proxy(html):
    proxies = []
    bs = BeautifulSoup(html, "html.parser")
    trs = bs.select("tbody > tr")
    for tr in trs:
        # First cell of the row: IP address
        ip = tr.td.get_text()
        # Next <td> cell: port (find_next_sibling("td") skips the
        # whitespace text nodes html.parser keeps between cells)
        port = tr.td.find_next_sibling("td").get_text()
        # Join IP address and port into a full proxy URL
        proxy = "http://" + ip + ":" + port
        proxies.append(proxy)
    # Check which of the scraped proxies actually work
    test_proxies(proxies)


# Validate the scraped IPs: request Baidu through each proxy and judge
# availability by the response status code.
def test_proxies(proxies):
    url = "http://www.baidu.com/"
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36 Edg/88.0.705.74",
    }
    normal_proxies = []
    count = 1
    for proxy in proxies:
        print("Testing proxy #%s..." % count)
        count += 1
        try:
            response = requests.get(url, headers=header,
                                    proxies={"http": proxy}, timeout=1)
            if response.status_code == 200:
                print("Proxy works:", proxy)
                normal_proxies.append(proxy)
            else:
                print("Proxy not usable:", proxy)
        except Exception:
            print("Proxy invalid:", proxy)
    # print(normal_proxies)
    write_proxy(normal_proxies)


# Fetch a listing page and hand its HTML to the parser
def get_html(url):
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36 Edg/88.0.705.74",
    }
    response = requests.get(url, headers=header)
    response.encoding = response.apparent_encoding
    get_proxy(response.text)


if __name__ == "__main__":
    # Crawl the listing pages one by one
    base_url = "https://ip.jiangxianli.com/?page=%s&country=中国"
    for i in range(1, 3):
        url = base_url % i
        get_html(url)
        time.sleep(10)  # sleep 10 seconds before fetching the next page
  • Created: 2021-03-11 11:08:02
  • Link: https://www.seaning.com/73.html
  • License: Unless otherwise noted, all posts on this blog are licensed under BY-NC-SA. Please credit the source when reposting!