Some web pages load JavaScript to protect their page elements. Once we get past the JS and obtain the page locally, we can use the BS4 library to parse the page, extract the relevant elements, and pull together the valuable content.
Example code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Send the request and fetch the page content
url = 'your_local_or_online_page_url'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# An empty list to hold the extracted data
data = []

# Iterate over the project rows; this assumes each project's data sits in its own parent element
projects = soup.find_all('tr', class_='project-id')  # adjust the selector to the actual page

# Extract each project's fields
for project in projects:
    # Get the project ID
    project_id = project.find('td', class_='project-id-class').get_text(strip=True)  # adjust to the actual selector
    # Append the extracted data to the list
    data.append([project_id])  # adjust as needed

# Build a DataFrame and save it as Excel
df = pd.DataFrame(data, columns=['ID'])  # adjust as needed
df.to_excel('projects_data.xlsx', index=False)
print("Data has been successfully extracted and saved to 'projects_data.xlsx'.")
This relies mainly on the BS4 library.
Illustrative code:
from bs4 import BeautifulSoup

# Suppose we have an HTML document
html_doc = """
<html>
<head><title>Example Page</title></head>
<body>
<p class="title"><b>Sample Page</b></p>
<p class="story">This is a test story. <a href="http://example.com/1" class="link">link1</a> <a href="http://example.com/2" class="link">link2</a></p>
<p class="story">Another test story.</p>
</body>
</html>
"""

# Parse the HTML document with BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

# Extract the contents of the <title> tag
title = soup.title.string
print(f"Title: {title}")

# Extract all the links (<a> tags)
links = soup.find_all('a')
for link in links:
    print(f"Link text: {link.string}, URL: {link['href']}")

# Find the <p> tags with a specific class
story_paragraphs = soup.find_all('p', class_='story')
for p in story_paragraphs:
    print(f"Story paragraph: {p.get_text()}")
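Since html_doc is fixed, the snippet's output is deterministic:

Title: Example Page
Link text: link1, URL: http://example.com/1
Link text: link2, URL: http://example.com/2
Story paragraph: This is a test story. link1 link2
Story paragraph: Another test story.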