Scraping HTML Page Data with Python in a Computer Science Graduation Project

Author: 帮我毕业

Software environment

macOS 10.13.1 (17B1003)
Python 2.7.10
VSCode 1.18.1

Abstract

This article is a practice demo that uses Beautiful Soup to scrape data from a web page.

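Introduction to Beautiful Soup

Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree.

See the official Beautiful Soup documentation (Chinese translation).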

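Features

- Simple: it is a toolbox that parses a document and hands the user the data they want to extract.
- Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.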
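The encoding behaviour described above can be seen in a small experiment. This is a minimal sketch written for Python 3 (an assumption, since the article itself targets Python 2.7); the GBK-encoded sample page is hypothetical and exists only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a tiny page encoded as GBK bytes that declares its charset.
raw = ('<html><head><meta charset="gbk"></head>'
       '<body><p class="title">项目一</p></body></html>').encode('gbk')

soup = BeautifulSoup(raw, 'html.parser')

# Text taken from the tree is a Unicode string, whatever the input encoding was.
title = soup.find(attrs={'class': 'title'}).string
print(type(title).__name__, title)

# encode() with no argument serializes the document as UTF-8 bytes.
utf8_bytes = soup.encode()
print(utf8_bytes.decode('utf-8'))
```

If a page does not declare its encoding, the from_encoding argument of the BeautifulSoup constructor can be used to state it explicitly.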

Installing Beautiful Soup

Install pip (if needed): sudo easy_install pip
Install Beautiful Soup: sudo pip install beautifulsoup4

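Example

This example scrapes a company's investment list page, shown in the following screenshot:

[Figure: screenshot of the investment list page]

Deciding which data to extract

This example extracts the project list. Open Chrome's developer tools and locate the corresponding element, as in the following screenshot:

[Figure: the project list element in Chrome's developer tools]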

Imports

import sys
import json
import urllib2 as HttpUtils
import urllib as UrlUtils
from bs4 import BeautifulSoup

Fetching the page (paginated)

def gethtml(page):
    'Fetch the HTML of the list page for the given page number'
    url = 'https://box.xxx.com/Project/List'
    values = {
        'category': '',
        'rate': '',
        'range': '',
        'page': page
    }
    data = UrlUtils.urlencode(values)
    # Enable debug logging of HTTP and HTTPS traffic
    httphandler = HttpUtils.HTTPHandler(debuglevel=1)
    httpshandler = HttpUtils.HTTPSHandler(debuglevel=1)
    opener = HttpUtils.build_opener(httphandler, httpshandler)
    HttpUtils.install_opener(opener)
    request = HttpUtils.Request(url + '?' + data)
    request.get_method = lambda: 'GET'
    try:
        response = HttpUtils.urlopen(request, timeout=10)
    except HttpUtils.URLError as err:
        if hasattr(err, 'code'):
            print err.code
        if hasattr(err, 'reason'):
            print err.reason
        return None
    else:
        print '====== Http request OK ======'
    return response.read().decode('utf-8')

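TIPS

urlopen(url, data, timeout)
- url: the URL to request
- data: the data to send with the request
- timeout: the timeout in seconds
HttpUtils.build_opener(httphandler, httpshandler)
- Enables logging: the network request log is printed to the debug console, which helps with debugging
Wrap the request in try/except so that network errors can be caught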
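For readers on Python 3, where urllib2 no longer exists, the same request can be sketched with urllib.request. This is a minimal sketch under stated assumptions, not the article's own code: build_list_url and fetch_html are hypothetical helper names, and the URL is the placeholder used in gethtml.

```python
from urllib.error import URLError
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = 'https://box.xxx.com/Project/List'  # placeholder URL from the article


def build_list_url(page):
    """Build the paginated list URL with the same query parameters as gethtml."""
    values = {'category': '', 'rate': '', 'range': '', 'page': page}
    return BASE_URL + '?' + urlencode(values)


def fetch_html(page, timeout=10):
    """Return the decoded page body, or None if the request fails."""
    request = Request(build_list_url(page))  # no data argument, so this is a GET
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.read().decode('utf-8')
    except URLError as err:
        # HTTPError (a URLError subclass) carries .code; both carry .reason.
        print(getattr(err, 'code', None), getattr(err, 'reason', None))
        return None


print(build_list_url(2))
```

Because the query string is appended to the URL and no data argument is passed, the request is sent as a GET without overriding get_method.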

Parsing the fetched data

Creating the BeautifulSoup object

soup = BeautifulSoup(html, 'html.parser')

Getting the object to iterate over

# items is a <listiterator object at 0x10a4b9950>, not a list, but it can still be looped over to visit every child node.
items = soup.find(attrs={'class':'row'}).children
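The .children iterator also yields the whitespace text nodes that sit between elements. One possible way to keep only the element nodes is to filter on Tag, as in this Python 3 sketch; the markup is a hypothetical stand-in for the real list page.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup standing in for the real list page.
html = '''
<div class="row">
  <div class="item"><span class="title"> Project A </span></div>
  <div class="item"><span class="title"> Project B </span></div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
row = soup.find(attrs={'class': 'row'})

# .children yields every direct child, including whitespace-only text nodes.
all_children = list(row.children)

# Keeping only Tag instances drops the text nodes in one step.
items = [child for child in row.children if isinstance(child, Tag)]

print(len(all_children), len(items))
titles = [item.find(attrs={'class': 'title'}).string.strip() for item in items]
print(titles)
```

Filtering by type does not depend on the exact whitespace between elements, unlike a comparison against a specific string.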

Iterating over the child nodes and extracting the required fields

projectList = []
for item in items:
    if item == '\n': continue
    # Extract the required fields
    title = item.find(attrs={'class': 'title'}).string.strip()
    projectId = item.find(attrs={'class': 'subtitle'}).string.strip()
    projectType = item.find(attrs={'class': 'invest-item-subtitle'}).span.string
    percent = item.find(attrs={'class': 'percent'})
    state = 'Open'
    if percent is None: # Funding is complete
        percent = '100%'
        state = 'Finished'
        totalAmount = item.find(attrs={'class': 'project-info'}).span.string.strip()
        investedAmount = totalAmount
    else:
        percent = percent.string.strip()
        state = 'Open'
        decimalList = item.find(attrs={'class': 'decimal-wrap'}).find_all(attrs={'class': 'decimal'})
        totalAmount =  decimalList[0].string
        investedAmount = decimalList[1].string
    investState = item.find(attrs={'class': 'invest-item-type'})
    if investState is not None:
        state = investState.string
    profitSpan = item.find(attrs={'class': 'invest-item-rate'}).find(attrs={'class': 'invest-item-profit'})
    profit1 = profitSpan.next.strip()
    profit2 = profitSpan.em.string.strip()
    profit = profit1 + profit2
    term = item.find(attrs={'class': 'invest-item-maturity'}).find(attrs={'class': 'invest-item-profit'}).string.strip()
    project = {
        'title': title,
        'projectId': projectId,
        'type': projectType,
        'percent': percent,
        'totalAmount': totalAmount,
        'investedAmount': investedAmount,
        'profit': profit,
        'term': term,
        'state': state
    }
    projectList.append(project)
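The loop above calls .string.strip() directly, which raises an error when an element is missing or when a tag has more than one child (in that case .string is None). A small helper can make the lookups more forgiving. This is only a sketch for Python 3: text_of is a hypothetical name, and the sample markup is invented for illustration.

```python
from bs4 import BeautifulSoup


def text_of(node, css_class, default=''):
    """Return the stripped text of the first descendant with css_class, or default."""
    found = node.find(attrs={'class': css_class})
    if found is None:
        return default
    # get_text() joins all nested strings, so it also works when .string would be None.
    return found.get_text(strip=True)


# Hypothetical item: the profit span mixes plain text with an <em> child.
item = BeautifulSoup(
    '<div><span class="invest-item-profit">8.5<em>%</em></span></div>',
    'html.parser')

profit = text_of(item, 'invest-item-profit')
percent = text_of(item, 'percent', default='100%')
print(profit, percent)
```

With such a helper, the profit1 + profit2 concatenation and the None check on percent could each collapse into a single call, at the cost of hiding which case occurred.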

The parsed results are output as shown below:

[Figure: screenshot of the parsed output]
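One way to inspect the result is to serialize the list to JSON with the json module imported earlier. The records below are hypothetical sample values in the same shape as the dictionaries built in the loop, and the sketch is written for Python 3.

```python
import json

# Hypothetical records in the same shape as the dictionaries built in the loop.
project_list = [
    {'title': '项目一', 'projectId': 'P001', 'percent': '100%', 'state': 'Finished'},
    {'title': '项目二', 'projectId': 'P002', 'percent': '45%', 'state': 'Open'},
]

# ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
output = json.dumps(project_list, ensure_ascii=False, indent=2)
print(output)
```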

TIPS

Parsing the HTML relies mainly on Beautiful Soup's main object types: Tag, NavigableString, BeautifulSoup, and Comment. See the official Beautiful Soup documentation (Chinese translation) for details.
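The four object types can be told apart with isinstance. The following Python 3 sketch uses a one-line hypothetical document to show where each type appears.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

soup = BeautifulSoup('<p class="title">Project A<!-- hidden note --></p>', 'html.parser')

p = soup.p            # Tag: an element of the document
text = p.contents[0]  # NavigableString: a piece of text inside a tag
note = p.contents[1]  # Comment: a NavigableString subclass for HTML comments

print(type(soup).__name__, type(p).__name__, type(text).__name__, type(note).__name__)
```

Since Comment is a subclass of NavigableString, code that collects text may want to check for Comment first to avoid treating comments as visible content.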
