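4. Parsing the Page Source

Once the page source has been retrieved, it has to be parsed to extract the data you need. The HTML DOM is the usual tool for this. The routine below loads the Baidu home page in Internet Explorer and reads the current value of the search box.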

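Sub GetSearchBox()
    Dim objIE As Object
    Dim objDoc As Object
    ' Late-bound Internet Explorer automation object
    Set objIE = CreateObject("InternetExplorer.Application")
    objIE.Visible = False
    objIE.Navigate "http://www.baidu.com"
    ' Wait until the page has finished loading (4 = READYSTATE_COMPLETE)
    Do While objIE.Busy Or objIE.readyState <> 4
        DoEvents
    Loop
    Set objDoc = objIE.document
    ' "kw" is assumed to be the id of Baidu's search box; check the live page
    Debug.Print objDoc.getElementById("kw").Value
    objIE.Quit
End Sub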
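
Internet Explorer is not the only way to obtain a DOM. As a minimal sketch, assuming the HTML is already held in a string (for example from an HTTP response), the late-bound `htmlfile` object can parse it without opening a browser window. The function name `ExtractLinks` and the sample markup are hypothetical and only serve to illustrate the idea.

```vba
' Hypothetical helper: returns the href of every <a> element found in an HTML string.
Function ExtractLinks(ByVal html As String) As Collection
    Dim doc As Object
    Dim a As Object
    Dim result As New Collection

    ' "htmlfile" creates an in-memory MSHTML document; no browser window is shown
    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = html

    For Each a In doc.getElementsByTagName("a")
        ' Anchors without an href attribute report an empty string; skip them
        If Len(a.href) > 0 Then result.Add CStr(a.href)
    Next a

    Set ExtractLinks = result
End Function

Sub DemoExtractLinks()
    Dim link As Variant
    For Each link In ExtractLinks("<a href=""http://example.com/"">example</a>")
        Debug.Print link
    Next link
End Sub
```

Because the document never runs inside a browser, scripts on the page are not executed; content that a site builds with JavaScript will not appear in this DOM.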

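5. Automating Page Interaction

Parsing a page sometimes requires automated interaction first, such as filling in a form or clicking a button. The routine below types a keyword into the search box and clicks the search button.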

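Sub SearchKeyword(ByVal keyword As String)
    Dim objIE As Object
    Dim objDoc As Object
    Set objIE = CreateObject("InternetExplorer.Application")
    objIE.Visible = False
    objIE.Navigate "http://www.baidu.com"
    ' Wait until the page has finished loading (4 = READYSTATE_COMPLETE)
    Do While objIE.Busy Or objIE.readyState <> 4
        DoEvents
    Loop
    Set objDoc = objIE.document
    ' "kw" (search box) and "su" (search button) are assumed element ids;
    ' check them against the live page
    objDoc.getElementById("kw").Value = keyword
    objDoc.getElementById("su").Click
    ' Call objIE.Quit once the browser instance is no longer needed
End Sub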
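
The routines in this article all repeat the same wait loop, and that loop never ends if a page stalls. One possible refinement, sketched here under the assumption that a fixed timeout is acceptable, is to move the loop into a helper. The name `WaitForPage` and the 30-second default are hypothetical.

```vba
' Hypothetical helper: waits until IE has finished loading or the timeout expires.
' Returns True if the page reached readyState 4 (READYSTATE_COMPLETE) in time.
Function WaitForPage(ByVal objIE As Object, _
                     Optional ByVal timeoutSeconds As Long = 30) As Boolean
    Dim deadline As Date
    ' A Date deadline avoids the midnight reset of the Timer function
    deadline = DateAdd("s", timeoutSeconds, Now)

    Do While objIE.Busy Or objIE.readyState <> 4
        DoEvents
        If Now > deadline Then Exit Function   ' gives up and returns False
    Loop

    WaitForPage = True
End Function
```

A caller could use it after `Navigate` or after a `Click` that loads a new page, for example `If Not WaitForPage(objIE) Then Debug.Print "Page did not finish loading"`.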

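6. Extracting Data from the Page

Once the page source has been retrieved and parsed, the required data can be extracted. The routine below prints the title and the links of each search result.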

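Sub GetSearchResult()
    Dim objIE As Object
    Dim objDoc As Object
    Dim objDivs As Object
    Dim objDiv As Object
    Dim objLinks As Object
    Dim objLink As Object
    Set objIE = CreateObject("InternetExplorer.Application")
    objIE.Visible = False
    ' Load a results page; the home page itself contains no result list
    objIE.Navigate "http://www.baidu.com/s?wd=VBA"
    Do While objIE.Busy Or objIE.readyState <> 4
        DoEvents
    Loop
    Set objDoc = objIE.document
    ' "content_left" and "result c-container" are assumed names for Baidu's
    ' result list and result blocks; verify them against the live markup
    Set objDivs = objDoc.getElementById("content_left").getElementsByClassName("result c-container")
    For Each objDiv In objDivs
        ' The title is assumed to be the first <h3> inside each result block
        Debug.Print "Title: " & objDiv.getElementsByTagName("h3")(0).innerText
        Set objLinks = objDiv.getElementsByTagName("a")
        For Each objLink In objLinks
            ' Print absolute links only
            If Left(objLink.href, 4) = "http" Then
                Debug.Print "URL:" & objLink.href
            End If
        Next objLink
        Debug.Print "------------------------"
    Next objDiv
    objIE.Quit
End Sub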

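7. Batch Scraping

In practice you often need to scrape many pages rather than one. The routines below walk through ten result pages and save the first link of each result to disk.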

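Sub BatchDownload()
    Dim objIE As Object
    Dim objDoc As Object
    Dim objDivs As Object
    Dim objDiv As Object
    Dim objLinks As Object
    Dim objLink As Object
    Dim i As Integer
    Set objIE = CreateObject("InternetExplorer.Application")
    objIE.Visible = False
    ' Visit ten result pages; pn advances in steps of 10 (0, 10, ..., 90)
    For i = 0 To 9
        objIE.Navigate "http://www.baidu.com/s?wd=VBA&pn=" & i * 10
        Do While objIE.Busy Or objIE.readyState <> 4
            DoEvents
        Loop
        Set objDoc = objIE.document
        ' "content_left" and "result c-container" are assumed names; verify them
        Set objDivs = objDoc.getElementById("content_left").getElementsByClassName("result c-container")
        For Each objDiv In objDivs
            Set objLinks = objDiv.getElementsByTagName("a")
            For Each objLink In objLinks
                If Left(objLink.href, 4) = "http" Then
                    ' Save only the first absolute link of each result
                    Call DownloadPage(objLink.href)
                    Exit For
                End If
            Next objLink
        Next objDiv
    Next i
    objIE.Quit
End Sub

Sub DownloadPage(ByVal url As String)
    Dim httpReq As Object, fsObj As Object, tsObj As Object
    Dim strHTML As String, strPath As String, strFileName As String
    Dim ch As Variant
    ' Fetch the page with a synchronous GET request
    Set httpReq = CreateObject("MSXML2.XMLHTTP")
    httpReq.Open "GET", url, False
    httpReq.send
    strHTML = httpReq.responseText
    Set fsObj = CreateObject("Scripting.FileSystemObject")
    strPath = "C:\Temp\"
    If Not fsObj.FolderExists("C:\Temp") Then fsObj.CreateFolder "C:\Temp"
    ' Build a file name from the URL, replacing characters Windows forbids in file names
    strFileName = Replace(Replace(url, "/", "-"), ":", "")
    For Each ch In Array("\", "?", "*", """", "<", ">", "|")
        strFileName = Replace(strFileName, ch, "_")
    Next ch
    strFileName = strFileName & ".html"
    ' 2 = ForWriting, True = create if missing, -1 = write as Unicode
    Set tsObj = fsObj.OpenTextFile(strPath & strFileName, 2, True, -1)
    tsObj.Write strHTML
    tsObj.Close
End Sub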

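8. Dealing with Anti-Scraping Measures

When scraping, be aware of the measures sites use to detect and block automated access.
CAPTCHA: after too many visits, the site may require the user to solve a CAPTCHA.
IP limits: when one IP address requests a page too many times, the site may block that address.
User-Agent limits: when one User-Agent string requests a page too many times, the site may block requests that carry it.
Referer checks: when a request does not come from an accepted source, the site may reject it.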
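
The User-Agent and Referer items above both concern request headers, and the frequency-based limits concern pacing. As a rough sketch, assuming the `WinHttp.WinHttpRequest.5.1` component is available, a request can set both headers explicitly and pause between calls. The names `FetchWithHeaders` and `Pause`, the User-Agent string, and the two-second delay are illustrative assumptions, not values any particular site requires. Headers and pacing do not address CAPTCHAs or IP limits.

```vba
' Hypothetical helper: GET a URL with explicit User-Agent and Referer headers.
' Returns the response body, or an empty string if the status is not 200.
Function FetchWithHeaders(ByVal url As String, ByVal referer As String) As String
    Dim req As Object
    Set req = CreateObject("WinHttp.WinHttpRequest.5.1")
    req.Open "GET", url, False
    ' Illustrative User-Agent value; substitute one appropriate for your use
    req.SetRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    If Len(referer) > 0 Then req.SetRequestHeader "Referer", referer
    req.Send
    If req.Status = 200 Then FetchWithHeaders = req.ResponseText
End Function

' Hypothetical helper: waits the given number of seconds while keeping the host responsive.
Sub Pause(ByVal seconds As Long)
    Dim deadline As Date
    deadline = DateAdd("s", seconds, Now)
    Do While Now < deadline
        DoEvents
    Loop
End Sub

Sub DemoFetch()
    Dim i As Long
    Dim html As String
    For i = 0 To 2
        html = FetchWithHeaders("http://www.baidu.com/s?wd=VBA&pn=" & i * 10, _
                                "http://www.baidu.com/")
        Debug.Print "Page " & i & ": " & Len(html) & " characters"
        Pause 2   ' assumed delay between requests
    Next i
End Sub
```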