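4. Parsing the Page Source
Once the page source has been fetched, it has to be parsed to extract the data you need. The HTML DOM (Document Object Model) is the usual tool for this. The procedure below opens the Baidu home page in a hidden Internet Explorer instance and prints the current value of the search box.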
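Sub GetSearchBox()
    Dim objIE As Object
    Dim objDoc As Object

    Set objIE = CreateObject("InternetExplorer.Application")
    objIE.Visible = False
    objIE.Navigate "http://www.baidu.com"

    ' Wait until the page has finished loading (4 = READYSTATE_COMPLETE).
    Do While objIE.Busy Or objIE.readyState <> 4
        DoEvents
    Loop

    Set objDoc = objIE.document
    ' "kw" is assumed to be the id of Baidu's search input.
    Debug.Print objDoc.getElementById("kw").Value

    objIE.Quit    ' close the hidden browser instance
End Sub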
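The same DOM calls can also be applied to HTML that is already held in a string, without opening a browser window. The following is a minimal sketch, assuming the late-bound "htmlfile" (MSHTML) document object is available on the machine. The function name GetValueById and the sample markup are hypothetical and used only for illustration.

```vba
Option Explicit

' Parse an HTML fragment with the DOM and return the value of the element
' whose id is elementId, or "" if no such element exists.
Function GetValueById(ByVal html As String, ByVal elementId As String) As String
    Dim doc As Object
    Dim el As Object

    Set doc = CreateObject("htmlfile")    ' late-bound MSHTML document
    doc.body.innerHTML = html             ' let MSHTML build the DOM tree

    Set el = doc.getElementById(elementId)
    If Not el Is Nothing Then GetValueById = el.Value
End Function

Sub DemoGetValueById()
    ' Hypothetical markup standing in for a fetched page.
    Const SAMPLE As String = "<form><input id=""kw"" value=""VBA""></form>"
    Debug.Print GetValueById(SAMPLE, "kw")
End Sub
```

Because nothing is loaded over the network here, no wait loop is needed, which makes this form convenient for trying out a selector against saved markup.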
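5. Automating Page Interaction
Parsing a page often goes together with automated interaction, such as filling in a form or clicking a button. The procedure below types a keyword into the search box and clicks the search button.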
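Sub SearchKeyword(ByVal keyword As String)
    Dim objIE As Object
    Dim objDoc As Object

    Set objIE = CreateObject("InternetExplorer.Application")
    objIE.Visible = False
    objIE.Navigate "http://www.baidu.com"

    ' Wait until the page has finished loading (4 = READYSTATE_COMPLETE).
    Do While objIE.Busy Or objIE.readyState <> 4
        DoEvents
    Loop

    Set objDoc = objIE.document
    ' "kw" and "su" are assumed to be the ids of the search input and button.
    objDoc.getElementById("kw").Value = keyword
    objDoc.getElementById("su").Click
End Sub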
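The wait loop used in these examples never gives up if a page fails to finish loading. One possible refinement is to bound the wait with a timeout, sketched below. The helper name WaitForLoad and the 30-second default are assumptions, not part of the original procedures.

```vba
Option Explicit

' Wait until the browser reports a finished load, or give up after
' timeoutSeconds. Returns True only if the page finished loading in time.
Function WaitForLoad(ByVal browser As Object, _
                     Optional ByVal timeoutSeconds As Integer = 30) As Boolean
    Dim deadline As Date
    deadline = Now + TimeSerial(0, 0, timeoutSeconds)

    Do While browser.Busy Or browser.readyState <> 4    ' 4 = READYSTATE_COMPLETE
        If Now > deadline Then Exit Function             ' timed out: result stays False
        DoEvents
    Loop

    WaitForLoad = True
End Function
```

A caller could then write `If Not WaitForLoad(objIE) Then Exit Sub` after each `Navigate` call instead of repeating the loop inline.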
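6. Extracting Data from the Page
Once the page source has been fetched and parsed, the required data can be extracted. The procedure below walks the search results and prints the text and links of each one.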
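Sub GetSearchResult()
    Dim objIE As Object
    Dim objDoc As Object
    Dim objDivs As Object
    Dim objDiv As Object
    Dim objLinks As Object
    Dim objLink As Object

    Set objIE = CreateObject("InternetExplorer.Application")
    objIE.Visible = False
    ' Load a results page; the home page itself contains no results.
    objIE.Navigate "http://www.baidu.com/s?wd=VBA"

    Do While objIE.Busy Or objIE.readyState <> 4
        DoEvents
    Loop

    Set objDoc = objIE.document
    ' "content_left" (results container) and "result" (class of each result)
    ' are assumed names; adjust them to the page's actual markup.
    Set objDivs = objDoc.getElementById("content_left").getElementsByClassName("result")

    For Each objDiv In objDivs
        ' Assumes the title of each result is in its first <h3> element.
        Debug.Print "Title: " & objDiv.getElementsByTagName("h3")(0).innerText

        Set objLinks = objDiv.getElementsByTagName("a")
        For Each objLink In objLinks
            ' Print absolute links only.
            If Left(objLink.href, 4) = "http" Then
                Debug.Print "URL:" & objLink.href
            End If
        Next objLink
        Debug.Print "------------------------"
    Next objDiv

    objIE.Quit    ' close the hidden browser instance
End Sub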
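7. Scraping Pages in Bulk
In practice you often need to scrape many pages rather than one. The procedures below step through ten pages of search results and save the first link of each result to disk.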
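Sub BatchDownload()
    Dim objIE As Object, objDoc As Object
    Dim objDivs As Object, objDiv As Object
    Dim objLinks As Object, objLink As Object
    Dim i As Integer

    Set objIE = CreateObject("InternetExplorer.Application")
    objIE.Visible = False

    ' Request ten result pages; pn steps through 0, 10, ..., 90.
    For i = 0 To 9
        objIE.Navigate "http://www.baidu.com/s?wd=VBA&pn=" & i * 10
        Do While objIE.Busy Or objIE.readyState <> 4
            DoEvents
        Loop

        Set objDoc = objIE.document
        ' "content_left" and "result" are assumed names for the results
        ' container and the class of each result.
        Set objDivs = objDoc.getElementById("content_left").getElementsByClassName("result")

        For Each objDiv In objDivs
            Set objLinks = objDiv.getElementsByTagName("a")
            For Each objLink In objLinks
                If Left(objLink.href, 4) = "http" Then
                    ' Download only the first absolute link of each result.
                    Call DownloadPage(objLink.href)
                    Exit For
                End If
            Next objLink
        Next objDiv
    Next i

    objIE.Quit    ' close the hidden browser instance
End Sub

Sub DownloadPage(ByVal url As String)
    Dim httpReq As Object, fsObj As Object, tsObj As Object
    Dim strHTML As String, strPath As String, strFileName As String

    Set httpReq = CreateObject("MSXML2.XMLHTTP")
    httpReq.Open "GET", url, False    ' False = synchronous request
    httpReq.send
    strHTML = httpReq.responseText

    Set fsObj = CreateObject("Scripting.FileSystemObject")
    strPath = "C:\Temp\"              ' this folder must already exist
    ' Build a file name from the URL by removing characters Windows rejects.
    strFileName = Replace(Replace(Replace(url, "/", "-"), ":", ""), "?", "") & ".html"

    ' 2 = ForWriting, True = create if missing, -1 = Unicode (needed for non-ANSI text).
    Set tsObj = fsObj.OpenTextFile(strPath & strFileName, 2, True, -1)
    tsObj.Write strHTML
    tsObj.Close
End Sub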
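8. Dealing with Anti-Scraping Measures
When scraping, you need to take the target site's anti-scraping measures into account. Common ones include:
CAPTCHA: after too many visits, the site asks the user to solve a CAPTCHA.
IP limits: when a single IP address requests a page too many times, the site blocks further access from that address.
User-Agent limits: when a single User-Agent requests a page too many times, the site blocks further access from that User-Agent.
Referer checks: when a request does not come from an accepted source page, the site rejects it.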
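For the header-based checks and rate limits listed above, one common approach is to send explicit User-Agent and Referer headers and to pause between requests. The following is a minimal sketch, assuming the WinHttp.WinHttpRequest.5.1 COM object is available. The function name FetchWithHeaders, the User-Agent string, and the 2-second default delay are illustrative assumptions, and none of this guarantees that a given site will accept the request.

```vba
Option Explicit

' Fetch a page with explicit request headers, pausing before the request.
' Returns the response body, or "" if the status is not 200.
Function FetchWithHeaders(ByVal url As String, ByVal referer As String, _
                          Optional ByVal delaySeconds As Integer = 2) As String
    Dim http As Object
    Dim resumeAt As Date

    ' Pause so that repeated calls do not hit the server in a tight loop.
    resumeAt = Now + TimeSerial(0, 0, delaySeconds)
    Do While Now < resumeAt
        DoEvents
    Loop

    Set http = CreateObject("WinHttp.WinHttpRequest.5.1")
    http.Open "GET", url, False    ' False = synchronous request
    ' Illustrative header values; choose ones appropriate for your use.
    http.SetRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    http.SetRequestHeader "Referer", referer
    http.Send

    If http.Status = 200 Then FetchWithHeaders = http.ResponseText
End Function
```

A call such as `FetchWithHeaders("http://www.baidu.com/s?wd=VBA", "http://www.baidu.com")` would request a results page while presenting the home page as the referring page. CAPTCHAs and IP limits are not addressed by this sketch.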