Trying out web scraping with PowerShell

Out of the blue, let's try scraping a website with PowerShell.

To send an HTTP request from PowerShell, use the Invoke-WebRequest cmdlet.

PS D:\Users\dacci> Invoke-WebRequest 'http://example.org/'


StatusCode        : 200
StatusDescription : OK
Content           : <!doctype html>
                    <html>
                    <head>
                        <title>Example Domain</title>

                        <meta charset="utf-8" />
                        <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
                        <meta name="viewport" conten...
RawContent        : HTTP/1.1 200 OK
                    Vary: Accept-Encoding
                    X-Cache: HIT
                    Accept-Ranges: bytes
                    Content-Length: 1270
                    Cache-Control: max-age=604800
                    Content-Type: text/html
                    Date: Fri, 26 May 2017 14:51:17 GMT
                    Expires: ...
Forms             : {}
Headers           : {[Vary, Accept-Encoding], [X-Cache, HIT], [Accept-Ranges, bytes], [Content-Length, 1270]...}
Images            : {}
InputFields       : {}
Links             : {@{innerHTML=More information...; innerText=More information...; outerHTML=<A href="http://www.iana
                    .org/domains/example">More information...</A>; outerText=More information...; tagName=A; href=http:
                    //www.iana.org/domains/example}}
ParsedHtml        : System.__ComObject
RawContentLength  : 1270

The result apparently comes back as some kind of object. Let's find out what kind.

PS D:\Users\dacci> Invoke-WebRequest 'http://example.org/' | Get-Member


   TypeName: Microsoft.PowerShell.Commands.HtmlWebResponseObject

Name              MemberType Definition
----              ---------- ----------
Dispose           Method     void Dispose(), void IDisposable.Dispose()
Equals            Method     bool Equals(System.Object obj)
GetHashCode       Method     int GetHashCode()
GetType           Method     type GetType()
ToString          Method     string ToString()
AllElements       Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection AllElements {get;}
BaseResponse      Property   System.Net.WebResponse BaseResponse {get;set;}
Content           Property   string Content {get;}
Forms             Property   Microsoft.PowerShell.Commands.FormObjectCollection Forms {get;}
Headers           Property   System.Collections.Generic.Dictionary[string,string] Headers {get;}
Images            Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection Images {get;}
InputFields       Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection InputFields {get;}
Links             Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection Links {get;}
ParsedHtml        Property   mshtml.IHTMLDocument2 ParsedHtml {get;}
RawContent        Property   string RawContent {get;set;}
RawContentLength  Property   long RawContentLength {get;}
RawContentStream  Property   System.IO.MemoryStream RawContentStream {get;}
Scripts           Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection Scripts {get;}
StatusCode        Property   int StatusCode {get;}
StatusDescription Property   string StatusDescription {get;}
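
Since the response is an ordinary object, its properties can be read one at a time instead of dumping everything. The following is only a rough sketch: Get-ResponseSummary is a hypothetical helper name, not part of PowerShell, and the $fake object is an offline stand-in that merely imitates the StatusCode, Headers, and Links properties listed above.

```powershell
# Hypothetical helper: picks a few properties out of a response-like object.
function Get-ResponseSummary {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        $Response
    )
    process {
        [pscustomobject]@{
            StatusCode  = $Response.StatusCode
            ContentType = $Response.Headers['Content-Type']
            # @() makes sure a single link is still counted as a collection.
            LinkCount   = @($Response.Links).Count
        }
    }
}

# Offline stand-in (an assumption for illustration, not a real response).
$fake = [pscustomobject]@{
    StatusCode = 200
    Headers    = @{ 'Content-Type' = 'text/html' }
    Links      = @([pscustomobject]@{ href = 'http://www.iana.org/domains/example' })
}

$summary = $fake | Get-ResponseSummary
$summary
```

Against a live page the same helper would presumably be used as Invoke-WebRequest 'http://example.org/' | Get-ResponseSummary, which needs network access.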

It appears to be a class called HtmlWebResponseObject; its specification is in the MSDN Library. ParsedHtml appears to be an IHTMLDocument2, so it should give access to the DOM tree. Let's try it.

PS D:\Users\dacci> $Response = Invoke-WebRequest 'http://example.org/'
PS D:\Users\dacci> $Response.ParsedHtml.getElementsByTagName('p') | Select-Object innerText

innerText
---------
This domain is established to be used for illustrative examples in documents. You may use this domain in examples wi...
More information...

Is that a little hard to follow? Let's try this instead.

PS D:\Users\dacci> $Response = Invoke-WebRequest 'https://www.google.co.jp/search?q=BKB&tbm=isch'
PS D:\Users\dacci> $Response.ParsedHtml.getElementsByTagName('img') | Select-Object src

src
---
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTADCtmeQE39WiR8PW9lEQIjb0dhBGlyBVWIrvSsSmG76E6TJAOROLgb48
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSNrhqxpmfhS7q2Kfnks44XWBMIFvFMwCa1E37zqjWTfaT1-16hHRTqDNzG
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSVdZ1JsRC__JfXyQGjLlAbVZHJMkfJwStbiWMj17ya0MEFLS3fHk3HIl4
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRly7RQUl2_O-RpqF6WpOa-G2gsFkdtARasNCx6izoD1oIlIHgtWAMdrCo
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQXtostzTjbjlRCpfMaOIYyU9wm2kHBPHaEPZc2v30s9jGIHmv8bnQ_8w8U
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR7Ea1x3z5BWf0hy0CtX10KCTNZTUwWXf7wygj12PvLmzH4Zk_Ix_C5Qsw
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRpDTbrxtseDJd8Yvr9qfIzhTg8xPg1SdCp2AGIw42CAUWrWAlDyu0Llx0x
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTBk3FCVXf3QF3fj8W9ZzYoF8JFuvkS9ECbmqsEYP9FQ39xL4L3lNXak2QT
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR910YJ_nt9ClK4vZvemyRJWKsuS5AwUCOymnJmEQRaXb51GAfzOpdGw2xW
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTf7O1MFtTszRljwOXdv6FCgcit_wzvsGzQ4QCdfoIS41TXxAvB9Kxe7o6f
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSfp0Sinu_qg-DEZ5buQ3Rrsf5RiBHG82JnZLE-pNGgjk7NzMTzOxVXLw
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT39kupqy7s37liS_PIWBGzZOfiLP3a8TnOpVwMYckz7pm80S7NPwoUgXb3
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRFr4ui3muADgW0nTge0L8nmQoE1lV-zgvOimKndW95mVcusXSnM73ZMyAB
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS1NoxwZLTFdh94brkMcdoQy-xrQBVPwEE9jAbqRcL6_voBU4wy98HORis
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTpKn-WIbpDVS9OKB8I8LieNXypojFXrNhg4hQCuPszzceJ6vrC_p9OCw
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRFNG93iXtF6Vns3YNWF_RWeTekpc1K512jeVF3NOHwCPXQzGrqlQMptg
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRHthT8TIbqet-5ai5Kd3BONFWtcPYX181rH7W9_6WkMVvKfIbpSAFjlTM
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQEObWAyf2Vg0fIj9woFNNBS8YeTuKm0K8UiZoZd0Lrsc6aTyYLLL4IZ7M
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTyO7kalfP-76Ls3bZdVfK9n28hcVEeEUjA3Fqi0huuaO2j9jMofBWmVnk
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTzCczGfshsuKVNRe3PxSM7brxaWh-w0rkmBTzUddBoGyv4llzUbb5364FS

That said, if all you want is the images in a page, the Images property is quicker.

PS D:\Users\dacci> $Response.Images | Select-Object src

src
---
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTADCtmeQE39WiR8PW9lEQIjb0dhBGlyBVWIrvSsSmG76E6TJAOROLgb48
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSNrhqxpmfhS7q2Kfnks44XWBMIFvFMwCa1E37zqjWTfaT1-16hHRTqDNzG
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSVdZ1JsRC__JfXyQGjLlAbVZHJMkfJwStbiWMj17ya0MEFLS3fHk3HIl4
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRly7RQUl2_O-RpqF6WpOa-G2gsFkdtARasNCx6izoD1oIlIHgtWAMdrCo
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQXtostzTjbjlRCpfMaOIYyU9wm2kHBPHaEPZc2v30s9jGIHmv8bnQ_8w8U
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR7Ea1x3z5BWf0hy0CtX10KCTNZTUwWXf7wygj12PvLmzH4Zk_Ix_C5Qsw
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRpDTbrxtseDJd8Yvr9qfIzhTg8xPg1SdCp2AGIw42CAUWrWAlDyu0Llx0x
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTBk3FCVXf3QF3fj8W9ZzYoF8JFuvkS9ECbmqsEYP9FQ39xL4L3lNXak2QT
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR910YJ_nt9ClK4vZvemyRJWKsuS5AwUCOymnJmEQRaXb51GAfzOpdGw2xW
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTf7O1MFtTszRljwOXdv6FCgcit_wzvsGzQ4QCdfoIS41TXxAvB9Kxe7o6f
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSfp0Sinu_qg-DEZ5buQ3Rrsf5RiBHG82JnZLE-pNGgjk7NzMTzOxVXLw
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT39kupqy7s37liS_PIWBGzZOfiLP3a8TnOpVwMYckz7pm80S7NPwoUgXb3
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRFr4ui3muADgW0nTge0L8nmQoE1lV-zgvOimKndW95mVcusXSnM73ZMyAB
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS1NoxwZLTFdh94brkMcdoQy-xrQBVPwEE9jAbqRcL6_voBU4wy98HORis
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTpKn-WIbpDVS9OKB8I8LieNXypojFXrNhg4hQCuPszzceJ6vrC_p9OCw
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRFNG93iXtF6Vns3YNWF_RWeTekpc1K512jeVF3NOHwCPXQzGrqlQMptg
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRHthT8TIbqet-5ai5Kd3BONFWtcPYX181rH7W9_6WkMVvKfIbpSAFjlTM
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQEObWAyf2Vg0fIj9woFNNBS8YeTuKm0K8UiZoZd0Lrsc6aTyYLLL4IZ7M
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTyO7kalfP-76Ls3bZdVfK9n28hcVEeEUjA3Fqi0huuaO2j9jMofBWmVnk
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTzCczGfshsuKVNRe3PxSM7brxaWh-w0rkmBTzUddBoGyv4llzUbb5364FS
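
The src values above happen to be absolute URLs, but on other pages they may be relative. As a minimal sketch, assuming that case matters to you, the hypothetical helper below (Resolve-ImageUrl is a made-up name) resolves each src against the page URL using the .NET System.Uri class. The sample objects are offline stand-ins for the elements in $Response.Images.

```powershell
# Hypothetical helper: turns the src of each element into an absolute URL.
function Resolve-ImageUrl {
    param(
        [Parameter(Mandatory)]
        [uri] $BaseUri,

        [Parameter(ValueFromPipeline)]
        $Element
    )
    process {
        if ($Element.src) {
            # Uri(Uri, string) resolves a relative src against the page URL
            # and leaves an already absolute src unchanged.
            [uri]::new($BaseUri, [string]$Element.src).AbsoluteUri
        }
    }
}

# Offline stand-ins for image elements (assumed values, for illustration only).
$sample = @(
    [pscustomobject]@{ src = '/img/a.png' }
    [pscustomobject]@{ src = 'https://cdn.example.org/b.png' }
)

$urls = $sample | Resolve-ImageUrl -BaseUri 'http://example.org/page'
$urls
# http://example.org/img/a.png
# https://cdn.example.org/b.png
```

With a real response, the pipeline would presumably start from $Response.Images instead of $sample.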

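Incidentally, there is also a cmdlet called Invoke-RestMethod. It is better suited to calling APIs: it reads the response and changes the type of object it returns accordingly, for example a PSObject when the response is JSON and an XmlDocument when it is XML.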
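
To get a feel for what those two kinds of object look like without calling a real API, here is a small offline sketch. It does not call Invoke-RestMethod at all: ConvertFrom-Json and the [xml] cast are used as stand-ins for the conversion, and the JSON and XML strings are made-up sample data.

```powershell
# Stand-in for a JSON response body (made-up sample data).
$json = '{"name":"example","tags":["a","b"]}'
$fromJson = $json | ConvertFrom-Json
$fromJson.GetType().FullName   # System.Management.Automation.PSCustomObject
$fromJson.tags[1]              # b

# Stand-in for an XML response body (made-up sample data).
$fromXml = [xml]'<feed><title>Example</title></feed>'
$fromXml.GetType().FullName    # System.Xml.XmlDocument
$fromXml.feed.title            # Example

# A live call would look something like this (hypothetical URL):
#   $result = Invoke-RestMethod 'https://api.example.org/items'
```

Either way the fields are reachable with plain dot notation, which is what makes this convenient for API work.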