Scraping ASP Web Pages

For a couple of years now, I’ve been using a Python-based web scraper that runs once a day on morph.io to scrape planning applications from the Isle of Wight website into a simple SQLite database. (It actually feeds a WordPress plugin I started tinkering with to display currently open applications in a standing test blog post. I really should tidy that extension up and blog it one day…)

In many cases you can get a copy of the HTML content of the page you want to scrape simply by making an HTTP GET request to its URL. Some pages, however, only display the content you want after a form on the page makes an HTTP POST request to the same URL, with the response depending on the POSTed form variables.
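As a minimal sketch of the difference using the Python requests package (the URL and form fields here are just placeholders):

import requests

#A simple GET request returns the page content directly...
html = requests.get('https://example.com/page').text

#...whereas a form-driven page only returns the content we want in response
#to a POST containing the form variables it expects
html = requests.post('https://example.com/page',
                     data={'field1': 'value1', 'field2': 'value2'}).text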

In some cases, such as on the Isle of Wight Council planning applications page, the form post is masked as a link that fires off a JavaScript function, which posts the form content in order to obtain a set of query results.

The JavaScript function draws on state baked into the page (ASP.NET hidden form fields such as __VIEWSTATE and __EVENTVALIDATION) to make the form request. This state is required in order to get a valid response, and with it the list of current applications.
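To see what state a page like this is carrying, a quick sketch along the following lines (assuming url holds the planning applications page URL) will list the hidden form fields and the start of their values:

import requests
from bs4 import BeautifulSoup

#List the hidden form fields baked into the page
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for hidden in soup.find_all('input', type='hidden'):
    print(hidden.get('name'), (hidden.get('value') or '')[:40])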

We can automate the grabbing of this state as part of our scraper by loading the page, grabbing the state data, mimicking the form content, and making the POST request that would otherwise be triggered by the JavaScript function run as a result of clicking the “get all applications” link:

import requests
from bs4 import BeautifulSoup

#Get the original page (url is the planning applications page URL)
response = requests.get(url)

#Scrape the state data we need to validate the form request
soup = BeautifulSoup(response.content, 'html.parser')
viewstate = soup.find('input', id='__VIEWSTATE')['value']
eventvalidation = soup.find('input', id='__EVENTVALIDATION')['value']
viewstategenerator = soup.find('input', id='__VIEWSTATEGENERATOR')['value']

#Mimic the form data the JavaScript function would post
params = {'__EVENTTARGET': 'lnkShowAll', '__EVENTARGUMENT': '',
          '__VIEWSTATE': viewstate,
          '__VIEWSTATEGENERATOR': viewstategenerator,
          '__EVENTVALIDATION': eventvalidation,
          'q': 'Search the site...'}

#Use the validation data when making the request for all current applications
r = requests.post(url, data=params)

In the last couple of weeks, I’ve noticed daily errors from morph.io when trying to run this scraper. Sometimes errors come and go, perhaps as a result of the server on the other end being slow to respond, or maybe because an edge case in the scraped data trips up the scraper; but this error seemed to persist, so I revisited the scraper.

Running the scraper script locally, it seemed that my form request wasn’t returning the list of applications; it was just returning the original planning applications page. So why had my script stopped working?
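One rough and ready way to check what the POST actually returned is to look for some text that only appears in the results listing (the marker string below is a made-up placeholder, not the actual page text):

#Crude check on what the POST actually returned
#'RESULTS_MARKER' stands in for some text that only appears in the applications listing
if 'RESULTS_MARKER' in r.text:
    print('Got the applications listing')
else:
    print('Just got the original page back again')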

Scanning the planning applications page HTML, all looked to be much as it was before, so I clicked through on the all applications link and looked at the data now being posted to the server by the official page, using my Chrome browser’s Developer Tools (which can be found from the browser View menu).

Inspecting the form data, everything looked much as it had done before, except perhaps for some blank txt* arguments.
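Mimicking those blank fields in the form data is straightforward enough (the field names below are hypothetical stand-ins for the actual txt* names shown in the Developer Tools):

#Add the blank txt* fields seen in the browser's form data
#(the names here are hypothetical stand-ins for the real ones)
for txtfield in ['txtLocation', 'txtApplicationNo']:
    params[txtfield] = ''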

Adding those into the form data didn’t fix the problem, so I wondered if the page was now responding to cookies, or was perhaps sensitive to the user agent?

We can handle that easily enough in the scraper script:

#Use a requests session rather than making simple requests - this should allow
#cookies to be set and preserved across requests
#(Running the browser in do-not-track mode can help limit the cookies that are set to essential ones)
session = requests.Session()

#We can also add a user agent string so the scraper script looks like a real browser...
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
session.headers.update(headers)

#Get the original page and scrape the state data as before
response = session.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

viewstate = soup.find('input', id='__VIEWSTATE')['value']
eventvalidation = soup.find('input', id='__EVENTVALIDATION')['value']
viewstategenerator = soup.find('input', id='__VIEWSTATEGENERATOR')['value']
params = {'__EVENTTARGET': 'lnkShowAll', '__EVENTARGUMENT': '',
          '__VIEWSTATE': viewstate,
          '__VIEWSTATEGENERATOR': viewstategenerator,
          '__EVENTVALIDATION': eventvalidation,
          'q': 'Search the site...'}

#Get all current applications using the same session
#(the session's headers are sent automatically with each request)
r = session.post(url, data=params)

But still no joy… so what headers were being used in the actual request on the live website?
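On the scripted side, requests keeps a record of the headers it actually sent with each request, so we can print those out for comparison against what the browser sends:

#Inspect the headers that accompanied our scripted POST
print(r.request.headers)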

Hmmm… maybe the server is now checking that requests are being made from the planning applications web page, using the Host, Origin and/or Referer headers? That is, maybe the server is only responding to requests it thinks are being made from its own web pages on its own site?

Let’s add some similar data to the headers in our scripted request:

session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
session.headers.update(headers)

response = session.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

viewstate = soup.find('input', id='__VIEWSTATE')['value']
eventvalidation = soup.find('input', id='__EVENTVALIDATION')['value']
viewstategenerator = soup.find('input', id='__VIEWSTATEGENERATOR')['value']
params = {'__EVENTTARGET': 'lnkShowAll', '__EVENTARGUMENT': '',
          '__VIEWSTATE': viewstate,
          '__VIEWSTATEGENERATOR': viewstategenerator,
          '__EVENTVALIDATION': eventvalidation,
          'q': 'Search the site...'}

#Add in some more header data...
#Populate the Referer header from the original request URL
headers['Referer'] = response.request.url
#We could (should) extract this info by parsing the Referer; hard code for now...
headers['Origin'] = 'https://www.iow.gov.uk'
headers['Host'] = 'www.iow.gov.uk'

#Get all current applications, passing the extra headers with the POST
r = session.post(url, headers=headers, data=params)

And… success:-)

*Hopefully by posting this recipe, the page isn’t locked down further… In mitigation, I haven’t described how to pull the actual planning applications data off the page…
