Screen Scraping – the Last Best Option

I have a confession to make. I am a people pleaser. I really want people to like the software I write, and I agonize over the possibility that someone somewhere is muttering obscenities as they use something I wrote. When I write a blog post I compulsively look at analytics stats for the next couple of days wondering ‘did they like it?’ ‘how long did people spend reading it?’ ‘how many people read it?’ ‘where did they come from?’ ‘how many people are reading it right now?’ ‘what about now?’ ‘what about now?’ ‘what about now?’

It’s truly pathetic.

This need to please translates to a need to write software that works just right and doesn’t needlessly suck or waste your time. I want you to smile, or at least not scowl, when you use it and be glad that I wrote it.

It means that once I’ve set my mind on having functionality that works a certain way I’ll hack away until I am satisfied with no regard to the opinions of the product’s vendor whatsoever.

The platform will obey me.

…If I recall correctly, at some point I mentioned that I’ve been building some Office 365 apps.

Overall I find the experience most pleasing, and the APIs Microsoft has given us are adequate to build almost anything you might desire. However, every once in a while you’ll come across a need that isn’t directly supported by these APIs, or maybe it is supported but you don’t know it or can’t figure out how to do it even after thorough research and tinkering. When this happens you have a choice: you can give up and make your audience sad, or you can get out your hacksaw, blowtorch, and duct tape and make them clap.

Screen scraping using JavaScript is a great example of duct tape. With it, you can make your code do anything the user can do manually through the browser, subject to her permissions within your site’s domain. This is why cross-site scripting attacks are so dangerous: if you can get JavaScript into a user’s page, you can do anything the user can do, and because of the way the web works there is very little the vendor can do to stop it. It is also the reason we have app webs and app parts: they keep your JavaScript out of the SharePoint page and the SharePoint site’s home domain by isolating your app in its own domain.

The jQuery library makes screen scraping really easy, but it should always be a last resort because the vendor will not support you, nor can you expect it to work in the next version (although most of the pages you’d want to scrape in SharePoint haven’t changed for several versions).

The Scenario – Creating a OneNote 2013 Notebook in Office 365

I was surprised to discover recently that a OneNote notebook is a special type of folder in SharePoint 2013 that contains documents that comprise the notebook’s individual sections. The SharePoint folder is associated with OneNote via its ProgId property. Unfortunately I couldn’t find a way to create a OneNote notebook directly, or a way to set this property to turn a folder into a notebook, via the client object model, but it can be done with very little screen scraping code.
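
For reference, here is a minimal sketch of how you might read that ProgId with the JavaScript client object model. It assumes sp.js is already loaded, it uses a made-up folder path, and the property bag key name is my assumption, so verify all three in your own environment before relying on it.

// A minimal sketch, assuming sp.js is loaded on the page. The folder path
// and the property bag key name are assumptions - verify both before use.
var ctx = SP.ClientContext.get_current();
var folder = ctx.get_web()
    .getFolderByServerRelativeUrl('/Shared Documents/My Notebook');
var props = folder.get_properties();
ctx.load(props); // pull the folder's property bag
ctx.executeQueryAsync(
    function () {
        // Hypothetical key; a notebook folder should report 'OneNote.Notebook'
        console.log('ProgId: ' + props.get_item('vti_x005f_progid'));
    },
    function (sender, args) {
        console.log('Request failed: ' + args.get_message());
    }
);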

Some pseudo code for this JavaScript is… (a hedged jQuery sketch of the same steps follows the list)

  1. Try to get the notebook folder
  2. If there is an error, the folder wasn’t found. Call the function to create it, otherwise there is nothing to do
  3. Issue an HTTP GET request to the CreateNewDocument page, specifying a location and document type 4 (OneNote notebook)
  4. Parse the content of the GET response to create a new HTML document
  5. Find the input element for the file name and set it to the name of the new notebook
  6. Find the input element that specifies the name of the control that caused the ASP.NET post back event and set it to the name of the OK button
  7. Serialize the HTML document’s form to get the post data
  8. Post the form to create the notebook
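
Translated into jQuery, the pseudo code above looks roughly like the sketch below. Treat it as a sketch built on assumptions, not a drop-in implementation: the query string parameter names, the input selectors, and the _layouts path are reconstructions from a Fiddler trace, so verify every one of them against a trace of your own environment before you use it.

// A hedged sketch of the steps above. The parameter names and selectors
// are assumptions from a Fiddler trace - verify them in your environment.
function ensureNotebook(webUrl, libraryUrl, notebookName) {
    var folderUrl = libraryUrl + '/' + notebookName;

    // Steps 1-2: try to get the notebook folder; create it only on failure
    $.ajax({
        url: webUrl + "/_api/web/GetFolderByServerRelativeUrl('" + folderUrl + "')",
        headers: { 'Accept': 'application/json;odata=verbose' }
    }).done(function () {
        // The folder already exists - nothing to do
    }).fail(function () {
        createNotebook(webUrl, libraryUrl, notebookName);
    });
}

function createNotebook(webUrl, libraryUrl, notebookName) {
    // Step 3: GET the CreateNewDocument page; Type=4 requests a OneNote notebook
    var pageUrl = webUrl + '/_layouts/15/CreateNewDocument.aspx' +
        '?SaveLocation=' + encodeURIComponent(libraryUrl) + '&Type=4';

    $.get(pageUrl).done(function (html) {
        // Step 4: parse the response into something jQuery can query
        var page = $('<div/>').html(html);
        var form = page.find('form').first();

        // Step 5: set the file name input (the selector is an assumption)
        form.find("input[id$='fileName']").val(notebookName);

        // Step 6: tell ASP.NET the OK button caused the post back; here I look
        // the button up by its value - adjust to whatever your trace shows
        var okName = form.find("input[value='OK']").attr('name');
        form.find("input[name='__EVENTTARGET']").val(okName);

        // Steps 7-8: serialize the form and POST it back to create the notebook
        $.post(pageUrl, form.serialize());
    });
}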

Figuring this out

This example solution relies on jQuery to talk to SharePoint and illustrates the use of the ajax(), get(), and post() methods to communicate with a web server of any type. Understanding these is key to almost all web communication scenarios. It also relies on jQuery for parsing and document manipulation. The robustness of jQuery makes screen scraping a web site palatable because it doesn’t rely on complex and fragile parsing of the document as raw text. It’s also easy to read.
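
For anyone new to these methods, get() and post() are just shorthand for ajax() with the HTTP method preset; the long form gives you headers and failure handling. Here is a generic sketch with a placeholder URL and selector:

// $.get('/some/page.aspx') is shorthand for this long form; the explicit
// version lets you set headers and handle failures. Placeholders throughout.
$.ajax({
    url: '/some/page.aspx',
    type: 'GET',
    dataType: 'html'
}).done(function (html) {
    // jQuery parses the markup so you never touch the raw text yourself
    var heading = $('<div/>').html(html).find('h1').first().text();
    console.log('First heading: ' + heading);
}).fail(function (xhr) {
    console.log('Request failed with HTTP ' + xhr.status);
});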

The other tool I used is one I consider essential to modern development, Fiddler Web Debugging Proxy from Telerik. All I have to do is turn it on, do the operation I am interested in, and record the results. Once I have the results I can use Fiddler’s inspectors to see what I need to put in my script.

The steps below show this process in action.

Remember that this is a last resort technique! Don’t overuse it! Be prepared to fix it as the target platform evolves!

Happy coding and don’t hurt yourself!

–Doug

Writing screen scraping code step by step.

Step 1. Start Fiddler (if the site uses HTTPS you’ll need a little configuration to decode the traffic: in Fiddler that’s Tools > Fiddler Options > HTTPS > Decrypt HTTPS traffic)

Step 2. Do the work manually and trace the traffic

Step 3. Inspect the trace information to get the required query string parameters and form fields

Step 4. Create the script
