Big Dater – Finding the biggest cheaters with Power BI

Hacking of commercial enterprises is never a good thing and in no way would I condone the dubious logic that was the basis behind the hack and release of the Ashley Madison database. There was no doubt a lot of personal and commercial damage caused.

But with every one of these hacks there is opportunity. The hack has arguably done more for marital fidelity than Fatal Attraction, and it has also given us a chance to take a fresh look at data security, personal data storage and password management practices. For me it's a good chance to bust out one of the new tools in the BI suite, Power BI, on a good-sized, real-world dataset.

If you don't want to run through the technical detail, just scroll down to the Results section below.

Power BI

Power BI comes in both a desktop and a hosted service model. I'm going to use the desktop version, which is available here. It is a powerful BI tool which operates over PowerQuery, the data processing and transformation engine introduced in a previous blog. The tools are built for scale and so are not limited by Excel's annoying 1 million row limit. That's a good thing because there is a lot of data in the Ashley Madison dataset.

Preparing the Data

The data can be downloaded from a number of places. I chose a torrent file sharing link to a set of zipped-up files with an encryption key matching the one the hackers published, to guard against tampering. The data comes as a number of .dump files, which are the tables from a mySQL database, a set of .csv transaction files and a collection of other internal files. All up, about 10GB of zipped-up data to be explored.

The .dump files are mySQL dumps: a CREATE TABLE statement followed by lots and lots of INSERT statements. That's great if you want to reconstruct the database (and if you're prepared to wait a really long time to insert them all), but we don't really need the complexity of a relational database; flat files are just what we need. So I wrote a script to parse the INSERT statements and turn each dump back into a flat CSV file (with a little bit of finesse to get the right quote style for PowerQuery).

The daily transaction CSV folder is much easier: just trim off the titles and concatenate the contents into one great big CSV file. I've written a small C# script (attached) to concatenate the data and do a bit of cleaning along the way (it looks like the hackers, in their haste, didn't flush some files properly, so there are a few truncated lines). Once the data is together it makes one great big 2.5GB AllTransactions.csv file with 10 million transactions!

The code is available here
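As an aside, the same concatenation could also be sketched directly in Power Query using Folder.Files, although at this size the standalone script is quicker. This is a minimal sketch, assuming the daily files sit in a local Transactions folder (the path is an assumption) and share the same layout:

let
    // list every daily csv in the folder (the path is an assumption)
    Source = Folder.Files("C:\AshleyMadison\Transactions"),
    CsvOnly = Table.SelectRows(Source, each Text.Lower([Extension]) = ".csv"),
    // parse each file and promote its own header row
    Parsed = Table.AddColumn(CsvOnly, "Data",
        each Table.PromoteHeaders(Csv.Document([Content], [Delimiter=",", Encoding=1252]))),
    // stack the daily tables into one big table
    AllTransactions = Table.Combine(Parsed[Data])
in
    AllTransactions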

Now, we get our first glimpse at the magnitude of the data available.

File                       Type        Size (GB)  Rows         Columns  Contents
amino_member.dump          mySQL dump  12.7       36,399,119   61       All member details
am_am_member.dump          mySQL dump  9.5        31,971,589   48       Some member details
member_details.dump        mySQL dump  2.3        36,416,003   12       Member details summary table
aminno_member_email.dump   mySQL dump  1.7        36,397,897   8        Member emails and notifications they have signed up for
member_login.dump          mySQL dump  4.2        36,400,530   5        Every account, username and password hash
Transactions folder        Daily CSV   2.4        9,692,318    19       Every credit card transaction, who made it and where from (partial credit card details)

Transactions

There has been a lot of discussion online about the actual validity of the database since there appears to be some irregularity in the makeup of the population of users. There is a heavy bias of males to females (which might be expected) and the females appear not to be distributed properly, suggesting that the female population was actually manufactured by paid employees. To avoid that discussion, I'm going to look at the one part of the dataset that doesn't lie. Money! The Ashley Madison dataset includes every credit card transaction since inception, in a daily csv file.

Start the Power BI application and launch the GetData wizard. This has a large number of different data source options but we are going with plain old .csv to load the AllTransactions.csv file.

This starts the PowerQuery editor and gives the first look at the data. There’s plenty of juicy information in here but gladly the full credit card number is truncated!

So now is the opportunity to pre-process the data for analysis. Choose Edit and then, in the PowerQuery editor, use the Choose Columns button. First we'll pick the interesting columns for analysis.

Since we are in development mode, and the data set is so large, it's worth trimming the data set to keep it manageable. Use the "Keep Rows" button to keep just 100,000 rows, then make a few changes.

PowerQuery is great for importing data from many sources and dealing with large data sets. The code development experience is pretty ordinary, so I recommend using PowerQuery to trim up and filter out unnecessary data to make it as small as possible and leave the complex stuff for later (see DAX below). Trim up the data like this:

  1. The Date column is actually a DateTime, so rename it DateTime. Then use the Date button to add a "DateOnly" column, and "Day", "Month" and "Year" columns.
  2. There are a number of failed attempts to charge against a credit card; these are identified by a non-null value in the ERROR_CODE column. So trim out those rows and drop the column.
  3. The credit card transactions come in 4 types:
    Authorizations – Submit credit card details to take up the free membership offer (Ignore these ones)
    Settlements – Commit previous Authorization (Add these ones)
    Purchases – Buy/Renew a membership outright (Add these ones)
    Credits and Chargebacks – The wife found out, so quickly back out! (Subtract these ones)
  4. Use "Add Custom Column" to add a new column called "Value" which represents the actual value of the transaction, using an if statement to add just the Settlements and Purchases and subtract the refunds.
  5. Add a UserId column that will help identify unique users. There's plenty of personally identifiable information which can be combined to form an effective "UserId". I have used the FIRST NAME, CARD ENDING and ZIP columns, which give a reasonable approximation of a unique user. Select the columns, hit the Merge Columns button, choose "No Separator" and name the column UserId.
  6. Finally, remove all unused columns, including all that nasty personal and credit card information, so we can safely publish the data set.

The Advanced Editor button shows the Power Query code that was generated by the import wizard and the transformation steps.
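That code isn't reproduced here, but a minimal sketch of the shape it ends up in is below. Treat the TYPE and AMOUNT column names as assumptions; ERROR_CODE, FIRST NAME, CARD ENDING and ZIP are the columns referred to above, and the date handling is omitted for brevity.

let
    Source = Csv.Document(File.Contents("C:\AshleyMadison\AllTransactions.csv"), [Delimiter=",", Encoding=1252]),
    Headers = Table.PromoteHeaders(Source),
    // development sample only; this step is removed later when scaling up
    #"Kept First Rows" = Table.FirstN(Headers, 100000),
    // failed charges carry a non-null ERROR_CODE; drop those rows and then the column
    NoErrors = Table.SelectRows(#"Kept First Rows", each [ERROR_CODE] = null or [ERROR_CODE] = ""),
    Trimmed = Table.RemoveColumns(NoErrors, {"ERROR_CODE"}),
    // signed value: count settlements and purchases, subtract credits and chargebacks
    WithValue = Table.AddColumn(Trimmed, "Value", each
        if [TYPE] = "Settlement" or [TYPE] = "Purchase" then Number.From([AMOUNT])
        else if [TYPE] = "Credit" or [TYPE] = "Chargeback" then -Number.From([AMOUNT])
        else 0),
    // approximate a unique user from name, card ending and zip
    WithUserId = Table.CombineColumns(WithValue, {"FIRST NAME", "CARD ENDING", "ZIP"},
        Combiner.CombineTextByDelimiter(""), "UserId")
in
    WithUserId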

Now “Close and Load” the data into Power BI.

Currency

One of the nice challenges in the data set is that the Amount column is recorded in the native currency of the credit card. So if we want to do any reporting across different countries then that number needs to be normalised to a single currency such as the $US. This is easy if we use the Excel spreadsheet created in my previous blog, Use Excel, PowerQuery and Yahoo Finance to manage your Portfolio. Download the Excel workbook from that blog and open it.

First we need a data source that links CountryCode to CurrencyCode; there are plenty available. I chose this one: http://fxtop.com/en/countries-currencies.php . One of the really powerful features of PowerQuery is the ability to directly parse a web page into tables which can be imported into a PowerQuery table using the "From Web" data source.

Open the http://fxtop.com/en/countries-currencies.php page.

Now, using the PowerQuery toolbar, rename the Country column to CountryCode and split the Currency column at the first 3 characters.

Drop all the other columns we don't need. Finally, to retrieve the Rate from Yahoo you need the currency in the form AFNUSD=X, so create a new Custom Column called Symbol.

Which generates this table

Paste the Symbols into the Symbol column of the Portfolio tab then hit Refresh Data and you’ll end up with a table like this with the latest currency rates populated from Yahoo Finance in the “Last Trade Price Only” column.

Add one extra row in the table to handle those records with no country set and assume they are US dollars

Save that as currencies.xlsm and then back to Power Query to use the Get Data -> Excel to import it

And import the Portfolio table from the Excel workbook.

Remove the columns other than CountryCode and Rate.

Now we will be using this table in a significantly large join to bring currency rate data onto the transaction table. To do this efficiently we can help the join performance by defining a primary key on the table. There is no GUI wizard feature to do this, so open the Advanced Editor and add this line to set a primary key on the CountryCode column for the join engine to use.

  #"Indexed Table" = Table.AddKey( #"Renamed Columns", {"CountryCode"}, true) 

Here’s the currency query:
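The query itself isn't shown inline, so here is a minimal sketch of the shape it takes. It assumes the workbook exposes the rates in a table carrying a CountryCode column alongside the Yahoo "Last Trade Price Only" column; the file path and the table name are assumptions.

let
    Source = Excel.Workbook(File.Contents("C:\AshleyMadison\currencies.xlsm"), null, true),
    Rates = Source{[Item="Portfolio", Kind="Table"]}[Data],
    // keep just the country code and its exchange rate to US dollars
    #"Renamed Columns" = Table.RenameColumns(
        Table.SelectColumns(Rates, {"CountryCode", "Last Trade Price Only"}),
        {{"Last Trade Price Only", "Rate"}}),
    // primary key on CountryCode so the join engine can use it
    #"Indexed Table" = Table.AddKey(#"Renamed Columns", {"CountryCode"}, true)
in
    #"Indexed Table"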

Global Data

Given that the site was a global phenomenon, we want to get some global perspective on the data, so let's import some global reference data from https://countrycode.org/ which has some useful country stats.

Import the page using New Source->Web feature in Power Query.

Use the PowerQuery GUI to

  1. Rename the GDP column and Split it on the first space ' '
  2. Then create a new Multiplier column by parsing the Million, Billion and Trillion words into numbers using an if statement

    #"Added Custom" = Table.AddColumn(#"Changed Type1", "Multiplier", each
        if [GDP.2]="Million" then 1000000 else
        if [GDP.2]="Billion" then 1000000000 else
        if [GDP.2]="Trillion" then 1000000000000 else 1),

  3. Split the ISO CODES column by the first ‘/’ and rename the first column CountryCode
  4. Remove all the unnecessary columns
  5. And add an Index on CountryCode for the join performance

The table should look like this

Here’s the generated global query:

https://gist.github.com/petermreid/23e64049b0aa270a1240
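If you'd rather not follow the gist, the shape of the query is roughly as follows. The table index returned by Web.Page and the raw column names (GDP, ISO CODES, POPULATION) are assumptions based on the steps above.

let
    Source = Web.Page(Web.Contents("https://countrycode.org/")),
    Raw = Source{0}[Data],
    // GDP arrives as text like "594 Billion"; split the number from the unit word
    SplitGDP = Table.SplitColumn(Raw, "GDP", Splitter.SplitTextByEachDelimiter({" "}, QuoteStyle.None, false), {"GDP.1", "GDP.2"}),
    #"Changed Type1" = Table.TransformColumnTypes(SplitGDP, {{"GDP.1", type number}}),
    #"Added Custom" = Table.AddColumn(#"Changed Type1", "Multiplier", each
        if [GDP.2]="Million" then 1000000 else
        if [GDP.2]="Billion" then 1000000000 else
        if [GDP.2]="Trillion" then 1000000000000 else 1),
    WithGDP = Table.AddColumn(#"Added Custom", "GDP", each [GDP.1] * [Multiplier], type number),
    // ISO CODES look like "US / USA"; the 2-letter part becomes CountryCode
    WithCountryCode = Table.AddColumn(WithGDP, "CountryCode", each Text.Trim(Text.BeforeDelimiter([ISO CODES], "/"))),
    Trimmed = Table.SelectColumns(WithCountryCode, {"CountryCode", "POPULATION", "GDP"}),
    #"Indexed Table" = Table.AddKey(Trimmed, {"CountryCode"}, true)
in
    #"Indexed Table"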

Use Close and Apply to run all the queries

Relationships

At this point the data sets are all completely independent. On the left hand side of the Power BI Desktop is a button to switch to Relationship mode.

Power BI will do a reasonable job of guessing the relationships but may need a little help. If not already there, create a one-to-many relationship on CountryCode with the transaction table by dragging and dropping the CountryCode column from the "many" fact table onto the "one" dimension tables. (The use of the multidimensional terms from SQL Server Analysis Services is no mistake, as we will see later.)

Data Typing

Now switch into Report mode

There’s a bit of additional data type coercion that needs to happen to help Power BI make the right choices when presenting the data. For each of the geographical entities use the Data Category menu to assign the correct type of geographical data.

This will put a small globe icon next to each and will be useful in the mapping process.

Do the same for the currency fields, setting the Data Type to Currency.

DAX

Now it's time to generate some new columns of data that join across related tables. This is the first opportunity to use DAX (Data Analysis eXpressions), which is a real pleasure. DAX is the query language under PowerPivot and has great debugging, helpful error handling and IntelliSense. On the transactions table use New Column

to add the following columns

Quarter = "Q" & transactions[Qtr]  -- the year quarter string
Value US = transactions[Value]*RELATED('currency'[Rate]) -- the value in $US

(Note the use of the ‘RELATED’ keyword to traverse the join relationship to the currency table)

Value US per Capita(M) = IFERROR(transactions[Value US]/RELATED('global'[POPULATION])*1000000,0) -- the value in $US per million people

(Note how the column names don't have to follow normal syntax rules; spaces and punctuation are fine)

Value US per GDP(M) = IFERROR(transactions[Value US]/RELATED('global'[GDP])*1000000,0) --the value in $US per million GDP
					

Report

Now the fun part, build a report! I did this

And this

And this

And this

Now we can start to really see the data. Don't forget the transactions query is limited to 100,000 rows, which is not even all of the first year of operation. Ashley Madison, by any measure, was already a success, having brought in $US3 million in revenue with 39,000 paying customers!

Scaling Up

Once happy with the result, open up the transactions query

and remove the Kept First Rows step so the full data set is processed.

Now Close and Load the Power Query editor and go make a cup of tea. The full data set is over 2GB and 10 million rows so it’s going to take a while. My i7 8GB SSD laptop takes 20 minutes to process the data.

While that is working, take a look at the Task Manager. The Microsoft Mashup Evaluation container is the Power Query engine doing its work. But interestingly, under the Power BI hood is SQL Server Analysis Services! What we are doing while using the Relationship editor and the Report builder is creating a "cube" which is then embedded in the Power BI .pbix file.

Results

Now with all the data we see the real success story that is Ashley Madison

  • $US240M in total revenue and on target to hit $US100M in 2015 alone (the hack only captured up to Q2 2015)
  • 1M paying customers and climbing rapidly

So, the burning question: which country spent the most on cheating per capita? It's as simple as clicking on the largest red circle or clicking on (sorting by) Value US per Capita in the table.

The Vatican! It has 10 paying customers with 59 separate transactions totalling US$1303. We can even see that the initial transactions occurred in 2009, then a period of inactivity and then a return to paid up member status in 2013, 14, and 15. It’s a little bit unfair to label the Vatican since the population is so small it could be a statistical error, but given 1% of residents were paid up members, there’s certainly something interesting there. The other interesting cases are Palau, Falkland Islands and Bermuda.

Publishing

Publishing a report is as easy as pressing this button

The report is published here. It isn't a static set of graphs and figures, it is the full cube, which means all of the richness of the Power BI Desktop is available including filtering and drill-downs on each of the report elements. Clicking on a circle in the map filters all other elements on the page with that category.

However, the published report is only available to licensed users.

Summary

This is some very powerful tooling at an early stage of development. Power Query needs some work but DAX is great, so use Power Query to reduce the data down and use DAX to do the more complex stuff.

The Power BI Desktop Report Builder is primitive but getting better, and there is a release every month. It's lacking some of the things you'd expect, like colour and font size control on text, z-order control on report elements and snap-to-grid layout control, so there's some room for improvement before the tool is ready to take on the big players.

But that's just the thing: Power BI isn't a multi-million dollar BI platform, it's a desktop editing and SaaS publishing platform for BI, much like the way I use Word to create this blog which is then published to the Internet. It's another example of cloud computing at its best: a very capable tool once only available to enterprises, put in the hands of everyone. This is a disruption for overpriced software vendors and consultancies in the business intelligence space.

If you want to have a go at doing some more data analysis I’ve saved the anonymised Power BI files (with no personal information in it) here. The ‘transactions’ data set is the one used in this blog, the ‘members’ dataset has a much richer set of demographics information including sex, preferences, height, weight, dob, date joined etc. I’ll leave it to the reader (or the next blog) to find some interesting insights in there.

Using Azure Machine Learning to predict Titanic survivors

So in the last blog I looked at one of the Business Intelligence tools available in the Microsoft stack by using the Power Query M language to query data from an Internet source and present in Excel. Microsoft are making a big push into the BI space at the moment, and for good reason. BI is a great cloud workload. So now let’s take a look at one of the heavy hitters at the other end of the BI scale spectrum, Azure Machine Learning.

The list of services available in Azure is growing fast. In fact, it used to be that you could see them all on one page in the Azure Portal, but now I have 28 different icons to choose from. Many of these are traditional cloud infrastructure services like Virtual Machines, Networks and SQL databases but there are now many higher level services where the infrastructure under the hood has long since been forgotten. While bigger virtual machines and faster disks get plenty of publicity because they are easy to understand and compare between cloud vendors, it is the higher level services that are much more interesting, after all, this is what Cloud is all about, not cheaper virtual machines but services that solve business problems.

Azure Stream Analytics, Azure Machine Learning and Azure Search have all been added recently to the Azure platform and fall into the category of “don’t bother me with annoying infrastructure just answer this problem”. Recently the Microsoft Research team had an Internet sensation when they released How-Old.net which uses a trained machine learning model to guess how old the faces in an uploaded photo are. The simplicity of the problem and the user interface belie the huge amount of human knowledge and compute power that is brought together to solve it which is available to you and me today.

So what’s that science fiction sounding “Machine Learning” all about?

Machine Learning

Machine Learning is a data modelling environment where the tools and algorithms for data modelling can be used to test and retest a hypothesis and then use that model to make predictions. Which all sounds a bit like a 1st year Uni Stats lecture, and it is. Browsing around the many samples for Azure Machine Learning, most demonstrate how easy it is to use the tools but with very little depth or understanding, and without the end-to-end approach from a problem to an answer. So let's fix that.

There’s plenty of good background reading on machine learning but a good take away is to follow a well-defined method:

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Modelling
  • Evaluation
  • Refinement
  • Deployment

Business Understanding

So we need to find a problem to solve. There are a lot of problems out there to solve, and if you want to pick one, Kaggle is a great place to start. Kaggle has a number of tutorials, competitions and even commercial prize-based problems that you can solve. We are going to pick the Kaggle tutorial on Titanic survivors.

The business of dying on the Titanic was not a random, indiscriminate one. There was a selection process at play, as there were not enough lifeboats for everyone and survival in the water of the North Atlantic was very difficult. The process of getting into a lifeboat during the panic would have included a mixture of position (where you were on the ship), women and children first, families together and maybe some other class or seniority (or bribery) type effect.

Data Understanding

The data has been split into two parts: a train.csv data set, which includes the data on who survived and who didn't, and a test.csv, for which you have to guess who survived and who didn't. So let's first take a look at the data in Excel to get a better understanding. The passenger data set has the following attributes:

survival Survival (0 = No; 1 = Yes)
pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare
cabin Cabin
embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

 

First let's take a look at the data in Excel. A quick scan shows there are 891 passengers represented in this training set (there will be more in the test set also) and that the data is not fully populated; in fact the Cabin data is limited and Age has some holes in it. Also, the columns PassengerId, Name and Ticket have "high cardinality"; that is, there are so many categories that they don't form a useful way to categorise or group the data.

Now create a Pivot table and use Average of Survived as a Value in the Pivot table. With no other dimensions we can see that 38% of the passengers survived. Now we can test some of the attributes and see if they have an effect on survival rate. This process is just to help understand whether there is an interesting problem here to be solved and to test some of the assumptions about how lifeboats may have been filled.
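As a quick cross-check outside the pivot table, the same kind of split can be sketched with the Power Query tooling from the earlier posts. This is a minimal sketch, assuming train.csv is saved locally and the column headers are Survived and Sex:

let
    Source = Csv.Document(File.Contents("C:\Titanic\train.csv"), [Delimiter=",", Encoding=1252]),
    Headers = Table.PromoteHeaders(Source),
    Typed = Table.TransformColumnTypes(Headers, {{"Survived", Int64.Type}}),
    // average survival rate grouped by sex, mirroring the pivot table above
    SurvivalBySex = Table.Group(Typed, {"Sex"}, {{"SurvivalRate", each List.Average([Survived]), type number}})
in
    SurvivalBySex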

There certainly seem to be some non-random biases at play which can lift your survival rate as a passenger above the average 0.38 probability. So let's see what Azure Machine Learning can come up with.

Data Preparation

Create a new Azure Machine Learning environment if you don’t already have one. This will create a matching Azure Storage account; mine is called backtesterml.

Now upload the training data into the Azure Blob storage in a container called “titanic”. I use the AzCopy tool from the Azure Storage Tools download.

C:\Program Files (x86)\Microsoft SDKs\Azure>AzCopy /Source:C:\Users\PeterReid\Downloads /Dest:https://backtesterml.blob.core.windows.net/titanic /DestKey:yourstoragekey /Pattern:train.csv

Create a new Blank Experiment, rename it "Titanic Survivors", drop a Reader onto the design surface and point it at the blob file we uploaded to titanic/train.csv.

Add a Metadata Editor and rename the Survived column to Target. This is an indication that the model we will build is trying to predict the target value Survived

Now add a Project Columns and remove those columns that we think are not going to help with prediction or have too few values to be useful.

Then add a “Clean Missing Data” and set to “Remove entire row”. The Minimum and Maximum settings are set to remove all rows with any missing values rather than some other ratio.

Now Split the data in two. This is important because we want to create a model that guesses survival and then test it against some data that was not used to generate the model. With such a small data set, how we split the data is important. I have used a "Stratified" split on the "Target" survivor column so that we get an even representation of survivors in the training and testing data sets. Also, while in here, change the Random seed to something other than 0 so that on repeated runs we get the same breakup of passengers.

Modelling

Now it's time to revisit Statistics 101 to work out what to do with this data. There is a huge number of pre-built models that can be dropped onto the design surface; choosing the right one is a little bit art and a little bit science. Here's the science:

  • We have a classification problem, who survives and who doesn’t
  • It is a “Two-Class” problem since you can only survive or not

So start by dragging Machine Learning->Initialize Model->Classification->"Two-Class Logistic Regression" onto the design surface. This is a binary logistic model used to predict a binary response (survival) based on one or more predictor variables (age, sex etc). Connect this to a "Train Model" with the prediction column "Target" selected and connect that to one side of a "Score Model" block. Connect the other side of the Score Model to the data from the "Split" block and send the results into an "Evaluate Model". Then hit Run!

After processing (note you're using a shared service, so it's not as responsive as running something locally, but it is built for scale) it should look something like this:

Once the run has completed you can click on the little ports on the bottom of the blocks to see the data and results. We are interested in the evaluation.

Evaluation

Click on the output of the “Train Model” block which will bring up this window

This is what the output of the regression looks like. And it means something like this:

  • Having a large class value (e.g. 3rd class) has a big negative impact on survival (those on upper decks got out easier)
  • Being female has a big positive impact on survival and being male has an equal negative impact (women first as we thought)
  • Having a larger age has a reasonable negative impact on survival (children first as we thought)
  • And having a larger ticket price has a small positive impact (rich people had some preference)
    (Note Bias is an internally generated feature representing the regression fit)

OK so now we have the model built let’s see how good it is at predicting the survival rate of the other 25% of passengers we split off at the beginning.

Add another "Score Model" block, connect it to the right-hand side of the Split block and then to the "Evaluate Model". Then Run again.

Click on the Output of the “Score Model” block to get this.

Here we have all of the remaining 25% of passengers who were not submitted to generate the model. Taking the first row, this passenger is male, 3rd class, 20 years old and on a cheap fare; given what we know about the lifeboat process he probably didn't survive. Indeed, the model says the survival probability is only 0.119 and has assigned him a "Scored Label" of 0, which is a non-survivor. But the "Target" column says that in reality this guy survived, so the model didn't predict correctly. The next passenger, a female, was predicted properly, and so on.

Now go to the “Evaluate Model” output. This step has looked at all the predictions vs actuals to evaluate the model’s effectiveness at prediction.

The way to read this chart is to say, for every probability of survival, what percentage actually survived. The best possible model is represented by the green line and a totally random pick is represented by the orange line. (You could of course have a really bad model which consistently chooses false positives, which would be a line along the bottom and up the right hand side.) The way to compare these models is by looking at the "Area Under the Curve", which will be somewhere between 0 (the worst) and 1 (the best). Our model is reasonably good and is given an AUC score of 0.853.

Refinement

There are many ways to refine the model. When we first looked at the data there was something in that Age chart. While there certainly is a bias toward younger survivors, it's not a linear relationship: there is a dramatic drop-off at 16 and a slight increase in survival rate after 45. That fits with the expected lifeboat filling process, where you would be considered as either Young, Old or Other.

So let's build that into the model. To do that we will need an "Execute R Script" module. (R is a very powerful open source language with libraries built for exactly this purpose.) Drag a script editor onto the design surface, connect up the first input and enter a script that buckets Age into those Young, Other and Old categories:

Then connect the output to the Metadata Editor. Open up the Project Columns block and remove Age and add the new AgeGroup category. Then hit Run.

So let's use the power of infinite cloud computing and try the brute force method first. Add all of the Two-Class classifiers onto the design surface and connect them up to Score and Evaluate blocks, then Run.

Amazingly, the whole thing runs in less than 2 minutes. I have done two runs, one using Age and one using AgeGroup, to see which models are sensitive to the categorisation. What is interesting here is that despite some very different approaches to modelling the data, the results are remarkably similar.

Model                                           AUC (Age)   AUC (AgeGroup)
Two-Class Locally Deep Support Vector Machine   0.845       0.86
Two-Class Neural Network                        0.826       0.883
Two-Class Support Vector Machine                0.871       0.843
Two-Class Averaged Perceptron                   0.864       0.869
Two-Class Decision Jungle                       0.844       0.85
Two-Class Logistic Regression                   0.861       0.858
Two-Class Bayes Point Machine                   0.856       0.85
Two-Class Boosted Decision Tree                 0.834       0.836
Two-Class Decision Forest                       0.827       0.815

Some of the models benefitted from the AgeGroup category and some didn’t, but the standout success is the Neural network with AgeGroup. Now it’s time to play with the Neural network parameters, iterations, nodes, parameter sweeps etc to get an optimal outcome.

Deployment

Now back to the task at hand: we started with a train.csv set of data and a test.csv data set. The aim is to determine survivors in that test.csv data set. So now "Create Scoring Experiment" and "Publish Web Service". This will make a few modifications to your design surface, including saving the "Trained Model" and plumbing it in with a web service input and output. I've added an extra "Project Columns" so only the single column "Scored Label" is returned, to indicate whether the passenger is a survivor or not. It will look like this:

Then “Publish the Web Service” which will create a web service endpoint that can be called with the input parameters of the original dataset

Now, by downloading that Excel spreadsheet and pasting in the data from the test.csv, each row generates a request to our newly created web service and populates the survival prediction value. Some of the rows in the test data set have an empty Age, which we will just set to the median age of 30 (there is significant room for improvement in the model by taking this into account properly). The beta version of the spreadsheet doesn't support spaces or delimiters, so there is a little bit of extra data cleansing in Excel before submitting to the web service (remove quotes).

TitanicPredictions

Here are the first few rows of the scored test data. None of these passengers was present in the train data, which stopped at passenger 891. So the "Predicted Values" column is the survival flag assigned by our trained model. As a sanity check over the results, the survival rate is about 27%: similar to, but slightly less than, the training data. Passenger 892 looks like a reasonable candidate for non-survival and Passenger 893 looks like a candidate for survival.

So then prepare the data and submit to the Kaggle competition for evaluation. This model is currently sitting at position 2661 out of 17390 entries! (there are clearly a few cheaters in there with the source data available)

The process of developing this model end-to-end shows how the power of Cloud can be put to use to solve real world problems. And there are plenty of such problems. The Kaggle web site not only has fun competitions like the Titanic survivor problem, but is also a crowd sourcing site for resolving some much more difficult problems for commercial gain. Currently there is a $100K bounty if you can “Identify signs of diabetic retinopathy in eye images”.

While these sorts of problems used to be an area open to only the most well-funded institutions, Azure Machine Learning opens the opportunity to solve the world’s problems to anyone with a bit of time on a rainy Sunday afternoon!

 

Use Excel, PowerQuery and Yahoo Finance to manage your Portfolio

There are some new and powerful Business Intelligence options available to all of us, from scalable cloud platforms to Excel on the desktop. So it's a good time to demonstrate some of them and show where BI and Application Development meet.

Power Query and the language M have been quietly making their way into the Microsoft Business Intelligence stack, and in particular Excel, for some time. The set of "Power" components (Power Query, Power Pivot and Power BI) all sound a bit like the marketing guys won the naming battle again and don't really help explain the technology nor where it fits, so I avoided it for a long time. That is, until this latest version of Office where you can't really avoid it. So it came time to take Power Query and M by the scruff of the neck and learn them.

First let’s put these new “Power” technologies in their relative places:

  • M is a new language that takes on the E and T of ETL (Extract, Transform and Load). It is a scripting language which looks like an odd mix of VBScript, PowerShell and "R". It has constructs for importing and manipulating Tables, Lists and Records, including the ability to do joining and grouping.
  • Power Query is a scripting environment, an add-on to Excel adding GUI support for creating and editing M scripts. It hides the M language away from the punter, but you'll quickly find the need to drop into the M script editor for all but the most basic scripting chores.
  • Power Pivot is a multidimensional analysis engine without the need for the full MSSQL OLAP Analysis Services. PowerPivot is available as a plug-in for Excel but executes over what is called the Data Model, an in-memory database capable of much larger datasets than Excel.
  • Power BI Designer is a standalone BI development environment which includes Power Query and PowerPivot. It's for the power business user to develop reports from data without the need to drop into or distribute Excel.

So I set myself a goal to create a spreadsheet that auto updates with stock prices.

First get a version of Excel (2010 or above) which supports Power Query and download the Power Query installer (it is already embedded in the later versions of Excel but may need an update). After installing you should get a set of new buttons like this. The New Query button replaces the previous Get External Data features of Excel and is the entry point for loading data into the Power Query environment.

Yahoo Finance

Yahoo Finance have kindly exposed a number of APIs for requesting stock price information (if used for non-commercial purposes), current and historical, for just about every instrument you can think of. These APIs are simple web queries but predate REST API conventions, so they don't follow nice RESTful best practice. The two services we'll use are:

Historical: http://ichart.finance.yahoo.com/table.csv?s=CPU.AX&a=00&b=4&c=2014&d=07&e=01&f=2015&g=d&ignore=.csv
This will return historical prices between two dates.

Quote: http://download.finance.yahoo.com/d/quotes.csv?s=CPU.AX&f=nsl1op.
This will return the latest quote for the stock CPU.AX with a number of fields identified by the f parameter.

Both APIs are very well documented (counter intuitively) at Google Code

Pulling this data into Power Pivot is really easy, and from there it's even easier to post-process and report.

Historical Query

So let’s import one of these services. The New Query button has a huge number of different data sources. For this purpose, we’ll use simply “Web” and import the historical data service by pasting in the above URL. This will query the service, display the data and launch the Power Query editor.

Here's our first glimpse of the new data staging area. There are a whole lot of buttons here for transforming and massaging the returned data, and everything done with the buttons in the GUI becomes an M statement in an M query, which can be a great way to learn the language. And there is a data grid (not Excel; this one is part of Power Query) to show the query results.

Try hitting the button “Advanced Editor”.

let
    Source = Csv.Document(Web.Contents("http://ichart.finance.yahoo.com/table.csv?s=CPU.AX&a=00&b=4&c=2014&d=07&e=01&f=2015&g=d&ignore=.csv"),[Delimiter=",",Encoding=1252]),
    #"First Row as Header" = Table.PromoteHeaders(Source),
    #"Changed Type" = Table.TransformColumnTypes(#"First Row as Header",{{"Date", type date}, {"Open", type number}, {"High", type number}, {"Low", type number}, {"Close", type number}, {"Volume", Int64.Type}, {"Adj Close", type number}})
in
    #"Changed Type"

Ah, now we get our first glimpse of M. The Web import wizard created a set of M code that queried the URL, determined the type of data returned (csv), determined the first row is headers and had a guess at the column types of the data and returned that data as a Table.

Breaking down the code:

  • A Query starts with (a very BASIC looking) "let"
  • A set of case-sensitive statements (or Steps) that are separated by commas
  • Variables look like #"variablename"; they are created by assignment and adopt the returned type (much like PowerShell, or a var in C#)
  • M is optimised for speed and therefore uses short circuit evaluation; that is, a statement line won't execute if it isn't required for the end value.
  • A Query ends with an "in" statement to represent the data returned from the query.

Load

Now choose Close and Load To…. This is the point where Power Query tips into Excel, or whatever other tool you may be using to display data. The choice here is either to load the query into an Excel Table or the PowerPivot Data Model, or simply to save the code as a stored query "Connection" for later use.

Choose Table and you should end up with something like this

Function

Now, wouldn't it be good if we could parameterise that query so it could be used for other stocks and dates? We can upgrade the original query to be a function. Open the Advanced Editor again on the query and overwrite it with this:

let
     historical= (symbol as text, days as number) =>
let
    #"toDate" = DateTime.LocalNow(),
    #"fromDate" = Date.AddDays(#"toDate", 0-days),
    #"fromDateString" = "&a=" & Text.From(Date.Month(fromDate)-1) & "&b=" & Text.From(Date.Day(fromDate)) & "&c=" & Text.From(Date.Year(fromDate)),
    #"toDateString" = "&d=" & Text.From(Date.Month(toDate)-1) & "&e=" & Text.From(Date.Day(toDate)) & "&f=" & Text.From(Date.Year(toDate)),
    Source = Csv.Document(Web.Contents("http://ichart.finance.yahoo.com/table.csv?s=" & symbol & #"fromDateString" & #"toDateString" & "&ignore=.csv"),[Delimiter=",",Encoding=1252]),
    #"SourceHeaders" = Table.PromoteHeaders(Source),
    #"Typed" = Table.TransformColumnTypes(#"SourceHeaders",{{"Date", type date}, {"Open", type number}, {"High", type number}, {"Low", type number}, {"Close", type number}, {"Volume", Int64.Type}, {"Adj Close", type number}})
in
    #"Typed"
in
    historical

First thing to note is that the code structure now has a query within a query. The outer query defines a function which injects some variables into the inner query. We've parameterised the stock symbol and the number of days of history. We also get to use some of the useful M library functions, as defined in the Power Query Formula Library Specification, to manipulate the dates.
Now rename the new “function” as fHistorical and choose Home=>Close and Load.

fHistorical

At this point it’s easy to test the function using the Invoke command which will prompt for the parameters. To execute it from Excel we’ll provide the data from Excel cells. In Excel use Insert=>Table to insert a table with two columns named Symbol and Days and name the table “HistoricalParameters”. Note we have just created an Excel Table which is different to a Power Query Table, although you can use a Power Query Table to load data into and read data from an Excel Table (confused yet?).

Now it’s time for some more M. Move the Excel cursor down a few Cells to a clear area on the Worksheet. Then New Query=>Other Sources=>Blank Query and paste in this code.

let
    Source = Excel.CurrentWorkbook(){[Name="HistoricalParameters"]}[Content],
    #"Symbol"=Source{0}[Symbol],
    #"Days"=Source{0}[Days],
    #"Historical" = fHistorical(#"Symbol",#"Days")
in
    #"Historical"

This will take the HistoricalParameters table we created in Excel, pull out the data from the Symbol and Days columns, and use that to call our fHistorical function. Close and Load that query into an Excel table called Historical. I put both the Historical and the HistoricalParameters tables on the same worksheet so you can update the Symbol and the Days and see the result. It looks like this. To get the query to run with new data either hit the Data Refresh button in Excel or right-click and Refresh on the Historical table.

Note that once you have data in an Excel Table like this, it's easy to add additional calculated columns into the table which auto-update when the query does.

Formula Firewall

One of the areas that can be confusing is the concept of Privacy Levels. Power Query is built with sharing of queries in mind. Users can share queries with others but a Data Source must be declared with its privacy level to ensure that data from secure sources aren’t combined with insecure sources. This is done via the very fancy sounding Formula Firewall.

You can control the privacy level of every data source, but when developing a query the restrictions imposed by the firewall can be confusing and can lead to the dreaded error message:

"Formula.Firewall: Query 'QueryName' (step 'StepName') references other queries or steps and so may not directly access a data source. Please rebuild this data combination."

So for now I’m going to turn off the Formula Firewall.

Use the Excel Query top level menu and click the Edit button. This will open the Power Query Editor window. Now go to File=>Options and Settings=>Query Options and click on Privacy. Set the Ignore Privacy Levels option.

Portfolio Query

OK, now onto something more complex: a portfolio. The Quote Yahoo URL takes two parameters, s and f. "s" is a comma-delimited list of stock symbols and "f" is a list of fields to return. There are actually 90 fields which can be returned for a stock by concatenating the list of alphanumeric codes for the fields. What would be good is to create a query that can request a number of stocks for a user-defined set of columns. The set of codes is well described in this page, so let's take that data, add it into an Excel worksheet and make it into an Excel Table by highlighting the cells and selecting Insert=>Table, then rename the table "FieldLookup". Now rename the columns and add a "Type" and a "Display" column, so you end up with something like this.

The "Display" column is used to determine which fields are to be queried and displayed. The "Type" column is used to encourage Excel to use the right column type when displaying (everything will still work using the default Excel column type, but Numeric and Date columns won't sort or filter properly).

Now, in a new Excel sheet, add a table called Portfolio with a single column, Symbol, and add some stock code symbols. It is this table which will be filled with data for the required stocks using all of the required display fields.

Now to create a new Query

let
    PortfolioSource = Excel.CurrentWorkbook(){[Name="Portfolio"]}[Content],
    #"sParam" = Text.Combine(Table.ToList(Table.SelectColumns(PortfolioSource, "Symbol")),","),

    #"Fields" = Excel.CurrentWorkbook(){[Name="FieldLookup"]}[Content],
    #"DisplayFields" = Table.SelectRows(#"Fields",each [Display]=true),
    #"fParam" = Text.Combine(Table.ToList(Table.SelectColumns(#"DisplayFields", "f")),""),

    #"DisplayColumns" = Table.ToList(Table.SelectColumns(#"DisplayFields", "Name")),
    #"TypeNames" = Table.SelectRows(Table.SelectColumns(#"DisplayFields", {"Name","Type"}), each [Type]="number"),
    #"ColumnTypes" = Table.AddColumn( #"TypeNames", "AsType", each type number),
    #"ColumnTypesList" = Table.ToRows( Table.SelectColumns(#"ColumnTypes",{"Name","AsType"})),

    #"YahooSource"= Csv.Document(Web.Contents("http://download.finance.yahoo.com/d/quotes.csv?s=" & #"sParam" & "&f=" & #"fParam"), #"DisplayColumns"),
    #"TypedYahooSource" = Table.TransformColumnTypes(#"YahooSource",  #"ColumnTypesList")
in
   #"TypedYahooSource"

Here’s what is happening

  • PortfolioSource: Get the Portfolio table and take just the Symbol column to create a single-column Table
  • #"sParam": Now combine all the rows of that table, separated by commas. This will be the "s" parameter of the Yahoo Quotes request
  • #"Fields": Now get the table content from the FieldLookup table
  • #"DisplayFields": Select just the rows that have a Display column value of true. Here we see the first use of an inline function, executed for every row using the "each" keyword (a bit like a C# Linq query).
  • #"fParam": Combine all the rows of #"DisplayFields" with no separator to create the "f" parameter.
  • #"DisplayColumns": Choose the Name column from the table to be used later
  • #"TypeNames": Choose the Name and Type columns where the type is a number
  • #"ColumnTypes": Add a new column of type number. This is the first time we see the type system of M; there are all the simple types plus some additional ones for Record, List, Table etc.
  • #"ColumnTypesList": Create a list of lists where each list item is a column name and the number type
  • #"YahooSource": Now make the query to Yahoo Finance with our pre-prepared "s" and "f" parameters
  • #"TypedYahooSource": Run through the returned table and change the column types to number for all required number columns

Now Close and Load the Query. This will generate a query like this.

http://download.finance.yahoo.com/d/quotes.csv?s=BKN.AX,MRM.AX,CAB.AX,ARI.AX,SEK.AX,NEC.AX&f=m4m3c1m5m7p2er1j4e7ql1nt8r5rr2p6r6r7p1p5s6s7st7yvkj

And a Portfolio table like this

Every field that has "true" in the Display column of the FieldLookup table will get a column of data returned. To add and remove stocks from the portfolio, just add, remove or edit them right there in the Symbol column of the Portfolio table. Hit Refresh and the data is updated!

Here's the final spreadsheet: YahooFinanceStocks

Feeling my way around PowerQuery and M, I'm left with a nagging feeling. Do we need another language? And another development environment, without debugging, IntelliSense or syntax highlighting? Definitely not!

M is reasonably powerful but seems to have snuck into the Microsoft stack before what I call the "open era"; that is, Microsoft's new model of embracing open source rather than competing against it. Which brings me to R. R can do all of that and more, and there is much more community backing. R is now a language appearing across interesting new Microsoft products like Azure Machine Learning and even inside SQL Server.

So I’ll crack that nut next time.

Get Started with Docker on Azure

siliconvalve

The most important part of this whole post is that you need to know that the whale in the Docker logo is officially named “Moby Dock“. Once you know that you can probably bluff your way through at least an introductory session on Docker :).

It’s been hard to miss the increasing presence of Docker, particularly if you work in cloud technology. Each of the major cloud providers has raced to provide container services (Azure, AWS, GCE and IBM) and these platforms see benefits in the higher density hosting they can achieve with minimal changes to existing infrastructure.

In this post I'm going to look at first steps to getting Docker running in Azure. There are other posts about that cover this, but there are a few gotchas along the way that I will cover off here.

First You Need a Beard

Anyone…


Do It Yourself Cloud Accelerator – Part III Scaling Out

There's recently been some interest in the space of accelerating Office 365 SharePoint Online traffic for organisations, and for good reason. All it takes is a CEO to send out an email to All Staff with a link to a movie hosted in SharePoint Online to create some interest in better ways to serve content internally. There are commercial solutions to this problem, but they are, well… commercial (read expensive). Now that the basic functionality has been proven using existing Windows Server components, what would it take to put it into production? A quick refresh: in the last post I demonstrated a cloud routing layer that took control of the traffic and added compression and local caching, both at the router and also at the end point, by utilising BranchCache to share content blocks between users. That's a pretty powerful service and can provide a significantly better browsing experience for sites with restricted or challenged internet connections. The router can be deployed either in a branch office, within the corporate LAN or as an externally hosted service.

Deployment

The various deployment options offer trade-offs between latency and bandwidth savings which are important to understand. The Cloud Accelerator actually provides four services:

  • SSL channel decryption, and re-encryption using a new certificate
  • Compression of uncompressed content
  • Generation of BranchCache content hashes
  • Caching and delivery of cached content

For the best price performance consider the following

  • Breaking the SSL channel, compression and generating BranchCache hashes are all CPU intensive workloads, best done on cheap CPUs and on elastic hardware.
  • Delivering data from cache is best done as close to the end user as possible

With these considerations in mind we end up with an architecture where the Cloud Accelerator roles are split across tiers in some cases. First an elastic tier runs all the hard CPU grunt work during business hours and scales back to near nothing when not required. Perfect for cloud. This, with BranchCache enabled on the client is enough for a small office. In a medium sized office or where the workforce is transient (such as kiosk based users) it makes sense to deploy a BranchCache Server in the branch office to provide a permanent source of BranchCache content enabling users to come and go from the office. In a large office it makes sense to deploy the whole Cloud Accelerator inline again with the BranchCache server role deployed on it. This node will re-break the SSL channel again and provide a local in-office cache for secure cacheable content, perfect for those times when the CEO sends out a movie link!

Scaling Out

The Cloud Accelerator I originally used is operating on a single server in Azure Infrastructure as a Service (IaaS). The process of generating BranchCache hashes on the server as the content passes through is CPU intensive and will require more processors to cope with scale. We could just deploy a bigger server, but it's a more economical solution (and a good opportunity) to use the Auto Scaling features of the Azure IaaS offering. Creating a multi-server farm requires a couple of other configuration changes to ensure the nodes work together effectively.

Application Request Routing Farm

One thing that's obvious when working with ARR is that it's built to scale. In fact it is running underneath some very significant content delivery networks and is integral to such services as Azure Websites. Configuration settings in ARR reveal that it is designed to operate both as part of a horizontally scaled farm and also as a tier in a multi-layered caching network. We'll be using some of those features here.

First, the file based request cache that ARR uses is stored on the local machine. Ideally, in a load balanced farm, requests that have been cached on one server could be used by other servers in the farm. ARR supports this using a "Secondary" cache. The Primary cache is fast local disk, but all writes also go asynchronously to the Secondary cache, and if content is not found on the Primary then the Secondary is checked before going to the content source. To support this feature we need to attach a shared disk that can be used across all farm nodes. There are two ways to do this in Azure. One way is to attach a disk to one server and share it to the other servers in the farm via the SMB protocol. The problem with this strategy is that the server acting as the file server becomes a single point of failure for the farm, which requires some fancy configuration to get peers to cooperate by hunting out SMB shares and creating one if none can be found. That's an awful lot of effort for something as simple as a file share. A much better option is to let Azure handle all that complexity by using the new Azure Files feature (still in Preview at time of writing).

Azure Files

After signing up for Azure Files, new storage accounts created in a subscription carry an extra ".file.core.windows.net" endpoint. Create a new storage account in an affinity group. The affinity group is guidance to the Azure deployment fabric to keep everything close together. We will use this later to put our virtual machines in to keep performance optimal.

New-AzureAffinityGroup -Name o365accel -Location "Southeast Asia"
New-AzureStorageAccount -AffinityGroup o365accel -StorageAccountName o365accel -Description "o365accel cache" -Label "o365accel"

Download the Azure Files bits from here, taking care to follow the instructions and Unblock the PowerShell zip before extracting it, and then execute the following:

# import module and create a context for account and key
import-module .\AzureStorageFile.psd1
$ctx = New-AzureStorageContext <storageaccountname> <storageaccountkey>
# create a new share
$s = New-AzureStorageShare "Cache" -Context $ctx

This script tells Azure to create a new fileshare called "Cache" over my nominated storage account. Now we have a file share running over the storage account that can be accessed by any server in the farm, and they can freely and safely read and write files to it without fear of corruption between competing writers. To use the fileshare one further step is required: attaching it to Windows, and there is a bit of a trick here. Attaching mapped drive fileshares is a user-based activity rather than a system-based one, so we need to map the drive and set the AppPool to run as that user.

  • Change the Identity of the DefaultAppPool to run as Network Service

 

  • Create a folder called C:\Batch and grant Write access for Network Service account (this is so the batch file can out put success/failure results)
  • Create a batch file C:\Batch\MapDrive.bat which contains:
    cmdkey /add:o365accelcache.file.core.windows.net /user:<storageaccountname> /pass:<storageaccountkey>
  • Now set up a Scheduled Task, running in the C:\Batch directory as "Network Service" at system start-up, to run the MapDrive.bat command. This will add the credentials for accessing the file share.

 

MapDrive

  • Now we just need to reconfigure the ARR to use this new location as a secondary cache. Start IIS and Add a Secondary drive like this:

  • And reboot to kick off the scheduled task and restart the AppPool as the Network Service account.

BranchCache Key

A BranchCache Content Server operates very well as a single server deployment, but when deployed in a farm the nodes need to agree on and share the key used to hash and encrypt the blocks sent to clients. If reinstalling the servers independently this would require exporting (Export-BCSecretKey) and importing (Import-BCSecretKey) the shared hashing key across the farm. However, in this case it's not necessary because we are going to make the keys the same by another method: cloning all the servers from a single server template.

Preparation

In preparation for creating a farm of servers the certificates used to encrypt the SSL traffic will need to be available for reinstalling into each server after the cloning process. The easiest way to do this is to get those certificates onto the server now and then we can install and delete them as server instances are created and deployed into the farm.

  • Create a folder C:\Certs and copy the certs into that folder

We’ll provision a set of machines from the base image and configure them using remote PowerShell. To enable remote PowerShell log onto our server and at an elevated PowerShell window (Run As Administrator) execute the following:

  • Enable-PSRemoting -Force

Cloning

Azure supports the cloning of existing machines through “sysprep”ing and capturing the running machine that has been preconfigured.

  • Remote Desktop connect to the Azure virtual server and run C:\Windows\System32\Sysprep\sysprep.exe

This will shut down the machine and save the server in a generic state so new replica servers can be created in its form. Now click on the server and choose Capture which will delete the virtual machine and take that snapshot.

The image will appear in your very own Virtual Machine gallery, right alongside the other base machines and pre-built servers. Now create a new Cloud Service in the management console or PowerShell, making sure to choose the new Affinity Group for it to sit within. This will keep our servers and the fileshare deployed close together for performance.

New-AzureService -AffinityGroup o365accel -ServiceName o365accel

Now we can create a farm of O365 Accelerators which will scale up and down as load on the service does. This is a much better option than trying to guess the right-sized hardware appliance from a third-party vendor.

Now we could just go through this wizard a few times and create a farm of O365Accelerator servers, but this repetitive task is a job better done by scripting using Azure Powershell cmdlets. The following script does just that. It will create n machine instances and also configure those machines by reinstalling the SSL certificates through a Remote Powershell session. https://gist.github.com/petermreid/816e6a6e4aced5f6d0e8

After running this script multiple server instances will be deployed into the farm all serving content from a shared farm wide cache.

As a final configuration step, we use the Auto Scale feature in Azure to scale back the number of instances until required. The following setup uses the CPU utilisation on the servers to adjust the number of servers up and down, from an idle state of 1 to a maximum of 4, thereby giving a very cost effective cloud accelerator and caching service which automatically adjusts with usage. This configuration says: every 5 minutes have a look at the CPU load across the farm; if it's greater than 75% add another server, if it's less than 50% remove a server.

Try doing that with a hardware appliance! Next we’ll load up the service and show how it operates under load and delivers a better end user experience.

Do It Yourself Cloud Accelerator – Part II BranchCache

In the last post I introduced the idea of breaking the secure transport layer between cloud provider and employee with the intention to better deliver those services to employees using company provided infrastructure.

In short we deployed a server which re-presents the cloud secure urls using a new trusted certificate. This enables us to do some interesting things like provide centralised and shared caching across multiple users. The Application Request Routing (ARR) module is designed for delivering massively scalable content delivery networks to the Internet which when turned on its head can be used to deliver cloud service content efficiently to internal employees. So that’s a great solution where we have cacheable content like images, javascript, css, etc. But can we do any better?

Yes we can, and it's all possible because we now own the traffic and the servers delivering it. To test the theory I'll be using a SharePoint Online home page, which by itself is 140K; with all resources uncached the total page size is a whopping 1046K.

Compression

Surprisingly, when you look at a Fiddler trace of a SharePoint Online page, the main page content coming from the SharePoint servers is not compressed (the static content, however, is) and it is also marked as not cacheable (since it can change with each request). That means a large page download occurs for every page view, which is particularly expensive if (as many organisations do) you set the Intranet home page as the browser's default start page.

Since we are using Windows Server IIS to host the Application Request Router we get to take a free ride on some of the other modules that have been built for IIS, like, for instance, compression. There are two types of compression available in IIS: static compression, which pre-calculates the compressed output of static files, and dynamic compression, which compresses the output of dynamically generated pages on the fly. Dynamic compression is the module we need to compress the home page on the way through our router.

Install the Dynamic Compression component of the Web Server(IIS) role

Configuring compression is simple: make sure Dynamic Compression is enabled at the IIS server level and also at the Default Web Site level.
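The same settings can be scripted; a minimal sketch using the WebAdministration module (standard IIS feature and property names, but treat it as a starting point rather than a definitive recipe):

Import-Module WebAdministration

# Install the Dynamic Content Compression module of the Web Server (IIS) role
Install-WindowsFeature Web-Dyn-Compression

# Enable dynamic compression at the server level...
Set-WebConfigurationProperty -PSPath 'MACHINE/WEBROOT/APPHOST' -Filter 'system.webServer/urlCompression' -Name 'doDynamicCompression' -Value $true

# ...and for the Default Web Site
Set-WebConfigurationProperty -PSPath 'MACHINE/WEBROOT/APPHOST' -Location 'Default Web Site' -Filter 'system.webServer/urlCompression' -Name 'doDynamicCompression' -Value $true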

By enabling dynamic compression we are allowing the Cloud Accelerator to step in between server and client and inject gzip encoding on anything that isn’t already compressed. On our example home page the effect is to reduce the download content size from a whopping 142K down to 34K

We’ve added compression to uncompressed traffic which will help the experience for individuals on the end of low bandwidth links, but is there anything we can do to help the office workers?

BranchCache

BranchCache is a Windows Server role and Windows service that has been around since Server 2008/Win 7 and despite being enormously powerful has largely slipped under the radar. BranchCache is a hosted or peer to peer file block sharing technology much like you might find behind torrent style file sharing networks. Yup that’s right, if you wanted to, you could build a huge file sharing network using out of the box Windows technology! But it can be used for good too.

BranchCache operates deep under the covers of the Windows operating system when communicating using one of the BranchCache-enabled protocols: HTTP, SMB (file access), or BITS (Background Intelligent Transfer Service). When a user on a BranchCache-enabled device accesses files on a BranchCache-enabled file server, or accesses web content on a BranchCache-enabled web server, the hooks in the HTTP.SYS and SMB stacks kick in before transferring all the content from the server.

HTTP BranchCache

So how does it work with HTTP?

When a request is made from a BranchCache enabled client there is an extra header in the request Accept-Encoding: peerdist which signifies that this client not only accepts normal html responses but also accepts another form of response, content hashes.

If the server has the BranchCache feature enabled it may respond with Content-Encoding: peerdist along with a set of hashes instead of the actual content. Here’s what a BranchCache response looks like:


Note that if there was no BranchCache operating at the server a full response of 89510 bytes of javascript would have been returned by the server. Instead a response of just 308 bytes was returned which contains just a set of hashes. These hashes point to content that can then be requested from a local BranchCache or even broadcast out on the local subnet to see if any other BranchCache enabled clients or cache host servers have the actual content which corresponds to those hashes. If the content has been previously requested by one of the other BranchCache enabled clients in the office then the data is retrieved immediately, otherwise an additional request is made to the server (with MissingDataRequest=true) for the data. Note that this means some users will experience two requests and therefore slower response time until the distributed cache is primed with data.

It’s important at this point to understand the distinction between the BranchCache and the normal HTTP caching that operates under the browser. The browser cache will cache whole HTTP objects where possible as indicated by cache headers returned by the server. The BranchCache will operate regardless of HTTP cache-control headers and operates on a block level caching parts of files rather than whole files. That means you’ll get caching across multiple versions of files that have changed incrementally.

BranchCache Client Configuration

First up note that BranchCache client functionality is not available on all Windows versions. BranchCache functionality is only available in the Enterprise and Ultimate editions of Windows 7/8 which may restrict some potential usage.

There are a number of ways to configure BranchCache on the client, including Group Policy and netsh commands, however the easiest is to use PowerShell. Launch an elevated PowerShell command window and execute one of the BranchCache cmdlets:

  • Enable-BCLocal: Sets up this client as a standalone BranchCache client; that is it will look in its own local cache for content which matches the hashes indicated by the server.
  • Enable-BCDistributed: Sets up this client to broadcast out to the local network looking for other potential Distributed BranchCache clients.
  • Enable-BCHostedClient: Sets up this client to look at a particular static server nominated to host the BranchCache cache.

While you can use a local cache, the real benefits come from distributed and hosted mode, where the browsing actions of a single employee can benefit the whole office. For instance, if Employee A and Employee B are sitting in the same office and both browse to the same site, most of the content for Employee B will be retrieved directly from Employee A's laptop rather than re-downloaded from the server. That's really powerful, particularly where there are bandwidth constraints in the office and common sites used by all employees. But it requires that the web server serving the content participates in the BranchCache protocol by installing the BranchCache feature.
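For example, to opt an office client into distributed mode and check the result (an illustrative sketch):

# Run in an elevated PowerShell window on each client
Enable-BCDistributed

# Confirm the client configuration and inspect the local cache
Get-BCStatus
Get-BCDataCache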

HTTP BranchCache on the Server

One of the things you lose when moving to the SharePoint Online service from an on-premises server is the ability to install components and features on the server, including BranchCache. However, by routing requests via our Cloud Accelerator that capability is available again, simply by installing the Windows Server BranchCache feature.
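On the Cloud Accelerator server that is essentially a one-liner (a sketch; PeerDistSvc is the BranchCache service):

# Install the BranchCache feature and make sure the service is running
Install-WindowsFeature BranchCache
Start-Service PeerDistSvc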

With the BranchCache feature installed, the Cloud Accelerator immediately turns the SharePoint Online service into a BranchCache-enabled service, so the size of the body items downloaded to the browser goes from this:

To this:

There are some restrictions and configuration options. First, you won't normally see any peerdist hash responses for a content body smaller than 64KB. You'll also need a latency of about 70mS between client and server before BranchCache bothers stepping in. You can change these parameters, but it's not obvious from the public APIs. The settings are stored under this registry key (HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Services\PeerDistKM\Parameters) and are picked up the next time you start the BranchCache service on the server. Changing these parameters can have a big effect on performance, depending on the exact bandwidth and latency environment the clients are operating in. In the above example I changed MinContentLength from the default 64K (which would miss most of the content from SharePoint) to 4K. The effect on bandwidth is quite dramatic, but it will penalise those on a high-latency link due to the multiple requests for many small pieces of data not already available in cache peers.
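A sketch of that tweak, assuming the value is expressed in bytes and using the registry path above:

# Lower the minimum content size from 64KB to 4KB, then restart the BranchCache service
Set-ItemProperty -Path 'HKLM:\SYSTEM\ControlSet001\Services\PeerDistKM\Parameters' -Name 'MinContentLength' -Value 4096 -Type DWord
Restart-Service PeerDistSvc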

The following chart shows the effect of our Cloud Accelerator on the SharePoint Online home page for two employees in a single office. Employee A browses to the site first, then Employee B on another BranchCache enabled client browses to the same page.

Where:

  • Office 365: Out of the box raw service
  • Caching: With caching headers modified by our Cloud Accelerator
  • Compression: With compression added to dynamic content (like the home page)
  • BranchCache 64K: With BranchCache enabled for >64K data
  • BranchCache 4K: With BranchCache enabled for >4K data

So while adopting a cloud based service is often a cost effective solution for businesses, if the result negatively impacts users and user experience then it's unlikely to gather acceptance and may actually be avoided in favour of old on-premises habits like local file shares and USB drives. The Cloud Accelerator gives us back ownership of the traffic and the ability to implement powerful features to bring content closer to users. Next post we'll show how the accelerator can be scaled out ready for production.

Do It Yourself Cloud Accelerator

This intro is unashamedly lifted from a Microsoft article but I couldn't say it any better: "The cloud has enormous potential to reduce operational expenses and achieve new levels of scale, but moving workloads away from the people who depend on them can increase networking costs and hurt productivity. Users expect high performance and don't care where their applications and data are hosted." Cloud is a journey; getting there takes more than just migrating your workloads. In most organisations, architecture, infrastructure and networking have grown around the assumption that business systems are internal and the internet is external. Big WAN connections to offices and well-protected interfaces to the Internet are the default networking solution to that situation.

Who moved my Cheese?

So what happens when business systems move to the Internet? The need for branch offices to communicate with head office diminishes, but the need to communicate efficiently with the Internet becomes more important. The paradox of this situation is that employees are likely to be much better served by their own internet connection when accessing cloud based business services than by the filtered, sanitised and high-latency connection provided by work. Cloud nirvana is reached when there are no internal business systems, and WAN connections are replaced by managed devices and independent internet connections attached to employees. In fact a true end state for cloud and BYOD may be that device and connectivity are deferred to the employee rather than provided as a service by the company.

Do It Yourself Cloud Accelerator

3 Stages of Cloud Enlightenment

So with that end state in mind it's worth considering the situation most businesses find themselves in today: "Cloud services are upon us but we have a classic enterprise network". They may have migrated to Office 365 for Intranet and Email but still have a big centralised internet connection, a heavy dependency on WAN links, and treat the whole Internet, including the newly migrated Email and SharePoint services, as one big untrusted Internet. It is an interesting problem. Cloud services (like Office 365) relieve the organisation of the concerns of delivering the service, but that is matched by a loss of control and ownership which affects the way the service can be delivered to internal employees. On using the cloud service, a secure tunnel is set up between the cloud service provider and the employee, which disintermediates the employer who is paying the bills! The cloud service provider owns the IP addresses, the URL and the security certificate used to encrypt the traffic, which is a significant change from the highly flexible on-premises model where the organisation controlled everything. The underlying problem here is that the SSL used to encrypt traffic between service provider and a single user is not flexible enough to support the new three-way trust relationship that occurs when an organisation uses a cloud provider to deliver services to employees.

Take back the streets

All is not lost. Just because you’ve accepted cloud into your life doesn’t mean you have to totally submit to it. Cloud service providers do a great job using caching services such as Akamai to drop content right at the organisational door step, but the ability to efficiently transport that data through internal networks to the employee is now harder. In the following blog we’ll look at a strategy for taking back ownership of the traffic and deploying an efficient and cheap internal CDN network to serve content to branch offices with no direct internet connection. While the example is centered on Office 365 SharePoint Online, it is equally applicable to any cloud based service. There are commercial solutions to the problem by vendors like Riverbed and CISCO who will, by dropping an expensive device on your network, “accelerate” your Office 365 experience for internal employees. But can you do that with out-of-the-box software and commodity hardware? Yes you can. To build our own Cloud Accelerator we need the following:

  • 1 Windows Server (can be deployed internally or externally)
  • Access to makecert.exe through one of the Windows development kits
  • 1 Office 365 SharePoint Online Service (tenant.sharepoint.com) and an account to test with

Application Publisher

Using the Fiddler tool while browsing your cloud application will reveal where the content is actually coming from. For our SharePoint Online site it looks something like this: https://gist.github.com/petermreid/8d2567b4703daff35ae2 Note that the content is actually coming from a mixture of the SharePoint Online server and a set of CDN URLs (cdn.sharepointonline.com and prod.msocdn.com). That is because Microsoft makes heavy use of caching services like Akamai to offload much of the content serving and deliver it as close to the organisational internet links as possible. The other thing to notice is the size. The total size of the home page of our Intranet (without any caching) is a whopping 1MB, which may be a significant burden on internet links if not cached adequately across users.

To do that caching we need to step into the middle, see the traffic and work out what can be cached on intermediate servers. Microsoft have a trusted certificate authority and can create secure certificates for *.sharepoint.com and *.sharepointonline.com, which enables them to encrypt your tenant's Office 365 traffic. When you browse to Office 365 SharePoint Online an encrypted tunnel is set up between the browser and Microsoft, but we can take back ownership of the traffic if we break the secure tunnel and republish the traffic using internal application publication.

Do It Yourself Cloud Accelerator App Publish

Application Publication

To do this we need to create our own certificate authority, deliver our own tenant.sharepoint.com and cdn.sharepointonline.com certificates, and tell our users to trust those certificates by putting the new root certificate in their trusted root certificate store. This is no different to the trick that corporate proxy servers use to enable content filtering of your secure browser traffic. But here we don't need to sniff all traffic, just the traffic that we already own coming from Office 365, which means we can pre-create the certificates rather than generate them on the fly.

Trust Me

The first step is to create a new set of certificates that represent the trust between organisation and employee. There's a great article describing the process here, but we'll run through the steps (replace the word tenant with your SharePoint Online tenant name).

  • First find your copy of makecert.exe, mine is here C:\Program Files (x86)\Windows Kits\8.1\bin\x64.
  • Create a new certificate authority (or you can use one that your organisation already has). Mine will be called “Petes Root Authority” and it is from this root that we will build the new domain specific certificates.

Then use the certificate authority (stored in CA.pvk and CA.cer) to create a new certificate for tenant.sharepoint.com

And one for the CDN domain cdn.sharepointonline.com https://gist.github.com/petermreid/4a2bee1b3b90dfef39d0
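The commands look roughly like this (a hedged sketch of the makecert/pvk2pfx steps; the gists linked above have the full detail, and the file names here are placeholders):

# Create the root certificate authority (self-signed, exportable private key)
.\makecert.exe -r -pe -n "CN=Petes Root Authority" -a sha256 -cy authority -sv CA.pvk CA.cer

# Create a server certificate for tenant.sharepoint.com signed by that authority
.\makecert.exe -pe -n "CN=tenant.sharepoint.com" -a sha256 -sky exchange -iv CA.pvk -ic CA.cer -sv tenant.pvk tenant.cer

# Repeat for cdn.sharepointonline.com, then package each as a .pfx for the Web Hosting store
.\pvk2pfx.exe -pvk tenant.pvk -spc tenant.cer -pfx tenant.pfx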

Following those instructions you should end up with a Certificate Authority and a server certificate for each of tenant.sharepoint.com and cdn.sharepointonline.com, which we will install on our Office 365 Application Publisher. This is a classic man-in-the-middle attack, where the man in the middle is you!

Application Publication

To publish the cloud service internally we will use a Windows Server and Internet Information Services (IIS) equipped with Application Request Routing (ARR). Normally this set up is used to publish content to the Internet but we will use it the other way around, to publish the Internet internally. The server can be anywhere, in the cloud or on-prem or even in a branch office but it just needs to have the opportunity to intercept our browsing traffic.

IIS Install

In the Server Manager->Local Server

  • IE Enhanced Security Configuration set to Off for Administrators
  • Manage->Add Roles and features->Role based->Add the Web Server(IIS) role.
  • Click through the “Features”
  • Click “Web Server Role”
  • Click on “Role Services”
  • Click Health and Diagnostics->Tracing (used to debug issues)

  • Click Install
  • Close the dialog when complete
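If you prefer PowerShell, the equivalent is roughly this (a sketch using the standard role and role service names):

# Web Server (IIS) role plus the Tracing role service used for debugging
Install-WindowsFeature Web-Server, Web-Http-Tracing -IncludeManagementTools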

Certificates

Install the certificates we created.

  • Install the CA into Trusted Root Certificate Authorities

  • And the server certificates into the WebHosting store by double clicking on each .pfx file,
  • choosing Store Location ->Local Machine and
  • Place all certificates in the following store ->Web Hosting

Click Start and type “Manage Computer Certificates”  and choose the Web Hosting branch to see the installed certificates
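The same imports can be scripted with the PKI module cmdlets (a sketch; the file names and password prompt are placeholders):

# Root CA into Trusted Root Certification Authorities
Import-Certificate -FilePath 'C:\Certs\CA.cer' -CertStoreLocation 'Cert:\LocalMachine\Root'

# Server certificates into the Web Hosting store
$pfxPassword = Read-Host -AsSecureString 'PFX password'
Import-PfxCertificate -FilePath 'C:\Certs\tenant.pfx' -CertStoreLocation 'Cert:\LocalMachine\WebHosting' -Password $pfxPassword
Import-PfxCertificate -FilePath 'C:\Certs\cdn.pfx' -CertStoreLocation 'Cert:\LocalMachine\WebHosting' -Password $pfxPassword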

Web Platform Installer

  • Click Start and type IIS Admin and enter
  • Click on the machine name
  • Accept the dialog to install the Web Platform Installer and follow steps to download, run and install
  • Once installed, Click products and search for “Application Request Routing”
  • Then use the platform installer to install “Application Request Routing 2.5 with KB2589179”
  • Accept the Pre-requisites install and click Install and then Finish

  • Close IIS Admin and reopen using Start->IIS admin

Application Request Routing

  • Click on the top level server name and double click on “Application Request Routing Cache” icon
  • Under “Drive Management” select Add Drive and choose a new empty drive to store content in
  • Under “Server Proxy Settings” check the “Enable proxy” box
  • Clear the check box "Reverse rewrite host in response headers" (we just want it to pass content right through without any modification)
  • and click “Apply”
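These settings end up in applicationHost.config, so once ARR is installed the same proxy configuration can also be applied from PowerShell (a sketch using the attribute names from the ARR proxy section):

# Enable the proxy and stop ARR rewriting the host in response headers
Set-WebConfigurationProperty -PSPath 'MACHINE/WEBROOT/APPHOST' -Filter 'system.webServer/proxy' -Name 'enabled' -Value $true
Set-WebConfigurationProperty -PSPath 'MACHINE/WEBROOT/APPHOST' -Filter 'system.webServer/proxy' -Name 'reverseRewriteHostInResponseHeaders' -Value $false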

Web Site

  • Click on the “Default Web Site” choose “Explore” and deploy the following web.config to that directory.
  • This config file contains all the rules required to configure Application Request Routing and URL Rewrite to forward the incoming requests to the site and to send the responses back to the requestor.
  • There is an additional filter (locked down to sharepoint and cdn) to ensure the router doesn’t become an “open proxy” which can be used to mask attacks on public web sites.

Site Binding

The default site will listen for HTTP requests on port 80; additional bindings are required to serve the HTTPS traffic.

  • In IIS Admin Click Bindings…Add Site Binding
  • Choose https and add bindings for all the  certificates we created earlier (make sure to select Require Server Name Indication and type the corresponding host name so IIS knows which certificate to use for which request)
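A scripted version looks something like this (a sketch; SslFlags 1 turns on Server Name Indication, and the certificate still needs to be associated with each binding, which the IIS Admin UI does for you):

Import-Module WebAdministration

New-WebBinding -Name 'Default Web Site' -Protocol https -Port 443 -HostHeader 'tenant.sharepoint.com' -SslFlags 1
New-WebBinding -Name 'Default Web Site' -Protocol https -Port 443 -HostHeader 'cdn.sharepointonline.com' -SslFlags 1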

URL Rewrite

Just one simple rule drives it all. All it does is take any incoming request and re-request it! (Note: browsing directly to the website will cause an infinite loop of requests, which IIS kindly breaks after about 20 requests.) Double click the URL Rewrite icon in IIS to see the rule represented in the GUI…

Or the same as seen in web.config…

Network Configuration

DNS

So now all we have to do is change the mapping of the URL to point to our web server instead of Microsoft's. If this is an organisation with an internal DNS then it is easy to repoint the required entries. Otherwise, and for testing, we can use the local DNS override, the "hosts" file (located at [Windows]\System32\drivers\etc), by adding the following entries:

ipaddress.of.your.apppublisher tenant.sharepoint.com
ipaddress.of.your.apppublisher cdn.sharepointonline.com

(Note: if you've done this using an Azure server then the IP address will be the address of your "Cloud Service".) Now when you browse to tenant.sharepoint.com the traffic is actually coming via our new Application Publisher web site, but the content is the same! If you want to check whether the content is actually coming through the router, have a look at the response headers (using Fiddler), since ARR inserts an additional header X-Powered-By: ARR/2.5

But remember that any user who intends to use this accelerator to receive content must have the root certificate installed in their "Trusted Root Certificate Authorities" store. By installing that root the user is acknowledging that they trust the acceleration server to intercept traffic on their behalf.

Accelerate

So, that's an awful lot of effort to get the same content on the same URL. But there is something subtly different: we now "own" the traffic and can introduce speed, bandwidth and latency improving strategies. Out of the box, the Application Request Routing module we used implements caching, which will honour the cache control headers on requests to provide an intermediate shared cache across your users of the service. Simply click on the top level web site, choose Application Request Routing Cache, add a drive that it can use for caching, and it will immediately start delivering content from the cache (where it can) instead of going back to the source, providing immediate performance improvements and bandwidth savings.

Now that we own the traffic there are many other, smarter ways to deliver the content to users, especially where they are isolated in branch offices and low bandwidth locations. I'll be investigating some of those in the next blog, which will turn our internal content delivery network up to 11!

Simulate moving to the Cloud with NEWT

I've blogged a bit in the past about the unique challenges encountered when moving to the cloud and the unavoidable consequence of introducing new network hops when moving workloads out of the data centre. I'm currently working for a unique organisation in the mining industry which is quite aggressively pursuing cost saving initiatives and has seen cloud as one of the potential savings. The uniqueness of the IT operating environment comes from the dispersed and challenging "branch offices" which may lie at the end of a long dedicated wired, microwave or satellite link.

Centralising IT services to a data centre in Singapore is all very well if your office is on a well serviced broadband Internet link, but what of those other data centres with more challenged connectivity? They have until now been well served by the Exchange server, file server or application server in a local rack, and quite rightly want to know "what is the experience going to be like for my users if this server is at the other end of a long wire across the Pilbara?"

Moving application workloads to best serve these challenging end user scenarios depends heavily on the application architecture. The mining industry, more so than other industries, has a challenging mix of file based, database, and high end graphical applications that might not tolerate having tiers of the application spanning a challenging network link. To give an equivalent end user experience, do we move all tiers of the application and RDP the user interface, move the application server and deliver HTTP, or move just the database or files and leave the application on site?

Answering these questions is not easy; it depends on hard-to-measure qualitative user experience factors like scrolling, frame rate, opening times and report responsiveness, and the answer can normally only be determined once the workload has been moved out of the local data centre. What we need is a way to synthesise the application environment as if additional network challenges were injected into the application stack, without actually moving anything.

What we need is a "Network Link Conditioner", available on Mac and UNIX environments but not on Windows. Introducing NEWT, or Network Emulator Toolkit for Windows, a product of the hardworking Microsoft Research team in Asia. And for such a powerful and important tool in demonstrating the effects of moving to cloud, it's not so easy to find!

Installing

It's been around for a while and there are a few odd web sites that still host copies of the application, but the most official place I can find is the new and oddly named Chocolatey gallery download site. Once you have Chocolatey installed it's just a simple one-line install (see below).
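A hedged sketch of that install (the exact Chocolatey package id may differ, so search the gallery for NEWT to confirm):

# From an elevated PowerShell prompt with Chocolatey installed
choco install newt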

So what can it do? It can help to simulate various network challenges by using a special, configurable network adaptor inserted into the network stack. After you install the application on a machine, try opening “Network and Sharing Centre” and then “Change adapter settings” and then right click “Properties” on one of your Network Adapters and you should see this.

A lap around NEWT

The documentation that comes with NEWT is actually very good, so I won't repeat it here. There is particularly detailed information on the inner workings of the network pipeline and the effect of configuration options on it. Interestingly, the NEWT user interface is just one of the interfaces; it also supports .NET and COM interfaces for building your own custom solutions.

Running

  • Add a Filter to one or all of your network adaptors and choose the IP addresses and/or ports you want to affect
  • Add up and/or down Links and set the desired effect on bandwidth restriction, additional latency, packet loss and reordering.
  • Try Action->Toggle Trace, Push Play and have a look at the trace windows to see the packets being sniffed

To quickly check it is working as expected: ping Google and note the latency (32mS) and IP address (74.125.237.209), set a Filter with 100mS of up and down link latency, run NEWT, and ping again.

There's now an additional 200mS (100 up and 100 down, plus some NEWT filter overhead) added to the Google home page. You will notice it if you open a browser window to Google and hit Refresh (to refresh the browser cache). If you want to go back in time, try one of the pre-created filters like the "Dialup56K" setting! So, enough messing around, what about the serious stuff?

So here are a few scenarios I’ve used it in recently:

Throttling Email

Why: When moving from an in house hosted exchange environment to Office 365 Email, email mail box synchronisation traffic must pass through corporate Internet facing networking infrastructure. Outlook (with the use of Cached Exchange mode) is actually extremely good at dealing with the additional latency, but some customers may be concerned about the slower email download rates now that Exchange is not on the same network (don’t know why since Email is inherently an asynchronous messaging system). To demonstrate the effect of slower downloads:

  • Download a file using the browser and note the download speed (this is to estimate the bandwidth available to an end user across the corporate Internet facing infrastructure)
  • Determine the ip address of the corporate exchange server and add a NEWT Filter for that ipaddress
  • Add a Downlink and limit the bandwidth to the previously noted download speed available to the browser (optionally also add the latency to the Office 365 data centre)
  • Send a large email to your Inbox

Migration to SharePoint Online

Why: When you have an existing Intranet served internally to internal users it can be difficult to describe what the effect will be of moving that workload to SharePoint Online. There are certainly differences but browsers are remarkably good at hiding latency and tools like OneDrive become more important. Demonstrating these effects to a customer to ease concerns about moving offshore is easy with NEWT.

  • Estimate the latency to the SharePoint Online data centre (Australia to Singapore is around 130mS each way)
  • Add a NEWT Filter for the internal Intranet portal
  • Add an Uplink and Downlink and add on your estimated Singapore latency
  • Browse and walk through the Intranet site showing where latency has an effect and the tools that have been built to get around it.

And remember, for web applications like this your other favourite tool, Fiddler, still works as expected because it operates at a different layer of the network stack.

Migrating an Application Database to the Cloud

Why: There are many cases where an application is deployed to local machines but communicates with a central database, which may be worth moving to a cloud based database for better scalability, performance, backup and redundancy. Applications that have been built to depend on databases vary significantly in their usage of the database. Databases are extremely good at serving needy applications, which has encouraged inefficient development practices and chatty database usage. For this reason, database bound applications can vary significantly in their ability to tolerate being separated from their database. NEWT can help synthesise that environment to determine whether you've got an application that is tolerant or intolerant of a database move to the cloud.

  • Determine the connection string used by the application to connect to the database
  • Add a NEWT Filter for the ipaddress of the database on the database connection port (1433)
  • Inject as much latency as you need to simulate the distance between this desktop application and the intended location of the database.

Migrating an Application to the Cloud

Why: The question often comes up of lifting and shifting an entire application to the cloud along with all dependent tiers and presenting the application back to the user over RDP. Will an RDP interface be effective for this application?

  • Set up two machines, one running the application user interface (with Remote Desktop Services enabled) and one running the Remote Desktop client
  • Inject the required latency using a NEWT filter on the application machine ipaddress and on the RDP port 3389.
  • Allow the user to operate the application using the synthesised latent RDP interface

There are many other possible uses, including synthesising the effects of packet loss and reordering on video streaming, and its use in demonstrating the effect of network challenges on Lync or other VOIP deployments is invaluable. In a corporate IT environment where a cloud migration strategy is being considered, NEWT is an invaluable tool to help demonstrate the effects of moving workloads, make the right decisions about those workloads, and determine which interfaces between systems and people have, and have not, been designed and built to tolerate a split across a large and unpredictable network.

Do It Yourself Fiddler Service

I recently upgraded to Windows 8.1 which required a full install (upgraded from the 8.1 Preview which annoyingly didn’t support upgrades). A full install of my laptop is getting easier and easier as more of the things I use are delivered as services. The install list is getting smaller due to the combined effect of software as a service and a simpler working life.

I still had to install these:

  • Microsoft Office 2013
  • Microsoft Visio 2013
  • Microsoft Project (yes yes I know but there really is no good alternative yet)
  • LastPass
  • Visual Studio 2013
  • And…Fiddler!

The install list from bare metal to productive machine is getting smaller and smaller (arguably Office is available online, but slightly under-featured for my liking). The world of development is moving online too. There is a significant move away from the personal development environment backed by shared source control and a build machine to a development environment that is team first, always integrated, and continuously saved to a repository and built by default. The editing and debugging experience is moving online too. Take a look at jsFiddle for a window into the online development experience. I think we'll see a lot more in this space and much stronger interaction with continuously integrated source code repositories.

Fiddler

Anyway, I digress; what I really wanted to highlight is Fiddler, that invaluable tool that injects itself as a proxy between your browser and the internet and even performs a man-in-the-middle attack on your secure traffic. I use it all the time for everything from solving performance problems to reverse engineering web sites. But what does the Fiddler of the future look like? How can it be relevant in a service oriented world where the PC is no longer the centre of your universe but just a dumb terminal? In a cloud and service oriented world there are a great many conversations happening between remotely hosted machines that we are interested in listening in on. When developing a modern cloud solution which aggregates a number of RESTful services, sometimes the most interesting conversations are happening not between browser and server but between machines in a data centre far, far away.

What I need is a Fiddler as a Cloud Service

SignalR

Alert readers of my previous post may have noticed something; there was a technology mentioned in the foreword that I didn't really use in the implementation of the WebAPI Proxy: "SignalR". SignalR is one of those cool new technologies built over the WebSockets protocol, a technological leap forward in Internet technologies and a game changer for implementing complex interactive websites, significantly augmenting the page request based model of the Internet we have been stuck with since the 80's. Broadly speaking, the WebSockets API enables a server to open and maintain a connection to the browser which can be used to push information to the client without user interaction. This is much more like the traditional thick client application model, where TCP connections are held open to create an interactive real-time user experience. While this technology has opened the door for a whole new style of web based applications, it invites back all those scalability problems of managing client state which were so artfully dodged by the disconnected, stateless model the Internet is built on today.

So what’s all this got to do with Fiddler?

What if we could deploy to the cloud a WebAPI Proxy as described in the previous blog but, when passing traffic through, use SignalR to feed the logging information back to a browser? Then we'd have something much like Fiddler, not installed on the local machine but running as a service in the cloud, able to intercept and debug traffic of anyone choosing to use it as a proxy. Sounds good in theory, but how to implement it?

Again, as I demonstrated with the proxy, it's relatively easy to plug handlers into the pipeline to modify the behaviour of the proxy, and that includes putting in a logging handler to watch all traffic passing through. Something like this:

protected override async System.Threading.Tasks.Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, System.Threading.CancellationToken cancellationToken)
{
  //add a unique Id so we can match up request and response later
  request.Properties["Id"] = System.Guid.NewGuid().ToString();
  //log the request
  await LogRequest(request);
  //start a timer
  Stopwatch stopWatch = Stopwatch.StartNew();
  //pass the request through the proxy
  var response = await base.SendAsync(request, cancellationToken);
  //stop the timer
  stopWatch.Stop();
  //log the response and the time taken
  await LogResponse(request, response, stopWatch);
  return response;
}

Now comes the interesting part: adding SignalR from NuGet adds a folder into your WebAPI project called Hubs. A SignalR "Hub" is your opportunity to set up the relationship between an incoming browser request and an outgoing signal (or push) to the browser. It is in this hub that you can add handlers for javascript functions on the client and use the Context.ConnectionId to identify the incoming request, which can later be used to push data out to the same client.

Client side Javascript:

 $("#start").click(function ()
 {
   loggingHub.server.start();
 });

Server side Hub handler:

 public void Start()
 {
   Trace.WriteLine("connection", Context.ConnectionId);
 }

So that's pretty cool: we can call functions on the server from the client javascript (once you understand the naming conventions that deal with language case differences). But then, you could always call functions on the server from the client; that's what HTTP is. The difference here is that in the server side function "Start" there is a Context.ConnectionId, a GUID that represents the caller and can be used later to asynchronously call the client back.

Fiddler as a Service

A service isn't a service unless it's multi-tenant. So what we'll do is build the ability for a client to register interest in just some of the logs passing through the proxy. For simplicity we'll just pick up the user's IP address and push logs at them for anything we see running through the proxy from that IP address. That way we get something similar to local Fiddler, where we set the internet proxy for browsing and see the logs for anything running through the proxy on that machine.

Remember (from the previous post) the code written for this proxy is designed to be hosting platform agnostic; that is, it can be run in IIS but can also be run on the new lightweight hosting platforms such as OWIN. But this throws up some interesting challenges when trying to sniff out HTTP properties like the client IP address, which are represented through different APIs. So I use some code like this to stay generic:

public static string GetClientIp(this HttpRequestMessage request)
{
   if (request.Properties.ContainsKey("MS_HttpContext"))
   {
     return ((dynamic)request.Properties["MS_HttpContext"]).Request.UserHostAddress as string;
   }
   else if (request.Properties.ContainsKey("MS_OwinContext"))
   {
     return ((dynamic)request.Properties["MS_OwinContext"]).Request.RemoteIpAddress as string;
   }
   else if (request.Properties.ContainsKey(RemoteEndpointMessageProperty.Name))
   {
     RemoteEndpointMessageProperty prop;
     prop = (RemoteEndpointMessageProperty)request.Properties[RemoteEndpointMessageProperty.Name];
     return prop.Address;
   }
   else
   {
     throw new Exception("Could not get client IP");
   }
}

Next we need a page on the proxy that the client can request to register their interest in logs, this is called logging.html and apart from including the necessary signal-R javascript glue, just has a couple of buttons to Start and Stop logging and a list to accept the logs coming back from the browser. (Serving that static page was a bit of a challenge at the time of writing since there was no official Static File Handler for OWIN so I wrote my own.)

Now here's where the interesting part of SignalR shows itself. There is some javascript on the page which defines a function called "logUrl" and some code on the server which calls that function from the WebAPI handler. So whenever a request or response comes through the proxy we look up the IP address, determine which registered hub connection it maps to, and call the appropriate client's logUrl function, which logs the data in the browser window.

Server side logging:

 private static async System.Threading.Tasks.Task LogRequest(HttpRequestMessage request)
 {
   string connection;
   if (connections.TryGetValue(request.GetClientIp(), out connection))
   {
     var loggingHubContext = GlobalHost.ConnectionManager.GetHubContext<LoggingHub>();
     await loggingHubContext.Clients.Client(connection).LogUrl(request.RequestUri.ToString());
   }
 }

Client side logging Javascript:

 loggingHub.client.logUrl = function (url)
 {
   $('#logs').append('<li>' + url + '</li>');
 };

Anyway, what we should end up with is something that looks like this:

Busker

What better name for Fiddler as a Service but “Busker”? Busker Proxy is running as a WorkerRole in Azure. (I can’t use the free Azure WebSites because the proxy effectively serves traffic for any url but Azure WebSites uses host headers to determine which site to send traffic to.) While I’m still hosting the service for a limited time, try this:

Set your internet proxy to http://busker.cloudapp.net port:8080

Browse to http://busker.cloudapp.net/logging.html

Hit Start and then open a new browser window and browse to a (non secure) website like http://www.google.com. You should see something like this in your first browser window.

For now it’s an interesting debugging tool and could do with quite a bit more work. But more importantly it demonstrates how the technologies available to us as developers are now helping to bridge the gap between web client and server to provide better, more interactive software as a service from a remote, deployed, multi-tenant environment replacing the need to install applications locally other than a browser.

On the todo list for Busker:

  • There are some very interesting certificate tricks that Fiddler plays on your local machine to implement the man in the middle attack which will be quite hard to implement securely in a Fiddler as a Service
    (I might give that a go in a later blog).
  • The logging screen should be able to nominate an ipaddress rather than just using the one you come from so you can sniff traffic that doesn’t just originate from your machine
  • The logging view needs to properly match up requests and responses so you can see things like response time and outstanding requests

Check out the code here https://github.com/petermreid/buskerproxy

Let me know if there are any cool features I should consider adding.

Happy Busking

Do It Yourself Web API Proxy

I had promised a couple of blogs on dealing with the challenges of distance that are unavoidable as we adopt a variety of dispersed Cloud deployed services. The first was using a WCF Custom Channel to cache SharePoint content which is now a bit old school. This is the second.

The rate of change at the moment is astonishing. I've been sitting on blog number two for quite some time, but whenever I go to build an example and type it up, something new has come along that is a slightly better way of doing it. But then that's why we are in this business, right? So it is with much haste that I blog this while it is still relevant (I think it still is) or, at the very least, while it is buzzword (Web API, SignalR, OWIN and Katana) compliant.

Something in one of the previous blog comments got me thinking.
“Thanks peter for nice blog. It helps allot. I am planning to write Routing service which can route both SOAP and OData requests. I took your solution and tried to debug. I am getting empty string while reading message before sending to client.”

Turns out that while WCF is designed to be transport agnostic, it’s hard to get it to be message format agnostic. What would be good is something that doesn’t get in the way but sits off to the side of the messaging but can be used to get in and affect messages, like adding authentication, caching or some other functionality. What I need is a forward proxy, not a router.

From the great "Succinctly" book series: "A proxy server is a computer that sits between a client and server. A proxy is mostly transparent to end users. You think you are sending HTTP request messages directly to a server, but the messages are actually going to a proxy. The proxy accepts HTTP request messages from a client and forwards the messages to the desired server. The proxy then takes the server response and forwards the response back to the client. Before forwarding these messages, the proxy can inspect the messages and potentially take some additional actions."

Which got me wondering, what is it that happens when you tick that proxy box in your browser? Is there something special about those servers we use to browse traffic through?

So I built a simple web server, set my local IIS as the proxy server, and with some config and fiddling the IIS web server was receiving the traffic when I tried to browse sites on the Internet. OK, so what? What if we capture that traffic in a web service, send it on, and hand back the replies just as a proxy would?

Web API

Web API is a great, simple technology for building simple APIs that serve the HTTP protocol. How hard would it be to build a simple server using Web API that proxies traffic straight through? Turns out it's easy. Way, way easier than in WCF! All you need to do is build a Web API DelegatingHandler like this:

public class ProxyHandler : DelegatingHandler
{
  protected override async System.Threading.Tasks.Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, System.Threading.CancellationToken cancellationToken)
  {
    UriBuilder forwardUri = new UriBuilder(request.RequestUri);
    //strip off the proxy port and replace with an Http port
    forwardUri.Port = 80;
    //send it on to the requested URL
    request.RequestUri = forwardUri.Uri;
    HttpClient client = new HttpClient();
    var response = await client.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
    return response;
  }
}

It's worth taking a moment to understand what this code is doing. For every request that comes in on the proxy port (by convention I use port 8080), change it to port 80, make the request and return the response. Simple. The other thing to note, and the big change over my previous WCF version, is the use of async. This enables the proxy to listen for new requests while previous requests are still awaiting responses, perfect for a proxy. Next is to configure the proxy:

config.Routes.MapHttpRoute(
  name: "Proxy",
  routeTemplate: "{*path}",
  handler: HttpClientFactory.CreatePipeline
  (
    innerHandler: new HttpClientHandler(), // will never get here if proxy is doing its job
    handlers: new DelegatingHandler[]
    { new ProxyHandler() }
  ),
  defaults: new { path = RouteParameter.Optional },
  constraints: null
);

For all requests ({*path}) use the default HttpClientHandler but put the ProxyHandler in the pipeline before it. All requests are actually handled by the ProxyHandler and should never fall through to the inner handler. There are a couple of additional tricks to get it working in all cases. You'll need this for some reason that I haven't yet figured out, but I think it's a bug in Web API:

 //have to explicitly null it to avoid protocol violation
 if (request.Method == HttpMethod.Get)
   request.Content = null;

You’ll also need this in your web.config if you are going to host it in IIS (more on hosting options later) to tell IIS that you want the proxy code to handle all requests, no matter what!

<system.webServer>
  <!-- needed if you want to run in IIS and handle everything! -->
  <handlers>
    <remove name="ExtensionlessUrlHandler-Integrated-4.0" />
    <remove name="OPTIONSVerbHandler" />
    <remove name="TRACEVerbHandler" />
    <add name="ExtensionlessUrlHandler-Integrated-4.0" path="*." verb="*" type="System.Web.Handlers.TransferRequestHandler" preCondition="integratedMode,runtimeVersionv4.0" />
  </handlers>
</system.webServer>

And, most importantly, this will tell your proxy not to use a proxy when making requests to the Internet (to avoid an endless loop of proxy requests)

<system.net>
  <defaultProxy enabled="false" />
</system.net>

Proxy done!

Hosting

The above code is hosting platform agnostic. This is one of the areas undergoing a lot of change in the industry right now. In what appears to be a new era of openness, open source and interoperability (even by Microsoft!) there are now a range of ways to host web services. Long story short: .NET Web != IIS, which is great. IIS is a very big, cumbersome engine if you don't really need it. And this is a perfect example; as a simple forward proxy we don't want or need any of that web serving functionality getting in the way. In fact we had to put special configuration into IIS to get it to leave our requests alone.

Since Web API is not tied to the web hosting framework in .NET, it supports self-hosting, which is useful for running cheap APIs in console apps or WorkerRoles. But now there is another option: OWIN. OWIN defines an open API for building web applications over a hosting platform by delivering a well-defined pluggable architecture. Katana is Microsoft's implementation of OWIN. This article has a great explanation of how we got to such a complicated place with the pairing of ASP.NET and IIS, and how we are now finally winding it back out again with OWIN, Katana and various other supporting frameworks.

Like anything so new, this area is still in flux. The APIs are a bit green and changing, and some simple modules (like static file delivery) are still in beta. But it's a great idea, and we can use it right away for our proxy: with just a simple line of code we can spin up the proxy in a console app anywhere.


public static void Start(string proxyAddress)
{
  try
  {
    // Start OWIN proxy host
    apps.Add(WebApp.Start<Proxy>(proxyAddress));
    Trace.TraceInformation("Listening on:" + proxyAddress);
    Trace.WriteLine("Set your IE proxy to:" + proxyAddress);
  }
  catch (Exception ex)
  {
    // log and swallow start-up failures
    Trace.TraceError(ex.Message);
  }
}

The abstraction of OWIN also enables us to just as easily deploy into an Azure WorkerRole; again, I prefer a worker role so we don't get all the annoying complexity and default modules that come with IIS and a Web Role.

foreach (var rie in RoleEnvironment.CurrentRoleInstance.InstanceEndpoints)
{
  var endpoint = rie.Value;
  if (endpoint.Protocol == "http")
  {
    string baseUri = string.Format("{0}://{1}", endpoint.Protocol, endpoint.IPEndpoint);
    Trace.TraceInformation(String.Format("Starting OWIN at {0}", baseUri));
    Proxy.Start(baseUri);
  }
}

OK, so now we have built a proxy server which can be used to channel all requests from your machine through the Web API proxy, which is pretty powerful. We can deploy the proxy to Azure in the US, UK or Asia and use it to browse sites unavailable in Australia. Try it out yourself by setting your IE proxy to busker.cloudapp.net port 8080. [The choice of the name "Busker" will become apparent in the next blog entry]

And try browsing www.google.com to see the results through the proxy (my US based proxy makes Google think I’m in Kendale Lakes Florida)

and direct (that's me with the proxy off, getting Australian results)

But while that's fun for tricking Google or for viewing Netflix, we could do the same with a publicly available proxy or by deploying a commercial proxy server (TMG, Squid etc.) into a virtual machine in the cloud. The difference, and the power, is in how we can inject custom behaviours using the WebAPI pipeline to get the proxy to affect the traffic.

Domain constrained handling

The idea of a proxy is to handle all traffic, no matter what the domain. A web site or Web API is targeted at providing many URLs under a single domain, and it shows in the frameworks: they have great URL mapping and filtering capability but largely ignore the domain. What if we want our proxy to behave differently for different domains? The answer is in this line:

constraints: null

We can add a custom constraint to ensure a handler is only invoked if the constraint is satisfied and in the constraint we can implement any sort of filtering we like.

constraints: new { isLocal = new HostConstraint { Host = "table.core.windows.net" } }

The HostConstraint class has a single method Match() which does something simple like:

public bool Match(HttpRequestMessage request, IHttpRoute route, string parameterName, IDictionary<string, object> values, HttpRouteDirection routeDirection)
{
  return request.RequestUri.Host.Contains(Host);
}

So now I can make the proxy plug in different handlers and therefore behave differently for different domains!

Domain Specific Handlers

Previously I demonstrated using a WCF Router to modify the Azure Table API and make it a true OData interface which can be consumed by Excel. To do it we needed to create the Azure Table Storage authentication tokens and inject the OData paging elements over the vanilla table.core.windows.net service. That same functionality has been ported to the proxy and plugged in as another handler, constrained to just requests to the Azure Table Service (table.core.windows.net).

//now plug in some AzureAuth config handling using a Controller
//http://backtester.table.core.windows.net/config?connectionstring=DefaultEndpointsProtocol=https;AccountName=backtester;AccountKey=

config.Routes.MapHttpRoute(
  name: "ConfigAzureAuth",
  routeTemplate: "config",
  defaults: new { controller = "ConfigAzureAuth" },
  constraints: new { isLocal = new HostConstraint { Host = "table.core.windows.net" } }
);

Try setting the browser proxy to busker.cloudapp.net:8080 and use a browser or Excel to browse to http://backtester.table.core.windows.net/Quote(). This account holds a table with some stock quotes in it. Without the proxy in play this would fail; with it, the AzureTableHandler injects the necessary authentication headers and paging over the straight Azure Table API. Or submit your own storage keys to the proxy to browse your own table storage in a browser or Excel using this config URL:


http://backtester.table.core.windows.net/config?connectionstring=DefaultEndpointsProtocol=https;AccountName=backtester;AccountKey=

Note something a bit unusual here: it looks like the Azure URL http://backtester.table.core.windows.net supports a URL path "config", but actually it doesn't; the proxy picks up that URL and handles it instead of sending it on to Azure.

Client Specific Handlers

So now we can configure the proxy to behave differently for different destinations, but what about different requesting clients? Yes we can: one (very simple) way is through the IP address, although we could also write domain specific cookies back to the user from the proxy. I've used this ability to add some additional functionality using a handler in the pipeline if a user has registered for it. Try this (when your proxy is set to busker.cloudapp.net port 8080): http://www.buskerimage.com/flip or this http://www.buskerimage.com/bw. This website (www.buskerimage.com) doesn't actually exist, but the proxy is listening for that domain and using it to configure the proxy. In both cases the proxy is remembering your IP address and manipulating the images on the way through, either flipping them or going old school black and white (or both). Here's what Bing Images looks like with both on.

Note only requests from an IP address registered for black and white and flipped images will see this view. OK, so that's all fun, what about the serious stuff? That's in the next blog, when these building blocks are put to good use. But it's fun to play with for now.

Running on Azure for now here: busker.cloudapp.net:8080

Available on github here: https://github.com/petermreid/buskerproxy