The New SharePoint App Architecture

To be honest, I’ve had this code for a while now sitting in the blog cupboard. It was originally intended to be used as part of a global content management system delivering content from a single SharePoint authoring point to many distributed content delivery systems. Since then, events have conspired to bring it back out of the cupboard but for a slightly different purpose. As I said in my previous blog, two things have happened in the world of SharePoint:

  • SharePoint went “Online” which means it is hosted centrally in a data centre (most likely) far-far away.
  • SharePoint Online got picky about what you can run in the SharePoint application and “spat out” other application functionality to be hosted elsewhere.
    (This incidentally, not only makes security sense for Microsoft but also economic sense as you will most likely pay for your SharePoint license and some other hosting environment for your apps as well!)

So back to the problem; in short what used to look like this for most on-premise SharePoint deployments:

Has become a more challenging architecture like this:

Here the SharePoint Online environment, the App hosting environment and the user may all be separated by significant distance. To be fair, Microsoft did see this coming. After a less than successful attempt at moving SharePoint 2007 online with “BPOS”, and then the slightly more successful attempt (using the “Sandbox”) with SharePoint 2010, there was a realisation that the only truly flexible solution to multi-tenant customisation support is to let you run outside SharePoint and call back in through services. That’s why so much work has gone into the SharePoint Web Services and in particular the expanded SharePoint 2013 REST interface.

This blog looks at a solution for bringing the set of SharePoint Web Services (.asmx) closer to the consumer and (if I get time) I’ll have a go at doing something similar with the REST interface.

SharePoint Web Services

The inspiration for building a SharePoint Service cache came from watching how one of the Microsoft provided caches works under the hood. SharePoint Workspace (or SkyDrive Pro) uses the SharePoint Web Services to maintain a remote copy of your SharePoint data. Sync a library with SkyDrive Pro and double click on a folder while Fiddler is running to see what happens.

  • _vti_bin/Webs.asmx (GetWeb)
    Hey SharePoint tell me about this library, and am I still allowed to sync it?
  • _vti_bin/UserGroup.asmx (GetCurrentUserInfo)
    Ah and this person trying to sync, do they still have access to this library?
  • _vti_bin/Lists.asmx (GetListCollection)
    Ok tell me about the Lists and Views you have
  • _vti_bin/Lists.asmx (GetListItemChangesWithKnowledge)
    Oh now that’s interesting

Knowledge is a term that turns up in the world of the Sync Framework and reveals the heritage of these services. Incidentally, knowledge looks something like this:

<syncScope>{9FA5B692-0736-4722-9C2E-880F3CDDDC2C}</syncScope>
<knowledge>
  <sync:syncKnowledge xmlns="http://schemas.microsoft.com/2008/03/sync/" xmlns:sync="http://schemas.microsoft.com/2008/03/sync/">
    <idFormatGroup>
      <replicaIdFormat sync:isVariable="0" sync:maxLength="16"/>
      <itemIdFormat sync:isVariable="0" sync:maxLength="16"/>
      <changeUnitIdFormat sync:isVariable="0" sync:maxLength="1"/>
    </idFormatGroup>
    <replicaKeyMap>
      <replicaKeyMapEntry sync:replicaId="XkTpcTZbtEKiPUeWmKjV+Q==" sync:replicaKey="0"/>
      <replicaKeyMapEntry sync:replicaId="N72lfr+UT/qOkqOR0vmh1A==" sync:replicaKey="1"/>
    </replicaKeyMap>
    <clockVector>
      <clockVectorElement sync:replicaKey="0" sync:tickCount="13009622840318"/>
      <clockVectorElement sync:replicaKey="1" sync:tickCount="17943711"/>
    </clockVector>
  </sync:syncKnowledge>
</knowledge>

We don’t have to fully understand it, but it’s enough to say that the client is keeping a token of the last state and making a cheap call to SharePoint to determine if there are changes before showing the user a SkyDrive Pro folder with all green ticks. SharePoint, for its part, is maintaining a change list of all changes to the library which can be checked and walked down by clients and applied as a diff to the client copy of the library to keep it up to date. Wouldn’t it be good if we could implement something similar for SharePoint Web services? Intercept the normal web service call, check if the SharePoint data has changed since last time and respond locally with a cached response if there were no changes.

Well, we can.

WCF Client Cache

The solution works by implementing a “Cached Query and Results” (again see my previous blog) using a WCF Custom Channel to do the caching inline without the client being aware of it. By implementing such a channel, the client can “configure in” the cache generically without affecting the native web service API or the clients. This generic solution is covered well in this article by Paolo Salvatori and this source code.

I won’t revisit the implementation details but we will be using this component, updating it and adding some features that enable better caching outcomes for systems that expose change list tokens like SharePoint does.

At this point, if you want to understand the internals of the cache implementation read on, or skip to the SharePoint example configuration below.

Caching Restrictions

At the heart of the WCF Client Cache is the ability to plug in a number of different cache providers. By far the most powerful and scalable is AppFabric Cache. The use of AppFabric gives us some powerful building blocks and is now available either on-premise as a service, as a service in Azure or as a component running in an existing WebRole or WorkerRole in Azure. The beauty of a distributed cache like AppFabric is:

  • Cache clients can subscribe to in memory copies of the cache delivering high performance for highly used cache entries
  • Cache clients can be kept consistent, or nearly consistent across a farm.

We won’t be using any cache features that are not available in the Azure AppFabric Cache service (such as notifications and regions). While that might seem restrictive, it actually simplifies the solution space and keeps the solution cloud deployable. But the impact of that choice is that cache invalidation becomes more difficult. Ideally we’d like to be able to knock out invalid cache entries when the change token changes, but without Cache Regions, enumerating cache entries is not supported. For that reason, and for simplicity, we won’t strictly use cache invalidation, but rather “Cache Deprecation”.

The idea here is to use the fact that a cache is not a database; we don’t need to clean up and delete old entries to save space. All good caches are self-cleaning and will typically use an LRU algorithm to get rid of old items that are no longer needed. Cache deprecation means we don’t bother deleting old cache entries, they can remain in the cache as long as they are used, however we will move on from old to new entries by carefully choosing a cache key that represents change.

This has a couple of benefits:

  • No need to implement a cache invalidation event where we need to find and eliminate old entries from the cache.
  • Old cache entries can continue to be used if needed which means no cache invalidation storm where the server is hammered for new data.
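
To make “Cache Deprecation” concrete, here is a minimal sketch in Python rather than the WCF component itself. The `LruCache` class, the `cache_key` helper and the token names are all hypothetical, invented for illustration. The point it shows is that when the change token is part of the key, a new token simply stops matching the old entries and the LRU policy pushes them out in its own time.

```python
from collections import OrderedDict


class LruCache:
    """Tiny least-recently-used cache: the oldest untouched entry is evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry

    def keys(self):
        return list(self._items)


def cache_key(token, request):
    # Hypothetical key format: the change token is part of the key, so a new
    # token stops matching old entries; nothing is ever deleted explicitly.
    return f"{token}|{request}"


cache = LruCache(capacity=3)
for request in ("RequestA", "RequestB", "RequestC"):
    cache.put(cache_key("token-1", request), f"response to {request} at token-1")

# The server changes: new token, new keys. The old entries are left alone
# and only fall out when the cache needs the room.
for request in ("RequestA", "RequestB"):
    cache.put(cache_key("token-2", request), f"response to {request} at token-2")

print(cache.keys())
```

With room for only three entries, the two oldest token-1 entries have been pushed out by the token-2 entries, while the token-1 response for RequestC is still there and still usable until something newer needs its slot.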

Improvement 1: Client Cache Key Generation

The WCF Client Cache does a good job of allowing clients to choose a cache key generation option. But it is still restricted to looking at the incoming request and generating a key based on the parameters passed in. Generation of a good cache key is important to clients and may require more information than just the incoming request. The WCF Client Cache now supports a plug-in model which can be used to pass the responsibility for cache key generation back to the calling application: implement a class that supports the IKeyGenerate interface, then register it and configure it in.

That’s a pretty powerful plug in point, and in a system where we can cheaply determine if there have been changes on the source system, we can use that point to generate a cache key that represents changes on the server by making outgoing requests to the server to ask for the current “change” state.

But such a system isn’t much good if every client request results in a request to the Change Service and potentially another request to the Service itself. Remember the whole point of this is to remove chattiness and hide away the effect of latency, but we’ve just added another call across the network. OK, now this is where it gets mind bending. What if we also cache the request for the change token too?

What this enables us to do is cache the change tokens for some fixed period of n minutes and cache the Service responses forever (they will eventually fall out or get pushed out of the cache). The responses received by the client will be at most n minutes behind the changes on the server because when a change is detected on the server (moving from Token n to Token n+1) the WCF Client Cache will go looking for new cache entries.

Token        Request    CacheKey              CacheData

Token(n)     RequestA   Token(n)RequestA      ResponseA
Token(n)     RequestB   Token(n)RequestB      ResponseB
Token(n)     RequestC   Token(n)RequestC      ResponseC

Token(n+1)   RequestA   Token(n+1)RequestA    ResponseA’
Token(n+1)   RequestB   Token(n+1)RequestB    ResponseB’
Token(n+1)   RequestC   Token(n+1)RequestC    ResponseC’

Of course the cost of this is that if there are a lot of changes on the server (i.e. a lot of different Tokens) then we’ll leave a trail of unused cache entries.
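
The idea of caching the change token itself can be sketched like this. It is a Python illustration under stated assumptions, not the IKeyGenerate implementation: `TokenPrefixedKeys`, `fetch_token` and the fake clock are hypothetical names, and the request body is just a placeholder string. The token is re-fetched only once its short lifetime has passed, and every response key is prefixed with whatever token we currently hold.

```python
import hashlib


class TokenPrefixedKeys:
    """Builds cache keys of the form <change token>:<hash of request body>."""

    def __init__(self, fetch_token, token_ttl_seconds, clock):
        self._fetch_token = fetch_token  # hypothetical call to the change service
        self._ttl = token_ttl_seconds
        self._clock = clock              # injectable so the demo needs no sleeping
        self._token = None
        self._fetched_at = None

    def current_token(self):
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            # Cached token has expired: one cheap call to the change service.
            self._token = self._fetch_token()
            self._fetched_at = now
        return self._token

    def key_for(self, request_body):
        digest = hashlib.sha256(request_body.encode("utf-8")).hexdigest()[:12]
        return f"{self.current_token()}:{digest}"


# --- demo with a fake clock and a fake server ---
now = [0.0]
server = {"token": "T1", "token_calls": 0}


def fetch_token():
    server["token_calls"] += 1
    return server["token"]


keys = TokenPrefixedKeys(fetch_token, token_ttl_seconds=10, clock=lambda: now[0])
body = "<GetUserProfileByIndex>7</GetUserProfileByIndex>"

k1 = keys.key_for(body)   # first call fetches the token
server["token"] = "T2"    # something changes on the server
now[0] = 5.0
k2 = keys.key_for(body)   # token still cached: same key, no extra call
now[0] = 10.0
k3 = keys.key_for(body)   # token expired: re-fetched, key now carries T2

print(k1 == k2, k3.startswith("T2:"), server["token_calls"])
```

Three key generations cost only two calls to the change service, and the key moves on to the new token no later than the token lifetime after the server change.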

Improvement 2: MaxStale and Async Requests

MaxStale is a concept borrowed from the HTTP 1.1 protocol, where max-stale is a Cache-Control request directive.

“If max-stale is assigned a value, then the client is willing to accept a response that has exceeded its expiration time by no more than the specified number of seconds.”
When we say an item expires in 1 minute do we really mean that at exactly T+1 that item is totally unusable? Normally there is some tolerance for use of old items. MaxStale is the way to express that there is some tolerance for using old expired items. This is subtly different to making the expiry longer because it gives us an opportunity to return stale data and concurrently freshen up the stale data by issuing an async request for data. Incidentally using a large MaxStale value has the added benefit of being able to protect against service outages, as old data will be returned rather than failing the service call.
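
Here is a rough sketch of that behaviour in Python, not the WCF channel code. `StaleTolerantCache`, its parameters and the `run_async` hook are hypothetical; in the demo `run_async` just queues the refresh so the output is deterministic, where a real client would hand it to a thread pool. The sketch is not thread-safe and does not cover the service outage case.

```python
class StaleTolerantCache:
    """Serves fresh entries, serves stale ones while refreshing, blocks otherwise."""

    def __init__(self, fetch, timeout, max_stale, clock, run_async):
        self._fetch = fetch          # hypothetical call to the remote service
        self._timeout = timeout      # seconds an entry counts as fresh
        self._max_stale = max_stale  # extra seconds a stale entry may be served
        self._clock = clock
        self._run_async = run_async  # schedules a zero-argument job
        self._entries = {}           # key -> (value, stored_at)
        self._refreshing = set()     # keys with a refresh already in flight

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None:
            value, stored_at = entry
            age = self._clock() - stored_at
            if age < self._timeout:
                return value                      # fresh
            if age < self._timeout + self._max_stale:
                self._refresh_in_background(key)  # stale but tolerable
                return value
        return self._load(key)                    # missing or too old: wait

    def _load(self, key):
        value = self._fetch(key)
        self._entries[key] = (value, self._clock())
        return value

    def _refresh_in_background(self, key):
        if key in self._refreshing:
            return  # one refresh per key is enough

        def job():
            try:
                self._load(key)
            finally:
                self._refreshing.discard(key)

        self._refreshing.add(key)
        self._run_async(job)


# --- demo: fake clock, refresh jobs queued in a list instead of a thread ---
now = [0.0]
calls = []
pending = []


def fetch(key):
    calls.append(key)
    return f"{key} v{len(calls)}"


cache = StaleTolerantCache(fetch, timeout=60, max_stale=30,
                           clock=lambda: now[0], run_async=pending.append)

first = cache.get("profile-7")   # miss: blocks on the service
now[0] = 70
second = cache.get("profile-7")  # expired 10s ago: old value returned at once
pending.pop()()                  # the queued background refresh completes
third = cache.get("profile-7")   # fresh again
now[0] = 200
fourth = cache.get("profile-7")  # far beyond timeout + maxStale: blocks again

print(first, second, third, fourth, sep=" | ")
```

The second caller never waits on the network even though the entry has expired, which is exactly the difference between MaxStale and simply making the expiry longer.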

Improvement 3: Locking

Where we have heavily used keys (like the change token) or where the cost of sourcing data from the service is high, several cache clients may detect the expiration of the same cache entry at the same time, and each of them will issue a request to the service for data. For heavily used cache keys such as the change tokens, it’s worthwhile implementing a lock mechanism to prevent multiple service requests.
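
A minimal Python sketch of the locking idea follows, assuming one lock per cache key and a re-check after the lock is acquired. `KeyLockedCache` and `slow_fetch` are hypothetical names, and expiry is left out to keep the example short. Four threads stand in for the four concurrent users.

```python
import threading
import time


class KeyLockedCache:
    """Only one caller per key goes to the service; the rest wait and reuse."""

    def __init__(self, fetch):
        self._fetch = fetch             # hypothetical call to the remote service
        self._values = {}
        self._locks = {}
        self._guard = threading.Lock()  # protects the two dictionaries

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key):
        with self._guard:
            if key in self._values:
                return self._values[key]
        with self._lock_for(key):
            # Re-check: another caller may have filled the entry while we waited.
            with self._guard:
                if key in self._values:
                    return self._values[key]
            value = self._fetch(key)
            with self._guard:
                self._values[key] = value
            return value


fetch_calls = []


def slow_fetch(key):
    fetch_calls.append(key)  # record every trip to the "server"
    time.sleep(0.05)         # simulate network latency
    return f"token for {key}"


cache = KeyLockedCache(slow_fetch)
barrier = threading.Barrier(4)
results = []


def user():
    barrier.wait()  # release all four users at the same moment
    results.append(cache.get("change-token"))


threads = [threading.Thread(target=user) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(fetch_calls), set(results))
```

All four users get the same value but only one of them pays for the round trip; the other three wait on the key lock and pick up the result from the cache.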

Example: SharePoint User Profile Cache

As an example we’ll create a “Client” SharePoint test application which is simply a web reference to the UserProfileService on SharePoint. It will run locally and request user profile information for 30 users (via the web service UserProfileService.asmx) from a remote “Server” SharePoint site hosted in an Office 365 data centre in Asia. We will use the WCFClientCache to inject caching under the hood for the calls to SharePoint and respond from the cache where possible (remember the client code has no idea the caching has been injected, as it is added using the .config file).

So enough theory, let’s see it in action by load testing with 4 concurrent users for 4 minutes.

Baseline

First up a baseline no-caching load test.
<cachingBehavior enabled="false">

AVG:0.42

The test shows a response time of between 0.37 and 0.65 and averaging 0.42 seconds. This is pretty much what we would expect from a remote server half way around the world.

1 Minute Cache

Now let’s add in the cache of 1 minute.
<cachingBehavior enabled="true" timeout="00:01:00"
    header="true" keyLock="false" cacheType="MemoryCache"
    keyCreationMethod="MessageBody" regionName="Messages" maxBufferSize="524288">



AVG:0.13

The test shows a response time of between 0.012 and 0.51 and averaging 0.13 seconds. From the diagram we can clearly see the short response times for items coming from cache and, every minute, a “Cache Storm” where items expire from cache and must be refreshed from the server, causing an increase in response time. Despite executing some 800 tests, only around 120 actually made it through to the server (i.e. 30 requests are required to fill the cache every minute of the 4 minute test; the rest of the requests come from cache).

MaxStale 30 seconds

Now let’s add some tolerance for stale items with MaxStale of 30 seconds, plenty of time to get a fresh one from the server.

<cachingBehavior enabled="true" timeout="00:01:00"
    maxStale="00:00:30" header="true" keyLock="false" cacheType="MemoryCache"
    keyCreationMethod="MessageBody" regionName="Messages" maxBufferSize="524288">

AVG:0.032

The test shows a response time of between 0.012 and 0.51 and averaging 0.032 seconds! Now we are getting close. Despite having a 1 minute cache on items, there is no bump at the 1 minute points since the cache storm has been mitigated by returning stale data while the new fresh data is retrieved from the server for next time. Question is can we do any better than that?

Change Tokens

What about changes on the server? The problem with the above solution is we are lagging behind the changes on the server by 1 minute of cache and up to 30 seconds of staleness. Wouldn’t it be nice to get new data when it changes on the server? Here we need to introduce operation level configuration and the plug in key provider UserTokenPrefixed which will check the SharePoint User Profile Change Token. As a reminder, we are configuring in the following between client and server.

<cachingBehavior enabled="true" timeout="00:01:00" maxStale="00:00:30" header="true" keyLock="false" cacheType="MemoryCache" keyCreationMethod="MessageBody" regionName="Messages" maxBufferSize="524288">
  <operations>
    <!--UserProfile operations-->
    <!--User change token is cached for a short period as we regularly go back and check so we can track server changes closely-->
    <operation action="http://microsoft.com/webservices/SharePointPortalServer/UserProfileChangeService/UserProfileChangeToken"
               enabled="true" timeout="00:00:10" maxStale="00:00:00" keyLock="false" cacheType="MemoryCache" keyCreationMethod="MessageBody"/>
    <!--User Profile items are cached for a looong time with the change token on the cachekey-->
    <operation action="http://microsoft.com/webservices/SharePointPortalServer/UserProfileService/GetUserProfileByIndex"
               enabled="true" timeout="00:01:00" maxStale="00:00:30" keyLock="false" cacheType="MemoryCache" keyCreationMethod="UserTokenPrefixed"/>
  </operations>
</cachingBehavior>

During this test I deliberately browsed to a user profile in SharePoint and updated a mobile phone number at around the 2 minute mark. Less than 10 seconds later the new change token was retrieved from the server. Because it was different to the previous one, which had been used to generate the cache keys, this caused a cache miss on all cache items until the cache filled again.

AVG:0.061

The test shows a response time of between 0.012 and 0.51 and averaging 0.061 seconds even though we are running at most 10 seconds behind the changes made on the server. Can we do any better than that?

KeyLock

Maybe. A close look at the above “Virtual User Activity Chart” shows 4 long blue bars where the new change token on the server was detected by each of the 4 users simultaneously. High pressure items like this can really benefit from some locking to ensure only one of the 4 users makes the change token request and everyone else uses that response. That’s what “KeyLock” is for. So let’s try again:


AVG:0.049

Here we can see the 4 long concurrent bars that represent getting the new server token and a fresh piece of data are not present as only one of the clients made that request. The test shows a response time of between 0.012 and 0.51 and averaging 0.049 seconds while delivering content lagged by just 10 seconds from the server.

Conclusion

What we have built here is a sort of SharePoint Workspace for Web Services. And it’s not specific to SharePoint. The lesson from this is twofold:

  • Don’t panic if your server is now a long way away, there may be simple unobtrusive ways to cater for the latency introduced by remote services (or indeed otherwise under-performing services).
  • When designing services consider adding an operation to expose a server change token indicator (like SharePoint does) to help remote clients stay in sync.

WCF is all a bit 2010 now. SharePoint has gone REST (although I deliberately chose UserProfile as it is not adequately implemented in SharePoint REST services yet). The plug in model of WebAPI is actually simpler and provides the same power to influence both server and client side. So next stop we look at some WebAPI alternatives to do the same.

Source Code is here
