Sitecore login performance slows down over time

On a recent project using Sitecore 6.6, we ran across a strange performance problem when logging visitors in to the site. As the day went along, the response time for logging in a visitor would slow down. Because Windows Authentication was required, this meant that the initial load of every page was slowing down for all users throughout the day. The initial login times were in the milliseconds range, but by mid-day some users were taking up to 8 full seconds just to complete a login call, on top of any page load time.

What was causing this?

The Scenario

Picture this: multiple load-balanced Sitecore content delivery nodes forcing Windows Authentication and connected to Active Directory. Any hit to the site triggers a login, so there is no anonymous access.  On session start, the login begins, and begins to slow down…

The Offending Line of Code

Using some .NET Stopwatch code in our Global.asax, we tracked down the offending line of code to the following:

bool success = Sitecore.Security.Authentication.AuthenticationManager.Login(username, false);

I’ve used this bit of code in plenty of projects, and I’ve never seen performance issues with it before. Not only that, but response times from this call were slowing down over time, not just being slow all the time, so we knew that it wasn’t an issue with connecting to the membership provider data store. In fact, an application restart of Sitecore was enough to bring performance back to normal.

This sounded fishy…
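
For anyone wanting to take the same kind of measurement, here is a minimal sketch of Stopwatch timing around a login call. It is not our actual Global.asax code: the LoginTimer class and the fake login delegate are made up for illustration, and the delegate simply sleeps so the example runs without Sitecore.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class LoginTimer
{
    // Runs the supplied login delegate and reports how long the call took.
    public static bool TimeLogin(Func<bool> login, out TimeSpan elapsed)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        bool success = login();
        stopwatch.Stop();
        elapsed = stopwatch.Elapsed;
        return success;
    }

    public static void Main()
    {
        // Hypothetical stand-in for the real login call; it just sleeps briefly.
        Func<bool> fakeLogin = () => { Thread.Sleep(50); return true; };

        TimeSpan elapsed;
        bool success = TimeLogin(fakeLogin, out elapsed);
        Console.WriteLine("Login success={0}, took {1} ms", success, elapsed.TotalMilliseconds);
    }
}
```

In a real site the delegate would wrap the AuthenticationManager.Login(username, false) call, and the elapsed time would go to the log rather than the console.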

Under the Hood

Using a decompiler, I took a look at the code to figure out what it was doing. This was what was in the Login code:

public static bool Login(string userName, bool persistent)
{
    AuthenticationManager.ClearSecurityCache(userName);
    return AuthenticationManager.Provider.Login(userName, persistent);
}

There is not much to the code: it just empties the cache and then executes the login. ClearSecurityCache was just a call out to the CacheManager class, so I took a look into the caching layers, and by digging down I found the following in the inner cache class (Sitecore.Caching.Cache):

public void Remove<TKey>(Predicate<TKey> predicate)
{
    Assert.IsNotNull((object) predicate, "predicate");
    List<TKey> list = new List<TKey>();
    foreach (TKey key in this.GetCacheKeys())
    {
        if (!object.Equals((object) key, (object) default(TKey)) && predicate(key))
            list.Add(key);
    }
    lock (this.SyncRoot)
    {
        foreach (TKey item_1 in list)
            this.Remove((object) item_1);
    }
}

The above code was being executed for three different caches: the User Profile cache, the Is In Role cache, and the Access Result Cache. Note the ‘lock’ on the SyncRoot object and the foreach over all keys. This seemed like a place where the code could slow down as the cache grew, and where users could end up queuing while the lock was held. However, while it seemed to be a logical cause for a performance degradation, it was not solid evidence. Some trial and error was needed.
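
To illustrate why a predicate-based removal gets more expensive as a cache grows, here is a toy sketch. It is not Sitecore's code: the ScanningCache class and the "userN|roleM" key format are invented for the example. It only mimics the pattern of scanning every key and then removing the matches under a lock, and it counts how many keys had to be examined.

```csharp
using System;
using System.Collections.Generic;

// Toy stand-in for a cache that removes entries by scanning every key.
public class ScanningCache
{
    private readonly Dictionary<string, object> entries = new Dictionary<string, object>();
    private readonly object syncRoot = new object();

    public int Count
    {
        get { lock (syncRoot) { return entries.Count; } }
    }

    public void Add(string key, object value)
    {
        lock (syncRoot) { entries[key] = value; }
    }

    // Removes every entry whose key matches the predicate.
    // Returns the number of keys that had to be examined to find them.
    public int Remove(Predicate<string> predicate)
    {
        List<string> snapshot;
        lock (syncRoot) { snapshot = new List<string>(entries.Keys); }

        int examined = 0;
        List<string> matches = new List<string>();
        foreach (string key in snapshot)
        {
            examined++;
            if (predicate(key))
                matches.Add(key);
        }

        lock (syncRoot)
        {
            foreach (string key in matches)
                entries.Remove(key);
        }
        return examined;
    }
}

public static class ScanningCacheDemo
{
    // Fills a cache with one entry per (user, role) pair, using a made-up key format.
    public static ScanningCache Build(int users, int rolesPerUser)
    {
        ScanningCache cache = new ScanningCache();
        for (int u = 0; u < users; u++)
            for (int r = 0; r < rolesPerUser; r++)
                cache.Add("user" + u + "|role" + r, true);
        return cache;
    }

    public static void Main()
    {
        foreach (int users in new[] { 10, 100, 1000 })
        {
            ScanningCache cache = Build(users, 100);
            int examined = cache.Remove(
                key => key.StartsWith("user0|", StringComparison.Ordinal));
            Console.WriteLine("{0} users: examined {1} keys to clear one user", users, examined);
        }
    }
}
```

Clearing a single user always removes the same number of entries here, but the number of keys examined is the total size of the cache, so the cost of each login grows with the number of distinct users and roles already cached.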

Proving the Theory

For some reason, performance testing was never reproducing this issue. The team tried creating some test harnesses to reproduce it in our own environment and could not. It was always fast, no matter what was done. Sitecore even helped us out and looked into the AD module our team was using to see if there was something happening there. No issues found. Were there any networking issues queuing up users going into production? The networking team ran some tests and could not find any latency problems. What was going on?  Why was it slowing down in production, but not anywhere else?

If the code is the same, and the same Active Directory data store is being used, and the same load of users is being simulated, and networking tests show no latency issues, the only thing we could think of was the users themselves.

Looking at the user accounts we had been testing with, they belonged to a modest number of roles (around 10). Nothing huge, fairly standard. However, some of the users in production had upwards of 100 roles in Active Directory. Add to this the usage of indirect membership, and each user was having calculations done for a large number of roles, many of which had no bearing on the system. Also, while the performance tests simulated the load, they were not simulating the number of distinct accounts.

To try to reproduce the issue, the performance test users were altered to increase the number of roles they belonged to. While the full impact of the performance issue couldn’t be reproduced since we still did not have the same number of distinct accounts, the login times did slow down as the performance tests ran. The conclusion was that the number of unique keys in the caches did indeed impact performance.

Fixing the Login Performance Issue

I was fairly confident that the locking mechanism, combined with the number of keys, was causing a queuing problem. As more users hit the site, the caches got larger and larger. As users with large numbers of roles hit the site, the time needed to process each user, clear their cache entries, and add new entries kept increasing. So we needed to make sure users could get in faster, and would not need to wait in line to clear their cache entries.

With a simple change, we bypassed the cache clearing code and went straight to the inner provider login method:

bool success = Sitecore.Security.Authentication.AuthenticationManager.Provider.Login(username, false);

This code is what is called by the Login method itself, after the security cache clear. Once we released this patch to production, login performance stopped degrading over time. Application recycles were no longer required, and login performance stayed static.

The Impact

Since the security cache is no longer being cleared on login, the cached entries will be kept until they are cycled out of memory or the application restarts at the next scheduled IIS application pool recycle. In the meantime, a returning user may be served from cached security data, so changes to their roles or profile may not be picked up right away. There are some events in Sitecore that would also cause this cache to be updated, but for the most part these don’t occur daily on a content delivery server.

This impact definitely needs to be considered before implementing this solution.

Will this happen to you?

There are very specific variables at play in our situation, and from what I can tell, in order for this to become an issue you would likely need the following scenario:

  1. Fully authenticated site, no anonymous access.
  2. Executing an AuthenticationManager.Login call at session start.
  3. Role membership in an external provider such as Active Directory.
  4. Large number of unique users all hitting the site daily.
  5. Users with large numbers of roles.
  6. Having the site as the default home page for all users so that any browser usage hits the site.

If all of these elements apply to you, then your Sitecore implementation may need to be investigated for this possible issue. Make sure your performance tests simulate this before launch!

Hopefully this was of some help to anybody else who has seen this degrading login performance. I know it was a first for me!

About Jason St-Cyr
Solution Architect with 15 years of experience in the software development field. Into ALM, integrations, software architecture, and stopping slapshots. Find me @Google+

7 Responses to Sitecore login performance slows down over time

  1. kiranpatils says:

    Hello Jason,

    Thank you for such a nice and in-depth article!

    What is your AccessResultCache size? Is it the same in local and production? What’s the average occupied size of it? This article may not apply to you, but it’s a good read: http://sitecorebasics.wordpress.com/2013/01/24/do-you-really-need-accessresultcache-on-cd-servers-if-no-then-disable-it-for-better-performance/

    Keep writing! Keep sharing!

    Sincerely,
    Kiran Patil

    • Jason St-Cyr says:

      Thanks for the great article link, Kiran. I do believe it is a similar issue. As I mentioned in the blog, we had a very large size for the Access Result Cache, similar to the issue from the article you linked. However, our CD servers require authentication for this project, so we could not solve this issue by disabling the cache.

      • kiranpatils says:

        Jason – Pleasure! On your CD servers, do you provide access to CE/PE? If yes, then item selection should be slower. And if not, the load time of a page should be slower. I would suggest reducing it to 50 MB or so and giving it a go! As far as I can recall, the Sitecore support guys know about this issue. I will pull out the details and share them with you.

    • Jason St-Cyr says:

      Kiran – I assume by CE/PE you are referring to Content Editor and Page Editor. No, those are not accessed on the CD servers. We had discussed with Sitecore support, and it seemed to be related to the number of keys in the hashtable, not necessarily the memory size. Tweaking memory did not help.

      Also, we are using the access result cache to cache the user security to improve performance, so by reducing the cache size we would actually introduce a greater performance issue. This was why our team went with the approach outlined in the article.

      • kiranpatils says:

        Yes Jason, yes! Okay, I found the ticket: it is Issue#377616. The hotfix implements special logic for dictionary-based indexing of the ItemCache and AccessResultCache caches. In a nutshell, it builds special dictionary-based indexes for the caches with a composite key. Sitecore then uses the index to remove cache entries for updated items/users. They have introduced a new configuration key for this, “Caching.CacheKeyIndexingEnabled”. Also, according to the release history, they have included it in SC 6.6 with a slight difference in name: http://sdn.sitecore.net/SDN5/Products/Sitecore%20V5/Sitecore%20CMS%206/ReleaseNotes/webConfig/660_130529.aspx

        Your solution looks fine to me as well. Just wanted to share our learnings with you!

  2. netzkern says:

    Thanks for sharing! We never experienced any problems like that with our sites, but it’s good to know what to look out for.

    • Jason St-Cyr says:

      Thanks for reading! There are very specific scenarios that trigger this sort of behaviour, so count yourself lucky not to have hit it yet. When this hit, we were all stumped, since none of us had ever seen this behaviour in any site done by our company in all the years and projects we’ve done.
