Dashboard Analytics Performance Enhancement

Overview: 

DSS is a configuration-based data aggregator service that executes ElasticSearch queries in the background to extract meaningful statistics from modular data.

Being a data aggregator, DSS is very read-heavy and does not perform any write operations. One possible way to improve performance is to introduce a caching mechanism that sits as a buffering layer between the application and the ElasticSearch data store.

Implementation details:  

As far as implementation is concerned, the plan is to use a cache shared across the DSS pods. That way, we can run multiple DSS pods to support a large number of concurrent requests while still taking full advantage of the caching mechanism in place.
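As a rough illustration of the cache-aside pattern this implies, the sketch below keys each query result by a hash of the query body and falls back to ElasticSearch on a miss. The `SharedCache` class here is a plain-dict stand-in for whatever cross-pod store is actually used (e.g. Redis), and `run_es_query` is a hypothetical query-execution function, not part of DSS itself.

```python
import hashlib
import json

class SharedCache:
    """Stand-in for a cross-pod shared store (e.g. Redis).
    A real deployment would replace this dict with network calls."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def put(self, key, value):
        self._store[key] = value

def cache_key(query):
    # Deterministic key: hash of the canonicalised query body, so the
    # same logical query always maps to the same cache entry.
    return hashlib.sha256(json.dumps(query, sort_keys=True).encode()).hexdigest()

def fetch_stats(cache, query, run_es_query):
    """Cache-aside read: serve from the shared cache, fall back to ES."""
    key = cache_key(query)
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = run_es_query(query)   # the expensive ElasticSearch round trip
    cache.put(key, result)
    return result
```

Because the key is derived only from the query body, any DSS pod that receives an identical query benefits from an entry written by another pod.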

Now, query results will keep changing as modular data is indexed into ElasticSearch, so we also need a cache update/eviction policy in place. Since there is no specific update API that could trigger cache invalidation, we decided that the best way to handle this is to define a TTL of 5-10 minutes for the data inside our cache. Because the TTL is small, this keeps the search response in sync with the index almost at all times.
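One way to realise this small-TTL policy is to stamp every cache entry with an expiry time and treat expired entries as misses, so stale results age out within the 5-10 minute window on their own. A minimal sketch under that assumption (the `TTLCache` name and the 300-second default are illustrative; a store like Redis would provide this natively via per-key expiry):

```python
import time

class TTLCache:
    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable clock, handy for testing
        self._store = {}            # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            # Entry outlived its TTL: evict lazily and report a miss,
            # forcing the caller to re-query ElasticSearch.
            del self._store[key]
            return None
        return value

    def put(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)
```

Lazy eviction on read keeps the sketch simple; a background sweeper is only needed if memory held by never-re-read keys becomes a concern.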

Observations and Results:

We traced dashboard analytics API requests using Jaeger and found that query execution contributed the most to latency. We first cached query-execution results at the DAO layer and ran the tests, but this did not improve performance much. Tracing the requests through Jaeger again, we realized that a lot of pre-processing goes into preparing “execution ready” requests (the base query with all the necessary filters). So we decided to cache results at the Service layer instead, which improved performance drastically. Below are the results.
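The difference between the two cache placements can be sketched as follows: a DAO-layer cache only skips the ElasticSearch round trip, while a service-layer cache keyed on the raw incoming request also skips the expensive request pre-processing. All function names here are illustrative, not DSS's actual API.

```python
def handle_dao_cached(raw_request, cache, run_es_query, prepare):
    # DAO-layer caching: the prepare() step runs on every request;
    # only the ES query itself is skipped on a cache hit.
    prepared = prepare(raw_request)
    key = repr(prepared)
    if key not in cache:
        cache[key] = run_es_query(prepared)
    return cache[key]

def handle_service_cached(raw_request, cache, run_es_query, prepare):
    # Service-layer caching: keyed on the raw request, so a hit
    # skips both the pre-processing and the ES query.
    key = repr(raw_request)
    if key not in cache:
        cache[key] = run_es_query(prepare(raw_request))
    return cache[key]
```

Counting calls to `prepare` under repeated identical requests makes the tracing observation concrete: the DAO-layer variant pays the pre-processing cost every time, the service-layer variant only once.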

  1. Dashboard-analytics pod without caching (develop branch build) - 

  • Number of concurrent users - 125

  • Duration of bombarding service with requests - 10 minutes

  • Number of zuul pods - 1

  • Number of dashboard analytics pods - 1

  2. Dashboard-analytics pod with caching - 

  • Number of concurrent users - 125

  • Duration of bombarding service with requests - 10 minutes

  • Number of zuul pods - 1

  • Number of dashboard analytics pods - 1

Awesome! Our theoretical assumption that caching should improve the performance of dashboard analytics was indeed correct. So, next, we increased the number of concurrent users to 150. In this case, zuul started crashing as it was not able to keep up with the pace at which dashboard analytics was serving requests. Even after scaling out zuul to 3 pods, we did not see much improvement in the number of requests served, so, going forward, we will explore how we can scale zuul.