What is Cosmos DB
Azure Cosmos DB is Microsoft’s globally distributed, multi-model database service. It lets you elastically and independently scale throughput and storage across any number of Azure regions worldwide, with fast, single-digit-millisecond data access through your favorite API: SQL, MongoDB, Cassandra, Tables, or Gremlin. Cosmos DB provides comprehensive service level agreements (SLAs) for throughput, latency, availability, and consistency, something no other database service offers.
Our Use Case
We have an on-premises Java application which we are in the process of migrating to the cloud. There are three major things to consider:
- Syncing data from on-premises to Cosmos DB
- Migrating historical on-premises data to Cosmos DB
- Enabling traffic in the cloud
I am intentionally not going into the details here…
What we started with
We started with the SQL API flavor of Cosmos DB in mid-2019. We soon realized that the Azure Cosmos client library was not mature enough in handling aggregation functions like count(), pagination, etc.; for example, you have to loop through all the available pages in code to get to a specific page rather than retrieving that page directly from the DB. This made us look at other API flavors of Cosmos DB, and that is when we chose the Mongo API flavor.
Using Mongo API flavor of Cosmos DB
This API suited our use case perfectly. We used the Spring Data MongoDB library along with the plain vanilla Mongo client, which support aggregation functions like count() and pagination elegantly. We were very happy with the Mongo API flavor with regard to all our functional requirements. Soon we moved on to the next stages: performance testing, understanding DB latencies, and cost. I will discuss what we discovered during these phases in the next sections.
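The direct page access that made the Mongo API attractive boils down to simple skip/limit arithmetic. Here is a minimal sketch of that arithmetic (the helper names are hypothetical and standard library only; with the real driver these numbers would feed the skip() and limit() calls on a cursor):

```java
// Sketch of the page-to-offset arithmetic behind skip/limit pagination.
// PageMath and its methods are hypothetical helpers, not a Cosmos DB or Mongo API.
import java.util.List;

public class PageMath {
    // Offset of the first document on a given zero-based page.
    static int skipFor(int page, int pageSize) {
        return page * pageSize;
    }

    // Total number of pages needed for a given document count (from count()).
    static int totalPages(long count, int pageSize) {
        return (int) ((count + pageSize - 1) / pageSize);
    }

    // Slice an in-memory list the way skip/limit would slice a cursor.
    static <T> List<T> pageFor(List<T> docs, int page, int pageSize) {
        int from = Math.min(skipFor(page, pageSize), docs.size());
        int to = Math.min(from + pageSize, docs.size());
        return docs.subList(from, to);
    }
}
```

With the real Mongo client, the same numbers would parameterize something like collection.find().skip(skipFor(page, size)).limit(size), with totalPages derived from a count query, so a specific page is fetched directly instead of looping through all earlier pages.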
Cosmos DB limitations by design and their impact on the application
What is RU
Please read through the article at https://docs.microsoft.com/en-us/azure/cosmos-db/request-units to better understand request units.
Cosmos DB supports high availability and resiliency by allowing you to create instances across different regions. However, this comes at a cost. :) Let’s say you have “n” regions to achieve high availability and resiliency; when you configure “x” RU’s in one region, Cosmos DB internally replicates the same number of RU’s across all other regions. This means you get charged for x RU’s multiplied by n regions. You might argue that there is nothing wrong here. However, consider a real-world scenario where your application is geo-replicated to different regions, and so is Cosmos DB. Most of your customer base is in one region: you get 60% of your traffic from that region and the remaining 40% from another. Yet you pay for the same RU’s in both regions irrespective of usage.
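The billing math above is a straightforward multiplication; a small sketch makes it concrete (the helper is hypothetical, and the numbers are illustrative assumptions, not Azure’s actual rates):

```java
// Sketch of the "RU's multiplied by regions" billing math described above.
// RuBilling.billedRu(...) is a hypothetical helper for illustration only.
public class RuBilling {
    // Total RU's you pay for: provisioned RU's are replicated to every region.
    static long billedRu(long provisionedRuPerRegion, int regions) {
        return provisionedRuPerRegion * (long) regions;
    }
}
```

So 300K RU’s provisioned across two regions bills as 600K RU’s, even when one region serves only 40% of the traffic.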
There is a feature called autopilot/autoscale which will be GA somewhere around the July/August 2020 timeframe. This could be a game changer in terms of cost.
Growing is OK, but shrinking is an issue
There are two types of partitions in Cosmos DB:
- Logical partitions: This is what the application controls, and it depends entirely on your shard key. Each logical partition can grow up to 20 GB, so be careful when choosing your shard key. :)
- Physical partitions: Cosmos DB controls these. More than one logical partition can exist within a single physical partition. Each physical partition can grow up to 50 GB.
We had set a very high number of RU’s on some of our collections during the data migration process, i.e. 1 million RU’s. This internally created 100 physical partitions in Cosmos DB (10K RU’s per partition). After our data migration task was completed, we reduced the RU’s to match our normal load, i.e. 300K RU’s. This caused a lot of 429 errors (https://docs.microsoft.com/en-us/rest/api/cosmos-db/http-status-codes-for-cosmosdb), i.e. RU consumption higher than configured. This happens because we now have 100 partitions but only 300K RU’s, which translates to 3K RU’s per partition. Cosmos DB does not shrink/merge physical partitions when RU’s are reduced.
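The partition arithmetic behind those 429s can be written down explicitly (hypothetical helpers; the 10K RU-per-partition figure is the one from the scenario above):

```java
// Sketch of the physical-partition math described above.
// PartitionMath is a hypothetical helper for illustration only.
public class PartitionMath {
    static final int MAX_RU_PER_PARTITION = 10_000; // per-partition RU ceiling

    // Physical partitions created when provisioning peaks at maxProvisionedRu.
    static int partitionsFor(long maxProvisionedRu) {
        return (int) ((maxProvisionedRu + MAX_RU_PER_PARTITION - 1) / MAX_RU_PER_PARTITION);
    }

    // RU's each partition gets after RU's are lowered; partitions are never
    // merged, so the divisor stays at its historical peak.
    static long ruPerPartition(long currentRu, int partitions) {
        return currentRu / partitions;
    }
}
```

Plugging in our numbers: 1M RU’s creates partitionsFor(1_000_000) == 100 partitions, and dropping to 300K RU’s leaves ruPerPartition(300_000, 100) == 3_000 RU’s per partition, hence the throttling.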
RU limitation per partition
Each physical partition can have a maximum of 10K RU’s configured, and currently there is no way to increase this limit. This becomes a potential problem when your application sees a traffic surge of 5x or 10x your normal load: if each partition is using 5K RU’s on average, extrapolating to 5x or 10x load means each partition needs 25–50K RU’s, which is not possible. :(
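Extending the same arithmetic, here is a quick feasibility check for a traffic surge (hypothetical helper names; the 10K cap is the per-partition limit described above):

```java
// Sketch: can a traffic surge be absorbed under the 10K RU per-partition cap?
// SurgeCheck is a hypothetical helper for illustration only.
public class SurgeCheck {
    static final long MAX_RU_PER_PARTITION = 10_000; // hard per-partition limit

    // RU's one partition would need at the given surge multiplier.
    static long neededRuPerPartition(long avgRuPerPartition, int surgeFactor) {
        return avgRuPerPartition * surgeFactor;
    }

    // True if the surge fits under the per-partition RU cap.
    static boolean fitsUnderCap(long avgRuPerPartition, int surgeFactor) {
        return neededRuPerPartition(avgRuPerPartition, surgeFactor) <= MAX_RU_PER_PARTITION;
    }
}
```

At a 5K RU average, a 5x surge needs 25K RU’s per partition, well over the 10K cap, so no amount of extra provisioned throughput can absorb it at the partition level.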
Retries for transient errors
Specifically, the Mongo flavor of Cosmos DB does not have an elegant way of returning standard HTTP error codes like those in https://docs.microsoft.com/en-us/rest/api/cosmos-db/http-status-codes-for-cosmosdb, so the client needs to implement custom logic to catch and retry specific exceptions (in our case, in the Java application).
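A minimal sketch of what such custom retry logic can look like, using only the standard library. The error-code list and the CodedException type are assumptions standing in for the driver’s real exception type (e.g. MongoCommandException); error code 16500 is commonly reported by the Mongo API for throttling, but verify the codes against your driver before relying on them:

```java
// Sketch of custom retry-with-backoff logic for transient Cosmos DB
// (Mongo API) errors. The codes and exception type are assumptions.
import java.util.concurrent.Callable;

public class TransientRetry {
    // Is this error worth retrying? (throttling / time-limit style codes;
    // an assumed list, check your driver's documentation)
    static boolean isTransient(int errorCode) {
        return errorCode == 16500 || errorCode == 50;
    }

    static <T> T withRetries(Callable<T> op, int maxAttempts, long baseDelayMs)
            throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (CodedException e) {
                if (attempt >= maxAttempts || !isTransient(e.code)) {
                    throw e; // exhausted retries, or not a transient error
                }
                // Exponential backoff: base, 2*base, 4*base, ...
                Thread.sleep(baseDelayMs << (attempt - 1));
            }
        }
    }

    // Hypothetical stand-in for a driver exception carrying an error code.
    static class CodedException extends Exception {
        final int code;
        CodedException(int code) { this.code = code; }
    }
}
```

In a real application the catch clause would match the driver’s exception type and extract its error code, and the backoff would honor any retry-after hint the service returns.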
Choosing the right consistency level
We initially started with the “Session” consistency level, as this is the default and recommended for most use cases. However, during our implementation phase we realized that, due to certain limitations of the Mongo flavor, the effective consistency we get is “Eventual”, which is the weakest of all. We migrated to “Bounded Staleness”, the strongest consistency available for a multi-region Cosmos DB, but it is 1x to 2x costlier than “Session” in terms of RU consumption.
See https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels for more details.
API does matter
Please make sure to review all your use cases beforehand to decide which API flavor of Cosmos DB to choose. In our case, we went ahead with some assumptions for the SQL API flavor, but then had to do rework to retrofit the Mongo API flavor. One observation here: since the SQL API is the Cosmos-native API and the default choice, you will see features arrive more rapidly for it than for the other API flavors.
Cosmos DB is a very good choice as a NoSQL DB. However, I encourage my fellow engineers to review these aspects and decide on a strategy accordingly when working in the Azure cloud environment.
Disclaimer: some of these details/limitations might become outdated, as Azure is continuously enhancing its Cosmos DB product.