Skip to content

Designing Database

  1. Spend more time understanding the concepts from MS Learn to understand the "why" behind the design decisions.
  2. Would suggest to do questions from youtube or sample questions to tweak your understanding.
  3. Understand the trade-offs of each design decision.
  4. Understand partitioning and reasons to partition.
  5. Understand the importance of denormalizing table.

Embedded vs Reference

In many cases, in Azure Cosmos DB, embedding data reduces the number of requests during reads and writes, which improves performance and reduces costs. This applies to scenarios where entities represented by the data model have a 1:1 relationship, a 1:few relationship, or their data is read or updated together. However, for a 1:many relationship especially unbounded ones, you should use data referencing to prevent reaching the maximum item size limit.

Consider reference if:

  1. Read or updated independently: This is especially true where combining entities that would result in large documents. Updates in Azure Cosmos DB require the entire item to be replaced. If a document has a few properties that are frequently updated alongside a large number of mostly static properties, it's much more efficient to split the document into two. One document then contains the smaller set of properties that are updated frequently. The other document contains the static, unchanging values.

  2. 1:Many relationship: This is especially true if the relationship is unbounded. If you have a document which increases in size an unknown or unlimited amount of times, the cost and latency for those updates will keep increasing. This is due to the increasing size of the update costing more RU/s and the payloads going over the network which itself, is also inefficient.

  3. Many:Many relationship: We'll explore an example of this relationship in a later unit with product tags. Separating these properties reduces throughput consumption for more efficiency. It also reduces latency for better performance.

Take Note: There are difference between 1:few and 1:many. We take 1:few like customer X customer address, as it's only a handful of customer address (5-10). If it's a lot more then it's best to use reference.

Schema Versioning

As NoSQL is schemaless, we need to version the schema to handle changes in the data model. Useful if queries are complex and need to handle different schema versions.

https://learn.microsoft.com/en-us/samples/azure-samples/cosmos-db-design-patterns/schema-versioning/

Just to note using user defined function (UDF). can help determine if new schema has new fields.

Modelling Cosmos DB for IoT

Example: You are designing an Azure Cosmos DB for NoSQL data model to be used for vehicle tracking.

Each vehicle sends several data points per minute that include the following properties:

  • licensePlate
  • date
  • time
  • longitude
  • latitude

You need to optimize the model to support concurrent writes, while also being able to identify the location of a vehicle at any given date and time.

Which two properties should you use for the partition key? Each correct answer presents part of the solution.

Feature Composite Key: /licensePlate/date Composite Key: /licensePlate/time
Write Optimization Excellent. Spreads continuous write load by changing the partition key only once per day per vehicle. Poor. Concentrates write load into rapidly changing, temporary hot partitions.
Query (Daily History) Excellent. Single-partition query. Fast and low RUs. Poor. Cross-partition query spanning potentially thousands of partitions. Slow and high RUs.
Partition Size Partition key scope allows 20GB of data. A single day's data for one vehicle is unlikely to exceed this limit. Data is rapidly distributed across many new logical partitions, but the churn is high.

Rule of thumb for partition key

  1. High Cardinality: Always choose a partition key with "high cardinality" (a large number of distinct values) to ensure data and requests are evenly distributed.
  2. Static over Uniqueness: Choose a property whose value won't change over time ("static"). A purely unique id per item is perfectly distributed, but forces you to do cross-partition queries when reading multiple items.

Why not "Unique"?

While using a Unique partition key (such as the item's id property) is an extreme form of high cardinality and would perfectly distribute the data, it violates the read-heavy pattern goal.

The workload supports reading product data and product category data. If you use a unique key, every common read query (e.g., "get all products in category X") would be a cross-partition query (fan-out), which is the most expensive and least performant type of query in Cosmos DB, drastically increasing RU consumption and maintenance costs.

A key like a categoryName or a synthetic key that groups related data for efficient read queries is usually preferred over a purely unique ID for this scenario.

Normalized data update

Use an Azure Function trigger for Azure Cosmos DB to use the change feed processor to update the referenced containers. This is good for a parent container updating referenced container; E.g. "Product Tags" container needs to update Products container if Product tags are embedded within Products.

If you embed references (like tags) inside items, you deliberately introduce data duplication to speed up reads. To ensure this duplicated data remains consistent, you must use the change feed to synchronize and propagate updates back and forth across containers.

This normalised update is explained in MS Learn Normalization Patterns.

Partitioning Strategy

See partitioning section.