Partitioning (Sharding)
In the first part of this series, we understood Replication in databases. Now, let's discuss yet another aspect of databases: "Partitioning".
Partitioning essentially means breaking up data into smaller chunks, called partitions or shards. Partitioning and replication go hand in hand. We use partitions or shards to achieve scalability and to spread the data and the query load evenly across multiple nodes. By partitioning, a large dataset can be distributed across many disks and the query load across many processors, increasing the efficiency and reliability of our distributed system.
Let’s understand the principle of partitioning.
In the above picture, we have four nodes (replicas) of a database. We partition our dataset into four partitions and place one partition on each node (shown in purple). But what if we lose the data of one partition? To address this, we again follow the leader-follower mechanism, this time for partitions: each partition has a leader on one node, and follower copies of it are placed on other nodes (replicas).
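For concreteness, here is a minimal sketch of this placement in Python (the node and partition names are made up, and the follower placement is simply randomized for illustration): each partition gets one leader node and two follower nodes.

```python
import random

nodes = ["node1", "node2", "node3", "node4"]
partitions = ["p1", "p2", "p3", "p4"]

placement = {}
for i, partition in enumerate(partitions):
    leader_node = nodes[i % len(nodes)]                      # one leader partition per node
    follower_nodes = random.sample([n for n in nodes if n != leader_node], 2)
    placement[partition] = {"leader": leader_node, "followers": follower_nodes}

for partition, nodes_for_partition in placement.items():
    print(partition, nodes_for_partition)
```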
It is very important that data be distributed evenly across nodes. Uneven distribution, or skew, makes partitioning ineffective. Suppose there are 5 partitions and all the data and queries end up in only one of them, making the load on that partition disproportionately high. A partition with such disproportionately high load is known as a hotspot.
WAYS OF PARTITIONING :
The simplest way of avoiding hotspots is to assign records randomly to nodes. This would distribute the data evenly, but when you try to read a particular item, you have no way of knowing which node it is on, so you end up querying every node. We can make lookups more efficient by using a key-value data model, just like a dictionary. Keeping these scenarios in mind, let's see how we can partition the data efficiently.
1. PARTITIONING by Key Range:
Here, we assign a continuous range of keys to each partition. So, if you know the boundaries between the ranges, you can easily determine which partition holds a given key.
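As a small illustration, here is a minimal sketch of key-range partitioning, assuming the boundary keys are known in advance (the boundaries and node names below are made up):

```python
import bisect

# Partition i holds keys in the range [boundaries[i], boundaries[i + 1])
boundaries = ["a", "g", "n", "t"]                 # 4 partitions: a-f, g-m, n-s, t-z
partition_nodes = ["node1", "node2", "node3", "node4"]

def partition_for(key: str) -> str:
    index = bisect.bisect_right(boundaries, key) - 1
    return partition_nodes[index]

print(partition_for("cat"))      # node1
print(partition_for("query"))    # node3
```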
ADVANTAGE :
- Retrieval of data becomes efficient as we can directly query the node with the given key range.
- With each partition, we can keep keys in sorted order, thus making scans easier and faster.
- Keys can be treated as a concatenated index to fetch several related records in one query. For example, we can build the key as (year-month-day-hour-minute-second) and fetch any range of records (based on timestamp) very easily.
DISADVANTAGE :
- Ranges may be unevenly spaced, causing data to be skewed. For example, some ranges may be more populated than others.
- Certain access patterns can create hotspots.
Imagine you partition the data using timestamps as keys, and requests in a certain time range are far more frequent than in others, so all the recent writes hit one partition. To avoid this, we can use a concatenated key and make the first element of the key something other than the timestamp (for example, a sensor or user name), which makes the distribution less skewed.
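A minimal sketch of that trick, assuming a sensor name is available to prefix the key with (the names here are purely illustrative):

```python
# Prefixing the timestamp with the sensor name spreads writes that happen at
# the same time across different key ranges, while still allowing a range
# scan over time for one particular sensor.
def make_key(sensor_name: str, timestamp: str) -> str:
    return f"{sensor_name}|{timestamp}"

print(make_key("sensor-42", "2024-05-01T10:15:00"))   # sensor-42|2024-05-01T10:15:00
```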
Example databases — RethinkDB, MongoDB (before v2.4), BigTable
2. PARTITIONING by Hash of Key:
The risk of skew and hotspots is a big problem in partitioning, which is why many distributed data stores use a hash function to determine the partition of a key. A good hash function takes skewed data and makes it uniformly distributed. For partitioning, the hash function does not need to be cryptographically strong. Many programming languages have built-in hash functions, but they may not be suitable for partitioning, because the same key may hash to a different value in a different process.
The advantage of this type of partitioning is that it reduces skew and hotspots. The major drawback is that we lose the efficient range queries of key-range partitioning: keys that were once adjacent are now scattered across partitions based on their hash values, so any range query has to be sent to all partitions.
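A minimal sketch of hash partitioning; MD5 is used here only because it gives the same result in every process, unlike Python's built-in hash(), and real systems typically use faster non-cryptographic hashes.

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # a stable hash of the key, reduced to a partition number
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

print(partition_for("user-123"))
print(partition_for("user-124"))   # adjacent keys usually end up in different partitions
```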
CONCATENATED INDEX APPROACH
Cassandra achieves a compromise between the two partitioning techniques. A table in Cassandra can be declared with a compound primary key (a key formed by concatenating multiple columns). Only the first part of that key is hashed to determine the partition; the other columns are used as a concatenated index for sorting the data in Cassandra's SSTables. So, if we specify a fixed value for the hashed part, we can efficiently perform range queries over the remaining columns of the key.
For example, on a social media platform, you can create the key as "userId_timestampOfNewUpdate" and hash only the userId part. Now, whenever you want to find the updates of a particular user within a time frame, you can query that user's partition efficiently and scan the updates in timestamp order.
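Here is a minimal sketch of that idea in Python (illustrative data structures, not Cassandra's actual storage engine): the userId picks the partition, and within the partition each user's updates stay sorted by timestamp, so a per-user time-range read touches only one partition.

```python
import bisect
import hashlib

NUM_PARTITIONS = 4
# each partition maps user_id -> a list of (timestamp, update) kept in sorted order
partitions = [dict() for _ in range(NUM_PARTITIONS)]

def partition_for(user_id: str) -> int:
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_PARTITIONS

def write_update(user_id: str, timestamp: str, update: str) -> None:
    rows = partitions[partition_for(user_id)].setdefault(user_id, [])
    bisect.insort(rows, (timestamp, update))

def updates_between(user_id: str, start: str, end: str) -> list:
    rows = partitions[partition_for(user_id)].get(user_id, [])
    return [update for ts, update in rows if start <= ts <= end]

write_update("alice", "2024-05-01T09:00", "first post")
write_update("alice", "2024-05-01T12:00", "second post")
print(updates_between("alice", "2024-05-01T08:00", "2024-05-01T10:00"))  # ['first post']
```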
PARTITIONING SECONDARY INDEXES
Till now, we have discussed partitioning of a key-value data model: if we can determine the partition of a key, we can route read and write requests to it. But things become more complicated when secondary indexes come into the picture.
Secondary indexes don't usually identify a record uniquely, but rather are a way of searching for occurrences of a particular value.
The problem with secondary indexes is that they don't map neatly to partitions. Let's see how we can partition the data efficiently when secondary indexes are involved.
- By Document: Suppose you want to build an index for searching cars by color. Each partition maintains its own secondary index, covering only the documents stored in that partition, which is why this is also known as a "local index". Thus, if you want to find all the red cars, you have to query every partition and gather the results, a process known as "scatter-gather". The queries are usually sent in parallel, but the approach is still prone to "tail-latency amplification". Cassandra, Elasticsearch, and VoltDB use this approach.
- By Term: As opposed to the local index, this is known as a "global index", because we maintain a single index that covers the data in all partitions. However, we can't store that index on one node, since it would become a bottleneck, so the "global secondary index" is itself partitioned. For example, colors starting with a-m could go to one index partition and n-z to another. This kind of index is called "term-partitioned", because the term we are looking for determines the partition of the index. Term-partitioning makes reads very efficient, as there is no scatter-gather, but on the flip side, writes become slower and more complicated, because a write to a single document may now affect multiple partitions of the index. Riak's search feature and Oracle's data warehouse use global term-partitioned indexes. A small sketch contrasting both approaches follows this list.
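Here is a minimal sketch with toy data (everything below is illustrative): the document-partitioned (local) index has to scatter a read to every partition, while the term-partitioned (global) index serves it from a single index partition.

```python
# Each data partition indexes only its own cars ("local" / by-document index).
partitions = [
    {"cars": {1: "red", 2: "blue"}, "local_index": {"red": [1], "blue": [2]}},
    {"cars": {3: "red", 4: "green"}, "local_index": {"red": [3], "green": [4]}},
]

def query_by_document(colour: str) -> list:
    results = []
    for partition in partitions:                       # scatter to every partition...
        results.extend(partition["local_index"].get(colour, []))
    return results                                     # ...and gather the answers

# The global index is itself partitioned by term: colours a-m on index
# partition 0, n-z on index partition 1.
global_index = [{"blue": [2], "green": [4]}, {"red": [1, 3]}]

def query_by_term(colour: str) -> list:
    index_partition = 0 if colour[0] <= "m" else 1
    return global_index[index_partition].get(colour, [])

print(query_by_document("red"))   # [1, 3], after asking every partition
print(query_by_term("red"))       # [1, 3], from a single index partition
```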
REBALANCING PARTITIONS
The process of moving load from one node in the cluster to another node is called rebalancing.
We might need to rebalance partitions because :
- A node fails, and its partitions need to be taken over by the remaining nodes.
- You want to add CPUs to handle the load caused by increased query throughput.
- The dataset size increases, so you want to add more disks and RAM to store it.
Rebalancing must meet a few requirements
- After rebalancing, the load should be shared fairly between the nodes in a cluster.
- While rebalancing is happening, the database must keep on accepting reads and writes.
- No more data than necessary should be moved between nodes, in order to make rebalancing fast and minimize the network and disk I/O load.
The third point is crucial, because moving data between nodes is expensive. For example, suppose we assign keys to nodes using (hash(key) % total nodes). If hash(key) is 123456, then with total nodes = 10 the key goes to node 6, with total nodes = 11 it goes to node 3, and so on. So even adding a single node forces most keys to be transferred, which causes a lot of unnecessary network and disk I/O.
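A quick sketch to make the cost concrete (toy key set, MD5 used only as a stable hash):

```python
import hashlib

def node_for(key: str, total_nodes: int) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % total_nodes

keys = [f"key-{i}" for i in range(10_000)]
moved = sum(1 for key in keys if node_for(key, 10) != node_for(key, 11))
print(f"{moved / len(keys):.0%} of keys move when going from 10 to 11 nodes")
# Typically close to 91%: almost the whole dataset gets reshuffled for one extra node.
```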
Let’s see then, how can we efficiently rebalance :
1. Fixed Number of Partitions :
- In this approach, we create many more partitions than there are nodes. Let's say we have a database cluster with 20 nodes; we could make 2,000 partitions of the data so that each node holds around 100 partitions. Whenever a new node is added, it just takes over some partitions from the existing nodes. The thing to keep in mind is that the number of partitions doesn't change, and neither does the assignment of keys to partitions; only whole partitions move between nodes (a small sketch follows this list). This approach is used in Riak, Elasticsearch, Couchbase, and Voldemort.
- You must be wondering what the right number of partitions is. There is no hard and fast rule, but the partitions should be neither too large (otherwise rebalancing and recovery from node failures become expensive) nor too small (since each partition carries a fixed management overhead, and many tiny partitions add up to a lot of overhead). In reality, it's difficult to hit the sweet spot when the number of partitions is fixed but the size of the dataset keeps changing.
- With a fixed number of partitions, the size of each partition is proportional to the size of the dataset.
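A minimal sketch of the idea with toy numbers (20 partitions and 4 nodes instead of 2,000 and 20): when a fifth node joins, it only steals whole partitions from nodes that hold more than their fair share, and the key-to-partition mapping never changes.

```python
NUM_PARTITIONS = 20

# initial state: 4 nodes with 5 partitions each
assignment = {p: f"n{p % 4 + 1}" for p in range(NUM_PARTITIONS)}

def add_node(assignment: dict, new_node: str) -> list:
    nodes = set(assignment.values()) | {new_node}
    fair_share = NUM_PARTITIONS // len(nodes)
    moved = []
    for partition, node in sorted(assignment.items()):
        if len(moved) >= fair_share:
            break
        # steal from any node that still holds more than its fair share
        if sum(1 for n in assignment.values() if n == node) > fair_share:
            assignment[partition] = new_node
            moved.append(partition)
    return moved

moved = add_node(assignment, "n5")
print(f"partitions moved to the new node: {len(moved)} of {NUM_PARTITIONS}")  # 4 of 20
```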
2. Dynamic Partitioning :
- As the name suggests, dynamic partitioning dynamically varies the number of partitions available.
- Here, the number of partitions is proportional to the size of the dataset.
- In HBase and RethinkDB, which are key-range partitioned, partitions are created dynamically. When a partition grows beyond a pre-configured size, it splits into two, and one of the halves may be transferred to a different node if the current node cannot handle the extra load. Conversely, if lots of data is deleted and a partition shrinks, it can be merged with an adjacent partition.
- The main advantage of this approach is that the number of partitions adapts to the data volume. The only caveat is that an empty database starts with a single partition, so all writes hit one node until the first split happens. Databases like MongoDB and HBase have an option of pre-splitting, which allows a database to be instantiated with multiple partitions from the start. A small sketch of size-based splitting follows.
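A minimal sketch of size-based splitting (a toy threshold stands in for the pre-configured partition size):

```python
import bisect

SPLIT_THRESHOLD = 4

# boundaries[i] is the lowest key partition i can hold; an empty database
# starts with a single partition covering the whole key range.
boundaries = [""]
partitions = [[]]

def insert(key: str) -> None:
    i = bisect.bisect_right(boundaries, key) - 1
    bisect.insort(partitions[i], key)
    if len(partitions[i]) > SPLIT_THRESHOLD:           # too big: split at the median key
        mid = len(partitions[i]) // 2
        boundaries.insert(i + 1, partitions[i][mid])
        partitions.insert(i + 1, partitions[i][mid:])
        partitions[i] = partitions[i][:mid]

for key in ["ant", "bee", "cat", "dog", "eel", "fox"]:
    insert(key)
print(boundaries)    # ['', 'cat'], a second partition appeared after the split
print(partitions)    # [['ant', 'bee'], ['cat', 'dog', 'eel', 'fox']]
```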
AUTOMATIC vs MANUAL REBALANCING
We have seen the ways of rebalancing, but should it be done manually or automatically? Rebalancing is an expensive operation, as it involves rerouting requests and moving large amounts of data from one node to another. So there is always a tension between fully automatic and fully manual rebalancing. Automatic rebalancing may seem convenient, because there is less operational work for an administrator, but it can behave unpredictably, for example by kicking in while the cluster is already overloaded. That's why many databases use a combination of both: the system proposes a rebalanced partition assignment automatically, but an administrator has to commit it before it takes effect.
REQUEST ROUTING
You must have wondered: while all this rebalancing, splitting, and partitioning goes on in the background, how does a client know which node to send its request to? This brings us to the concluding part of our article, "request routing". To connect to a node, a client needs its IP address and port, and as partitions move between nodes, requests must keep getting routed to the right node.
At a high level, any one of the following might be happening :
- Allow the client to connect to any node (for example, via a round-robin load balancer). If that node happens to own the partition the request applies to, it handles the request directly; otherwise, it forwards the request to the appropriate node, receives the response, and passes it back to the client. Here, the contacted node acts as an intermediary.
- Using an intermediary “routing-tier”, which routes the request to the appropriate node. It acts as a partition-aware load balancer.
- The client is aware of the partitioning and of the assignment of partitions to nodes, so it can connect directly to the appropriate node without any intermediary.
But in all these cases, one question remains unanswered: who keeps the authoritative record of which partition is assigned to which node?
- Many distributed systems rely on a separate coordination service, such as ZooKeeper, which keeps track of the cluster metadata. Each node registers itself in ZooKeeper, and ZooKeeper maintains the mapping of partitions to nodes. Other actors, such as the routing tier, can subscribe to this information in ZooKeeper and route each request to the right node.
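For illustration, here is a minimal sketch of a partition-aware routing tier; the metadata mapping is hypothetical and stands in for what a coordination service such as ZooKeeper would publish (this is not ZooKeeper's actual API).

```python
import hashlib

NUM_PARTITIONS = 8

# partition -> node, as published by the coordination service
cluster_metadata = {p: f"node{p % 3 + 1}" for p in range(NUM_PARTITIONS)}

def route(key: str) -> str:
    partition = int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS
    return cluster_metadata[partition]   # the routing tier forwards the request here

print(route("user-42"))
```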
This was all about partitioning in databases and I hope you like it. Do comment and follow.