DynamoDB
Paper: https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
Dynamo is a distributed key-value store, i.e. a Distributed Hash Table (DHT). DynamoDB is the AWS database product based on Dynamo.
highly available (AP model, whereas Bigtable is CP and focuses on consistency)
highly scalable
schema-less
Why not
if the application requires strong consistency
Data distribution: consistent hashing (sketched below)
Replication: optimistic, eventual consistency
Handling failure: sloppy quorum (hinted handoff)
Inter-node communication: gossip protocol
Conflict resolution: vector clock
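A minimal consistent-hashing sketch in Python (the `HashRing` class, MD5 hashing, and 8 vnodes per node are illustrative assumptions, not Dynamo's actual implementation): keys and nodes are hashed onto the same ring, and a key's preference list is the first N distinct nodes found walking clockwise.

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring with virtual nodes (vnodes)."""

    def __init__(self, nodes, vnodes=8):
        self._ring = []  # sorted list of (token, node)
        for node in nodes:
            for i in range(vnodes):
                bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def preference_list(self, key: str, n: int = 3):
        """Walk clockwise from the key's position and collect n distinct nodes."""
        start = bisect.bisect(self._ring, (self._hash(key), ""))
        nodes, i = [], 0
        while len(nodes) < n and i < len(self._ring):
            _, node = self._ring[(start + i) % len(self._ring)]
            if node not in nodes:
                nodes.append(node)
            i += 1
        return nodes

ring = HashRing(["s1", "s2", "s3", "s4"])
print(ring.preference_list("user:42"))  # e.g. ['s3', 's1', 's4']
```

Because only the tokens adjacent to a joining or leaving node change owners, most keys stay put when membership changes.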
Dynamo is a key-value store; the value is a set of attributes.
Partition key: input to the hash function (MD5) -> partition (physical storage)
Partition key + sort key
composite primary key
items with the same partition key value are stored together, in sorted order by sort key value.
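A rough in-memory sketch of the composite-key idea (the `partition_of` helper and the 4-partition layout are made up for illustration, not the real storage engine): the partition key is hashed to pick the partition, and items sharing a partition key stay ordered by sort key, so a sort-key range query is served from a single partition.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4
# partition id -> { (partition_key, sort_key): item }
partitions = defaultdict(dict)

def partition_of(partition_key: str) -> int:
    # hash the partition key (MD5) and map the digest onto a partition
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16) % NUM_PARTITIONS

def put(partition_key: str, sort_key: str, item: dict) -> None:
    partitions[partition_of(partition_key)][(partition_key, sort_key)] = item

def query(partition_key: str, lo: str, hi: str) -> list:
    """All items with this partition key whose sort key is in [lo, hi], in sort-key order."""
    rows = partitions[partition_of(partition_key)]
    keys = sorted(k for k in rows if k[0] == partition_key and lo <= k[1] <= hi)
    return [rows[k] for k in keys]

put("user#1", "2021-01-05", {"event": "login"})
put("user#1", "2021-03-10", {"event": "purchase"})
put("user#2", "2021-02-01", {"event": "login"})
print(query("user#1", "2021-01-01", "2021-12-31"))  # user#1's events, in date order
```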
Secondary index
global
local
Query pattern
Range query is not well supported, unlike Bigtable scan. E.g. select * from A where age > 20
Since data is distributed across many nodes, this is basically a full scan of the cluster.
Solutions (either way, we may still end up doing a full scan)
use range key (aka sort key)
create global secondary index (reference); see the sketch after this list
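A simplified sketch of the global-secondary-index idea (the `age_index` structure, integer ages, and helper names are hypothetical): the index is a second lookup keyed and sorted by a different attribute, maintained alongside the base table, so a query on that attribute can binary-search the index instead of scanning every partition.

```python
import bisect

base_table = {}   # user_id -> item
age_index = []    # sorted list of (age, user_id) pairs: the "GSI"

def put(user_id: str, item: dict) -> None:
    base_table[user_id] = item
    bisect.insort(age_index, (item["age"], user_id))

def query_age_greater_than(min_age: int) -> list:
    # binary-search to the first entry with age > min_age (ages assumed to be integers)
    start = bisect.bisect_left(age_index, (min_age + 1,))
    return [base_table[user_id] for _, user_id in age_index[start:]]

put("u1", {"age": 25, "name": "Ada"})
put("u2", {"age": 18, "name": "Bob"})
print(query_age_greater_than(20))   # [{'age': 25, 'name': 'Ada'}]
```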
Local persistent storage
These (key, value) pairs are stored within that node using various storage systems depending on application needs. A few examples of such storage systems are:
BerkeleyDB Transactional Data Store
MySQL (for large objects)
An in-memory buffer (for best performance) backed by persistent storage
The coordinator node stores the data first, then asynchronously replicates it to the next N-1 nodes in the background. Replicas are not guaranteed to be identical at all times.
If a client cannot contact the coordinator node, it sends the request to a node holding a replica.
Add nodes, aka. bootstrapping.
Define how many VNodes the new machine is responsible for.
The allocation algorithm on the new node picks random VNodes (represented as tokens).
The new node asks the current replicas of those tokens to stream the data over.
Remove nodes.
Send command to remove a node.
The leaving node reassigns its tokens to other nodes and replicates its data to them; see the sketch below.
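A toy sketch of both flows (the token count, node names, and the `stream` helper are all made up): a joining node claims a few random tokens and streams their data from the current owners; a leaving node hands its tokens and data to the remaining nodes.

```python
import random

TOKENS = list(range(16))                       # 16 vnodes / tokens in the ring
owners = {t: f"s{t % 4 + 1}" for t in TOKENS}  # current token -> node assignment

def stream(src: str, dst: str, token: int) -> None:
    # stand-in for copying that token's data between nodes
    print(f"streaming token {token}: {src} -> {dst}")

def add_node(new_node: str, vnode_count: int = 4) -> None:
    """Bootstrapping: take over random tokens, pull their data, then flip ownership."""
    for token in random.sample(TOKENS, vnode_count):
        stream(src=owners[token], dst=new_node, token=token)
        owners[token] = new_node

def remove_node(node: str) -> None:
    """Decommissioning: push each owned token's data to another node, then reassign."""
    for token, owner in owners.items():
        if owner == node:
            new_owner = random.choice([n for n in set(owners.values()) if n != node])
            stream(src=node, dst=new_owner, token=token)
            owners[token] = new_owner

add_node("s5")
remove_node("s1")
```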
Preference list
The list of nodes responsible for storing a particular key. E.g. for key "K", the preference list is [server 1, 2, 3, 4], where server 4 stores the hinted replica.
Sloppy quorum ensures "always writable"
When a node is unreachable, another node can accept writes on its behalf. The write is then kept in a local buffer and sent out once the destination node is reachable again.
Drawbacks: data conflict
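A minimal hinted-handoff sketch (node names, `N`, and the helpers are illustrative assumptions): when one of the first N replicas is down, the write lands on the next healthy node in the preference list together with a hint, and a background job replays the hint once the intended node is reachable again.

```python
from collections import defaultdict

N = 3
preference_list = ["s1", "s2", "s3", "s4"]   # s4 is the hinted-replica slot
node_data = defaultdict(dict)                # node -> {key: value}
hints = defaultdict(list)                    # substitute node -> [(intended node, key, value)]

def put(key, value, is_up):
    written = 0
    for node in preference_list:
        if written == N:
            break
        if is_up(node):
            node_data[node][key] = value
            if preference_list.index(node) >= N:
                # this node only holds the data on behalf of a downed replica,
                # so remember which node it really belongs to
                intended = next(n for n in preference_list[:N] if not is_up(n))
                hints[node].append((intended, key, value))
            written += 1

def handoff(is_up):
    """Periodically replay hints whose intended owner is reachable again."""
    for node, pending in hints.items():
        for intended, key, value in list(pending):
            if is_up(intended):
                node_data[intended][key] = value
                pending.remove((intended, key, value))

put("K", "v1", is_up=lambda n: n != "s2")   # s2 is down, s4 takes the hinted write
handoff(is_up=lambda n: True)               # s2 recovers, the hint is delivered
print(node_data["s2"])                      # {'K': 'v1'}
```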
Clock skew
In a distributed system, we cannot assume that wall-clock time t on node A happened before time t + 1 on node B.
Hardware solution: GPS unit, atomic clock.
Dynamo returns conflicting versions to the client, which is responsible for explicitly reconciling the conflict.
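A small vector-clock sketch (the dict-of-counters representation is a simplification of the paper's (node, counter) pairs): a version that is greater-or-equal on every counter supersedes the other; otherwise the two versions are concurrent and both are returned to the client.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a new clock with this node's counter bumped."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a: dict, b: dict) -> bool:
    """True if version a is a descendant of (or equal to) version b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a: dict, b: dict) -> bool:
    return not descends(a, b) and not descends(b, a)

v1 = increment({}, "sx")     # write handled by sx            -> {sx: 1}
v2 = increment(v1, "sy")     # update via sy                  -> {sx: 1, sy: 1}
v3 = increment(v1, "sz")     # concurrent update via sz       -> {sx: 1, sz: 1}
print(descends(v2, v1))      # True: v2 supersedes v1
print(concurrent(v2, v3))    # True: Dynamo returns both to the client
```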
Alternative way to handle conflict
Model data as Conflict-free replicated data types (CRDTs). E.g. adding item A, B to shopping cart can be handled anywhere anytime, because the end result is 2 items in the cart anyway.
AKA strong eventual consistency
Downsides: not easy to model every kind of data as a CRDT
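A tiny CRDT sketch for the shopping-cart example above, modeled as a grow-only set (removing items would need a richer type such as an OR-Set, which this sketch ignores):

```python
def merge(replica_a: set, replica_b: set) -> set:
    # Set union is commutative, associative, and idempotent, so replicas can
    # apply adds in any order and still converge on the same cart.
    return replica_a | replica_b

cart_on_node1 = {"A"}   # client added item A while talking to node 1
cart_on_node2 = {"B"}   # concurrently added item B while talking to node 2
print(merge(cart_on_node1, cart_on_node2))  # {'A', 'B'} on both replicas after sync
```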
Last-write-wins (LWW).
Downsides
the wall clock doesn't guarantee a real "last" write.
can easily lose data, e.g. of 2 concurrent writes, 1 of them is thrown away.
Choosing coordinator node
Via load balancer
Pros: system is loosely coupled, helps scalability
Cons: extra hop increases latency, wastes resources
Dynamo uses a client library that maintains server addresses
Pros: zero-hop DHT, low latency
Cons: less control of load distribution
Quorum
R (read quorum) + W (write quorum) > N (replicas)
A common (N, R, W) configuration used by Dynamo is (3, 2, 2).
(3, 3, 1): fast W, slow R, not very durable
(3, 1, 3): fast R, slow W, durable
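A quick check of the quorum rule for the configurations above (plain Python, just the arithmetic): R + W > N means every read quorum overlaps every write quorum, so a read contacts at least one replica that saw the latest write.

```python
def overlaps(n: int, r: int, w: int) -> bool:
    """R + W > N guarantees a read quorum intersects the latest write quorum."""
    return r + w > n

for n, r, w in [(3, 2, 2), (3, 3, 1), (3, 1, 3)]:
    print((n, r, w), "read sees latest write:", overlaps(n, r, w))
# (3, 2, 2) True  -- Dynamo's common default
# (3, 3, 1) True  -- fast writes, slow reads, a single-ack write is not very durable
# (3, 1, 3) True  -- fast reads, slow writes, durable
```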
Write(put) process
coordinator node generates a vector clock
saves the data locally
sends write requests to the next N-1 nodes from the preference list
returns success after receiving W-1 confirmations (see the sketch after the read process)
Read(get) process
coordinator node requests the data from the next N-1 nodes from the preference list
waits for R-1 replies
handles data versions (then writes the reconciled version back to the nodes, aka read repair)
returns all relevant data versions (which may conflict) to the caller
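A coordinator-side sketch of the put/get flows above, using (N, R, W) = (3, 2, 2) (the `Coordinator`/`Replica` classes and in-process calls stand in for real RPCs; failure handling and conflict pruning are omitted):

```python
N, R, W = 3, 2, 2

class Replica:
    """Stand-in for a remote replica node; these calls would be RPCs in reality."""
    def __init__(self):
        self.store = {}
    def write(self, key, clock, value) -> bool:
        self.store[key] = (clock, value)
        return True                      # ack
    def read(self, key):
        return self.store.get(key)

class Coordinator:
    def __init__(self, node_id, replicas):
        self.node_id = node_id
        self.replicas = replicas         # the other N-1 nodes on the preference list
        self.store = {}                  # local storage: key -> (vector_clock, value)

    def put(self, key, value, context_clock) -> bool:
        clock = dict(context_clock)
        clock[self.node_id] = clock.get(self.node_id, 0) + 1   # generate vector clock
        self.store[key] = (clock, value)                       # save locally
        acks = sum(r.write(key, clock, value) for r in self.replicas)
        return acks >= W - 1             # success after W-1 confirmations

    def get(self, key):
        versions, replies = [self.store.get(key)], 0
        for r in self.replicas:
            versions.append(r.read(key))
            replies += 1
            if replies >= R - 1:         # wait for R-1 replies
                break
        # a fuller version would drop causally dominated versions via vector clocks
        # and write the winner back (read repair); conflicts go back to the caller
        return [v for v in versions if v is not None]

coord = Coordinator("s1", [Replica(), Replica()])
print(coord.put("K", "v1", context_clock={}))   # True once W-1 replicas ack
print(coord.get("K"))                           # [({'s1': 1}, 'v1'), ({'s1': 1}, 'v1')]
```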
State machine
Each client request results in creating a state machine on the node. (like request context?)
Coordinator node selection
Any of the top N nodes from the preference list
A usual request pattern is read-then-write, so the node that replied fastest to the previous read operation is chosen to handle the write. This info is stored in the request context. This increases the chances of getting “read-your-writes” consistency.
Read repair: use vector clocks to resolve conflicts. But it's too slow if a replica falls too far behind.
Merkle trees are used to resolve conflicts in the background.
Workflow
Data is split into smaller parts. A Merkle tree is a binary tree of hashes: leaves hash the data parts, parents hash their children.
Compare 2 trees from root to leaves. When a difference is encountered, resolve conflicts only within that small subtree. Thus data transmission and workload are minimized.
Drawbacks: have to recalculate the tree when data changes.
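A toy Merkle-tree sketch (block layout, hash choice, and the diff walk are illustrative; it assumes both replicas hold the same number of blocks): only subtrees whose hashes differ are descended into, so two mostly-identical replicas exchange very little data.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(blocks):
    """Return the tree as a list of levels, from leaf hashes up to the root."""
    level = [h(block) for block in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                    # pad odd levels by repeating the last hash
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff_blocks(blocks_a, blocks_b):
    """Indices of leaf blocks that differ, found by descending only mismatched subtrees."""
    tree_a, tree_b = build_tree(blocks_a), build_tree(blocks_b)
    if tree_a[-1] == tree_b[-1]:
        return []                             # roots match: replicas are in sync
    suspects = [0]                            # node indices at the current level
    for depth in range(len(tree_a) - 2, -1, -1):
        next_suspects = []
        for idx in suspects:
            for child in (2 * idx, 2 * idx + 1):
                if child < len(tree_a[depth]) and tree_a[depth][child] != tree_b[depth][child]:
                    next_suspects.append(child)
        suspects = next_suspects
    return suspects

replica_a = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
replica_b = [b"k1=v1", b"k2=XX", b"k3=v3", b"k4=v4"]
print(diff_blocks(replica_a, replica_b))      # [1]: only block 1 needs syncing
```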
Peer-to-peer communication
Each node periodically exchanges info (basically a copy of the hash ring) with random nodes.
Seed nodes are used to bootstrap. Further reading: Uber's Ringpop.
Each Dynamo node contains the entire Dynamo routing table. This is likely to affect scalability.
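A toy gossip round (the membership-view dict and version counters are made up for illustration, not Dynamo's wire format): each node merges its view of the ring membership with a random peer's, so membership information spreads epidemically from the seed nodes outward.

```python
import random

class Node:
    """Toy gossip participant; the view maps node name -> version (heartbeat)."""
    def __init__(self, name, seeds=()):
        self.name = name
        self.view = {name: 0, **{s: 0 for s in seeds}}

    def gossip_with(self, peer: "Node") -> None:
        # both sides end up with the union of their views, keeping the max version per node
        for name in set(self.view) | set(peer.view):
            version = max(self.view.get(name, -1), peer.view.get(name, -1))
            self.view[name] = peer.view[name] = version

nodes = [Node("s1"), Node("s2", seeds=["s1"]), Node("s3", seeds=["s1"])]
for _ in range(5):                   # a few gossip rounds
    for node in nodes:
        node.gossip_with(random.choice(nodes))
print(sorted(nodes[0].view))         # typically ['s1', 's2', 's3'] after a few rounds
```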
Security concerns, e.g. no ACL.
Dynamo’s design can be described as a “leaky abstraction,” where client applications are often asked to manage inconsistency.