DynamoDB
Paper: https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
Dynamo is a distributed key-value store, i.e. a Distributed Hash Table (DHT). DynamoDB is the AWS database product based on Dynamo.
Why DynamoDB
- highly available (AP model, whereas Bigtable is CP and focuses on consistency) 
- highly scalable 
- schema-less 
- why not: when the application requires strong consistency 
 
Architecture
- Data distribution: consistent hashing (see the sketch after this list) 
- Replication: optimistic replication, eventual consistency 
- Handling failure: sloppy quorum (hinted handoff) 
- Inter-node communication: gossip protocol 
- Conflict resolution: vector clocks 
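Here is a minimal Python sketch of consistent hashing with virtual nodes; the vnode count, node names, and ring layout are illustrative assumptions rather than Dynamo's actual parameters (Dynamo does hash keys with MD5).

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent hash ring with virtual nodes (vnodes)."""

    def __init__(self, vnodes_per_node=8):      # vnode count is an illustrative choice
        self.vnodes_per_node = vnodes_per_node
        self.ring = []                           # sorted list of (token, physical node)

    def _token(self, key):
        # MD5 gives a stable 128-bit token; Dynamo also hashes keys with MD5
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes_per_node):
            token = self._token(f"{node}#vnode{i}")
            bisect.insort(self.ring, (token, node))

    def remove_node(self, node):
        self.ring = [(t, n) for t, n in self.ring if n != node]

    def get_node(self, key):
        # the key is owned by the first vnode clockwise from its token
        token = self._token(key)
        idx = bisect.bisect_left(self.ring, (token, ""))
        return self.ring[idx % len(self.ring)][1]

ring = ConsistentHashRing()
for name in ["node-a", "node-b", "node-c"]:
    ring.add_node(name)
print(ring.get_node("user#42"))                  # maps the key to one of the three nodes
```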

Data model
- Dynamo is a key-value store. The value is a set of attributes. 
- Partition key: input to a hash function (MD5) -> partition (physical storage) 
- Partition key + sort key = composite primary key (see the sketch after this list) 
- Items with the same partition key value are stored together, in sorted order by sort key value. 
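As an illustration of the composite primary key, a hedged boto3 sketch that defines a hypothetical "Orders" table with customer_id as partition key and order_date as sort key (the table and attribute names are made up):

```python
import boto3

dynamodb = boto3.resource("dynamodb")

# Hypothetical table: partition key = customer_id, sort key = order_date.
# Items with the same customer_id land in the same partition, sorted by order_date.
table = dynamodb.create_table(
    TableName="Orders",
    KeySchema=[
        {"AttributeName": "customer_id", "KeyType": "HASH"},    # partition key
        {"AttributeName": "order_date", "KeyType": "RANGE"},    # sort key
    ],
    AttributeDefinitions=[
        {"AttributeName": "customer_id", "AttributeType": "S"},
        {"AttributeName": "order_date", "AttributeType": "S"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
```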
 
- Secondary index: global or local 
 
- Query pattern - Range queries are not well supported, unlike Bigtable's scan. E.g. select * from A where age > 20: since data is distributed across many nodes, this is basically a full scan of the cluster. 
 
- Solutions (either way, we may still end up doing a full scan) - use a range key (aka. sort key) 
- create a global secondary index (reference); see the sketch below 
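Continuing the hypothetical "Orders" table above, a hedged boto3 sketch of both workarounds: a sort-key range query that stays inside one partition, and a query against an assumed global secondary index (the "age-index" name and age_bucket attribute are invented for illustration):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Orders")

# 1) Range query on the sort key: efficient, because it stays within one partition.
recent_orders = table.query(
    KeyConditionExpression=Key("customer_id").eq("42")
    & Key("order_date").between("2023-01-01", "2023-12-31")
)

# 2) Query a hypothetical GSI whose keys were chosen for this access pattern,
#    instead of scanning the whole table for "age > 20".
adults = table.query(
    IndexName="age-index",                       # assumed to exist on this table
    KeyConditionExpression=Key("age_bucket").eq("20s") & Key("age").gt(20),
)
```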
 
 
- Local persistent storage - These (key, value) pairs are stored within that node using various storage systems depending on application needs. A few examples of such storage systems are: - BerkeleyDB Transactional Data Store 
- MySQL (for large objects) 
- An in-memory buffer (for best performance) backed by persistent storage 
 
 



Implementation details
Optimistic replication
The coordinator node stores the data first, then asynchronously replicates it to the next N-1 nodes in the background. Replicas are not guaranteed to be identical at all times.
If a client cannot contact the coordinator node, it sends the request to a node holding a replica.
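A toy Python sketch of this write path, ignoring the quorum acknowledgements described later: the coordinator acknowledges after its local write, and a background thread pushes the update to replicas, which may therefore lag behind.

```python
import queue
import threading

class Replica:
    """Toy in-memory replica."""
    def __init__(self):
        self.store = {}
    def put(self, key, value):
        self.store[key] = value

class Coordinator:
    """Toy optimistic replication: write locally, replicate asynchronously."""

    def __init__(self, replicas):
        self.local_store = {}
        self.replicas = replicas                  # Replica objects defined above
        self.outbox = queue.Queue()
        threading.Thread(target=self._replicate_loop, daemon=True).start()

    def put(self, key, value):
        self.local_store[key] = value             # store locally first
        self.outbox.put((key, value))             # replicate in the background
        return "ok"                               # ack before replicas have the data

    def _replicate_loop(self):
        while True:
            key, value = self.outbox.get()
            for replica in self.replicas:
                replica.put(key, value)           # replicas converge eventually
```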

Add & remove nodes
- Add nodes, aka. bootstrapping. - Define how many VNodes the new machine is responsible for. 
- The allocation algorithm on the new node picks random VNodes (represented as tokens). 
- The new node asks the current replicas of those tokens to stream the data over (see the sketch after this list). 

- Remove nodes. - Send a command to remove a node. 
- The old node reassigns its tokens to other nodes and also replicates its data to those nodes. 
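A rough sketch of both operations over a token map (token -> owning node); the random token selection and the streaming bookkeeping are simplified assumptions.

```python
import random

def bootstrap(ring_tokens, new_node, num_vnodes=8, token_space=2 ** 128):
    """Toy bootstrap: the new node claims random tokens and streams their data.

    ring_tokens maps token -> owning node. Returns (token, old_owner) pairs whose
    key ranges the new node must stream from the current owner.
    """
    existing = sorted(ring_tokens)
    transfers = []
    for _ in range(num_vnodes):
        token = random.randrange(token_space)
        # the current owner is the first existing vnode clockwise from the token
        successor = next((t for t in existing if t >= token), existing[0])
        transfers.append((token, ring_tokens[successor]))
        ring_tokens[token] = new_node             # the new node now serves this token
    return transfers

def decommission(ring_tokens, node):
    """Toy removal: after replicating the node's data to its successors, drop its
    tokens so each key range falls to the next clockwise vnode."""
    for token in [t for t, owner in ring_tokens.items() if owner == node]:
        del ring_tokens[token]
```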
 

Sloppy quorum
- Preference list - The list of nodes responsible for storing a particular key. E.g. for key "K", the preference list is [server 1, 2, 3, 4], where server 4 stores a hinted replica. 
 
- Sloppy quorum ensures "always writable" - When a node is unreachable, another node can accept writes on its behalf. The write is kept in a local buffer and sent out once the destination node is reachable again (see the sketch after this list). 
 
- Drawbacks: data conflict 
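A toy sketch of sloppy quorum with hinted handoff; the node API here is invented for illustration.

```python
class Node:
    """Toy storage node that can also hold hinted replicas for unreachable peers."""

    def __init__(self, name):
        self.name = name
        self.store = {}          # normal replicas
        self.hints = []          # (intended_node_name, key, value) held until handoff
        self.alive = True

    def put(self, key, value, hint_for=None):
        if hint_for is None:
            self.store[key] = value
        else:
            self.hints.append((hint_for, key, value))

def sloppy_write(key, value, preference_list):
    """Write to healthy nodes; an unreachable node's write goes to the next healthy
    node on the preference list as a hinted replica (keeping the system writable)."""
    healthy = [n for n in preference_list if n.alive]
    for node in preference_list:
        if node.alive:
            node.put(key, value)
        elif healthy:
            healthy[0].put(key, value, hint_for=node.name)

def handoff(holder, nodes_by_name):
    """Deliver hinted writes once their intended node is reachable again."""
    remaining = []
    for intended, key, value in holder.hints:
        target = nodes_by_name[intended]
        if target.alive:
            target.put(key, value)
        else:
            remaining.append((intended, key, value))
    holder.hints = remaining
```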

Vector clock
- Clock skew - In a distributed system, we cannot assume that wall-clock time t on node a happened before time t + 1 on node b. 
- Hardware solutions: GPS units, atomic clocks. 
 
- Dynamo returns conflicting versions (detected via vector clocks, sketched after this list) to the client, which is responsible for explicitly reconciling the conflict. 
- Alternative way to handle conflicts - Model data as Conflict-free Replicated Data Types (CRDTs). E.g. adding items A and B to a shopping cart can be handled anywhere, anytime, because the end result is 2 items in the cart either way. - AKA strong eventual consistency 
- Downside: not every kind of data is easy to model as a CRDT 
 
- Last-write-wins (LWW). - Downsides - using the wall clock doesn't guarantee a real "last" write. 
- can easily lose data, e.g. of 2 concurrent writes, 1 of them is thrown away. 
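A minimal vector-clock sketch (a dict of node -> counter), showing how two concurrent updates to the same key are detected so that both versions can be returned to the client.

```python
def increment(clock, node):
    """Return a copy of the vector clock with node's counter bumped."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new

def descends(a, b):
    """True if clock a is a descendant of (>=) clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a after b"        # a supersedes b
    if descends(b, a):
        return "b after a"
    return "concurrent"           # conflict: both versions go back to the client

# Two replicas update the same key independently -> concurrent versions.
v1 = increment({}, "node-a")      # {'node-a': 1}
v2 = increment(v1, "node-a")      # {'node-a': 2}, descends from v1
v3 = increment(v1, "node-b")      # {'node-a': 1, 'node-b': 1}
print(compare(v2, v1))            # 'a after b'
print(compare(v2, v3))            # 'concurrent' -> the client must reconcile
```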
 
 
 

Put & Get operation
- Choosing the coordinator node - Via a load balancer - Pros: the system is loosely coupled, which helps scalability 
- Cons: the extra hop increases latency and wastes resources 
 
- Dynamo uses a client library that maintains server addresses - Pros: zero-hop DHT, low latency 
- Cons: less control over load distribution 
 
 
- Quorum - R (reads) + W (writes) > N (replicas) 
- A common (N, R, W) configuration used by Dynamo is (3, 2, 2). - (3, 3, 1): fast writes, slow reads, not very durable 
- (3, 1, 3): fast reads, slow writes, durable 
 
 
- Write (put) process - the coordinator node generates a vector clock 
- saves the data locally 
- sends write requests to the N-1 other nodes from the preference list 
- returns success after receiving W-1 confirmations 
 
- Read (get) process (sketched together with the write path after this list) - the coordinator node requests data from the N-1 other nodes from the preference list 
- waits until R-1 replies arrive 
- handles the data versions (then writes the resolved version back to the nodes, aka. read repair) 
- returns all relevant data versions (which may conflict) to the caller 
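A toy sketch of the quorum read/write paths with (N, R, W) = (3, 2, 2); the Replica class and its interface are invented, and vector clocks plus read repair are omitted for brevity.

```python
class Replica:
    """Toy in-memory replica; 'alive' simulates reachability."""

    def __init__(self):
        self.data = {}
        self.alive = True

    def try_put(self, key, value):
        if self.alive:
            self.data[key] = value
        return self.alive

    def try_get(self, key):
        return self.data.get(key) if self.alive else None

class QuorumCoordinator:
    """Toy coordinator over the top N nodes of the preference list, with R + W > N."""

    def __init__(self, preference_list, n=3, r=2, w=2):
        assert r + w > n, "quorum overlap requires R + W > N"
        self.replicas = preference_list[:n]
        self.r, self.w = r, w

    def put(self, key, value):
        acks = 0
        for node in self.replicas:               # first replica stands in for the local write
            if node.try_put(key, value):
                acks += 1
            if acks >= self.w:
                return True                      # success after W acknowledgements
        return False

    def get(self, key):
        versions = []
        for node in self.replicas:
            value = node.try_get(key)
            if value is not None:
                versions.append(value)
            if len(versions) >= self.r:
                break
        return versions                          # all (possibly conflicting) versions go to the caller

replicas = [Replica() for _ in range(3)]
replicas[2].alive = False                        # one replica is unreachable
coord = QuorumCoordinator(replicas, n=3, r=2, w=2)
print(coord.put("cart:42", "item-a"))            # True: 2 acks satisfy W = 2
print(coord.get("cart:42"))                      # ['item-a', 'item-a'] from the healthy replicas
```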
 
- State machine - Each client request results in creating a state machine on the node (like a request context?). 
- Coordinator node selection - Any of the top N nodes from the preference list. 
- A usual request pattern is read-then-write, so the node that replied fastest in the previous read operation is chosen to handle the write; this info is stored in the request context. This increases the chances of getting "read-your-writes" consistency. 
 
 

Anti-entropy Through Merkle Trees
- Read repair: uses vector clocks to resolve conflicts, but it's too slow if a replica falls too far behind. 
- Merkle trees are used to resolve conflicts in the background. 
- Workflow - Data is split into smaller ranges. A Merkle tree is a binary tree of hashes over those ranges. 
- Compare the 2 trees from root to leaves. Where a difference is encountered, resolve conflicts in that small subtree. Thus data transmission and workload are minimized (see the sketch after this list). 
 
- Drawbacks: have to recalculate the tree when data changes. 
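A toy sketch of the idea: hash each replica's key ranges (buckets) into leaves, build a tree of hashes upward, and compare the trees so that only differing buckets need to be exchanged; the bucketing scheme here is a simplification.

```python
import hashlib

def leaf_hashes(items, buckets=8):
    """Hash each key-range bucket of a replica's data into a leaf digest."""
    leaves = [hashlib.sha256() for _ in range(buckets)]
    for key in sorted(items):
        bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % buckets   # toy bucketing
        leaves[bucket].update(f"{key}={items[key]}".encode())
    return [h.hexdigest() for h in leaves]

def build_tree(leaves):
    """Bottom-up Merkle tree: each parent is the hash of its two children."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([
            hashlib.sha256((prev[i] + prev[i + 1]).encode()).hexdigest()
            for i in range(0, len(prev), 2)
        ])
    return levels                                # levels[-1][0] is the root hash

def out_of_sync_buckets(tree_a, tree_b):
    """If the roots match, the replicas are in sync; otherwise report differing leaf
    buckets (a full implementation would descend the tree level by level)."""
    if tree_a[-1][0] == tree_b[-1][0]:
        return []
    return [i for i, (a, b) in enumerate(zip(tree_a[0], tree_b[0])) if a != b]

replica_a = {"k1": "v1", "k2": "v2"}
replica_b = {"k1": "v1", "k2": "stale"}
print(out_of_sync_buckets(build_tree(leaf_hashes(replica_a)),
                          build_tree(leaf_hashes(replica_b))))   # only these buckets are exchanged
```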

Gossip protocol
- Peer-to-peer communication 
- Each node periodically exchanges info (basically a copy of the hash ring) with random nodes (sketched below). 
- Seed nodes are used to bootstrap. Further reading: Uber's Ringpop. 
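A toy sketch of gossip-based membership: each node keeps a version per known peer and periodically merges state with one random peer, so knowledge of the ring spreads without a central coordinator.

```python
import random

class GossipNode:
    """Toy gossip: each round, merge membership state with one random peer."""

    def __init__(self, name):
        self.name = name
        self.state = {name: 1}                   # node name -> version/heartbeat counter

    def gossip_round(self, peers):
        peer = random.choice(peers)
        # both sides keep the highest version they have seen for every node
        for name, version in peer.state.items():
            if version > self.state.get(name, 0):
                self.state[name] = version
        for name, version in self.state.items():
            if version > peer.state.get(name, 0):
                peer.state[name] = version

nodes = [GossipNode(f"n{i}") for i in range(4)]
for _ in range(5):                               # a few rounds let the state spread
    for node in nodes:
        node.gossip_round([p for p in nodes if p is not node])
print(all(len(n.state) == len(nodes) for n in nodes))   # very likely True: every node knows every node
```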

Inspired by Dynamo: Cassandra and Riak

Criticism on Dynamo
- Each Dynamo node contains the entire Dynamo routing table. This is likely to affect scalability as the cluster grows. 
- Security concerns, e.g. no ACL. 
- Dynamo’s design can be described as a “leaky abstraction,” where client applications are often asked to manage inconsistency. 