DynamoDB
Paper: https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
Dynamo is a distributed key-value store, i.e. a Distributed Hash Table (DHT). DynamoDB is the AWS database product based on Dynamo.
highly available (AP model, whereas Bigtable is CP and focuses on consistency)
highly scalable
schema-less
Why not
if the application requires strong consistency
Data distribution: consistent hashing (sketched below)
Replication: optimistic, eventual consistency
Handling failure: sloppy quorum (hinted handoff)
Inter-node communication: gossip protocol
Conflict resolution: vector clock
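A minimal consistent-hashing sketch in Python (the `HashRing` class, MD5 hashing, and 8 vnodes per node are illustrative assumptions, not Dynamo's actual implementation): keys and nodes are hashed onto the same ring, and a key's preference list is the first N distinct nodes found walking clockwise.

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring with virtual nodes (vnodes)."""

    def __init__(self, nodes, vnodes=8):
        self._ring = []  # sorted list of (token, node)
        for node in nodes:
            for i in range(vnodes):
                bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def preference_list(self, key: str, n: int = 3):
        """Walk clockwise from the key's position and collect n distinct nodes."""
        start = bisect.bisect(self._ring, (self._hash(key), ""))
        nodes, i = [], 0
        while len(nodes) < n and i < len(self._ring):
            _, node = self._ring[(start + i) % len(self._ring)]
            if node not in nodes:
                nodes.append(node)
            i += 1
        return nodes

ring = HashRing(["s1", "s2", "s3", "s4"])
print(ring.preference_list("user:42"))  # e.g. ['s3', 's1', 's4']
```

Because only the tokens adjacent to a joining or leaving node change owners, most keys stay put when membership changes.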
Dynamo is a key-value store; the value is a set of attributes.
Partition key: input to the hash function (MD5) -> partition (physical storage)
Partition key + sort key
composite primary key
items with the same partition key value are stored together, in sorted order by sort key value.
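A rough in-memory sketch of the composite-key idea (the `partition_of` helper and the 4-partition layout are made up for illustration, not the real storage engine): the partition key is hashed to pick the partition, and items sharing a partition key stay ordered by sort key, so a sort-key range query is served from a single partition.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4
# partition id -> { (partition_key, sort_key): item }
partitions = defaultdict(dict)

def partition_of(partition_key: str) -> int:
    # hash the partition key (MD5) and map the digest onto a partition
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16) % NUM_PARTITIONS

def put(partition_key: str, sort_key: str, item: dict) -> None:
    partitions[partition_of(partition_key)][(partition_key, sort_key)] = item

def query(partition_key: str, lo: str, hi: str) -> list:
    """All items with this partition key whose sort key is in [lo, hi], in sort-key order."""
    rows = partitions[partition_of(partition_key)]
    keys = sorted(k for k in rows if k[0] == partition_key and lo <= k[1] <= hi)
    return [rows[k] for k in keys]

put("user#1", "2021-01-05", {"event": "login"})
put("user#1", "2021-03-10", {"event": "purchase"})
put("user#2", "2021-02-01", {"event": "login"})
print(query("user#1", "2021-01-01", "2021-12-31"))  # user#1's events, in date order
```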
Secondary index
global
local
Query pattern
Range query is not well supported, unlike Bigtable scan. E.g. select * from A where age > 20
Since data is distributed across many nodes, this is basically a full scan of the cluster.
Solutions (either way, we may still end up doing a full scan)
use range key (aka sort key)
create global secondary index (reference); see the sketch after this list
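A simplified sketch of the global-secondary-index idea (the `age_index` structure, integer ages, and helper names are hypothetical): the index is a second lookup keyed and sorted by a different attribute, maintained alongside the base table, so a query on that attribute can binary-search the index instead of scanning every partition.

```python
import bisect

base_table = {}   # user_id -> item
age_index = []    # sorted list of (age, user_id) pairs: the "GSI"

def put(user_id: str, item: dict) -> None:
    base_table[user_id] = item
    bisect.insort(age_index, (item["age"], user_id))

def query_age_greater_than(min_age: int) -> list:
    # binary-search to the first entry with age > min_age (ages assumed to be integers)
    start = bisect.bisect_left(age_index, (min_age + 1,))
    return [base_table[user_id] for _, user_id in age_index[start:]]

put("u1", {"age": 25, "name": "Ada"})
put("u2", {"age": 18, "name": "Bob"})
print(query_age_greater_than(20))   # [{'age': 25, 'name': 'Ada'}]
```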
Local persistent storage
These (key, value) pairs are stored within that node using various storage systems depending on application needs. A few examples of such storage systems are:
BerkeleyDB Transactional Data Store
MySQL (for large objects)
An in-memory buffer (for best performance) backed by persistent storage
The coordinator node stores the data first, then asynchronously replicates it to the next N-1 nodes in the background. Replicas are not guaranteed to be identical at all times.
If a client cannot contact the coordinator node, it sends the request to a node holding a replica.
Add nodes, aka. bootstrapping.
Define how many VNodes the new machine is responsible for.
The allocation algorithm on the new node picks random VNodes (represented as tokens).
The new node asks the current replicas of those tokens to stream the data over.
Remove nodes.
Send command to remove a node.
The leaving node reassigns its tokens to other nodes and replicates its data to them; see the sketch below.
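A toy sketch of both flows (the token count, node names, and the `stream` helper are all made up): a joining node claims a few random tokens and streams their data from the current owners; a leaving node hands its tokens and data to the remaining nodes.

```python
import random

TOKENS = list(range(16))                       # 16 vnodes / tokens in the ring
owners = {t: f"s{t % 4 + 1}" for t in TOKENS}  # current token -> node assignment

def stream(src: str, dst: str, token: int) -> None:
    # stand-in for copying that token's data between nodes
    print(f"streaming token {token}: {src} -> {dst}")

def add_node(new_node: str, vnode_count: int = 4) -> None:
    """Bootstrapping: take over random tokens, pull their data, then flip ownership."""
    for token in random.sample(TOKENS, vnode_count):
        stream(src=owners[token], dst=new_node, token=token)
        owners[token] = new_node

def remove_node(node: str) -> None:
    """Decommissioning: push each owned token's data to another node, then reassign."""
    for token, owner in owners.items():
        if owner == node:
            new_owner = random.choice([n for n in set(owners.values()) if n != node])
            stream(src=node, dst=new_owner, token=token)
            owners[token] = new_owner

add_node("s5")
remove_node("s1")
```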
Preference list
The list of nodes responsible for storing a particular key. E.g. for key "K", the preference list is [server 1, 2, 3, 4], where server 4 stores the hinted replica.
Sloppy quorum ensures "always writable"
When a node is unreachable, another node can accept writes on its behalf. The write is then kept in a local buffer and sent out once the destination node is reachable again.
Drawbacks: data conflict
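A minimal hinted-handoff sketch (node names, `N`, and the helpers are illustrative assumptions): when one of the first N replicas is down, the write lands on the next healthy node in the preference list together with a hint, and a background job replays the hint once the intended node is reachable again.

```python
from collections import defaultdict

N = 3
preference_list = ["s1", "s2", "s3", "s4"]   # s4 is the hinted-replica slot
node_data = defaultdict(dict)                # node -> {key: value}
hints = defaultdict(list)                    # substitute node -> [(intended node, key, value)]

def put(key, value, is_up):
    written = 0
    for node in preference_list:
        if written == N:
            break
        if is_up(node):
            node_data[node][key] = value
            if preference_list.index(node) >= N:
                # this node only holds the data on behalf of a downed replica,
                # so remember which node it really belongs to
                intended = next(n for n in preference_list[:N] if not is_up(n))
                hints[node].append((intended, key, value))
            written += 1

def handoff(is_up):
    """Periodically replay hints whose intended owner is reachable again."""
    for node, pending in hints.items():
        for intended, key, value in list(pending):
            if is_up(intended):
                node_data[intended][key] = value
                pending.remove((intended, key, value))

put("K", "v1", is_up=lambda n: n != "s2")   # s2 is down, s4 takes the hinted write
handoff(is_up=lambda n: True)               # s2 recovers, the hint is delivered
print(node_data["s2"])                      # {'K': 'v1'}
```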
Clock skew
In a distributed system, we cannot assume that wall-clock time t on node A happened before time t + 1 on node B.
Hardware solution: GPS unit, atomic clock.
Dynamo returns conflicting versions to the client, which is responsible for explicitly reconciling the conflict.
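A small vector-clock sketch (the dict-of-counters representation is a simplification of the paper's (node, counter) pairs): a version that is greater-or-equal on every counter supersedes the other; otherwise the two versions are concurrent and both are returned to the client.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a new clock with this node's counter bumped."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a: dict, b: dict) -> bool:
    """True if version a is a descendant of (or equal to) version b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a: dict, b: dict) -> bool:
    return not descends(a, b) and not descends(b, a)

v1 = increment({}, "sx")     # write handled by sx            -> {sx: 1}
v2 = increment(v1, "sy")     # update via sy                  -> {sx: 1, sy: 1}
v3 = increment(v1, "sz")     # concurrent update via sz       -> {sx: 1, sz: 1}
print(descends(v2, v1))      # True: v2 supersedes v1
print(concurrent(v2, v3))    # True: Dynamo returns both to the client
```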
Alternative way to handle conflict
Model data as Conflict-free replicated data types (CRDTs). E.g. adding item A, B to shopping cart can be handled anywhere anytime, because the end result is 2 items in the cart anyway.
AKA strong eventual consistency
Downsides: not easy to model every kind of data as a CRDT
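A tiny CRDT sketch for the shopping-cart example above, modeled as a grow-only set (removing items would need a richer type such as an OR-Set, which this sketch ignores):

```python
def merge(replica_a: set, replica_b: set) -> set:
    # Set union is commutative, associative, and idempotent, so replicas can
    # apply adds in any order and still converge on the same cart.
    return replica_a | replica_b

cart_on_node1 = {"A"}   # client added item A while talking to node 1
cart_on_node2 = {"B"}   # concurrently added item B while talking to node 2
print(merge(cart_on_node1, cart_on_node2))  # {'A', 'B'} on both replicas after sync
```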
Last-write-wins (LWW).
Downsides
the wall clock doesn't guarantee a real "last" write.
can easily lose data, e.g. of 2 concurrent writes, 1 of them is thrown away.
Choosing coordinator node
Via load balancer
Pros: system is loosely coupled, helps scalability
Cons: extra hop increases latency, wastes resources
Dynamo uses a client library that maintains server addresses
Pros: zero-hop DHT, low latency
Cons: less control of load distribution
Quorum
R (read quorum) + W (write quorum) > N (replicas)
A common (N, R, W) configuration used by Dynamo is (3, 2, 2).
(3, 3, 1): fast W, slow R, not very durable
(3, 1, 3): fast R, slow W, durable
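A quick check of the quorum rule for the configurations above (plain Python, just the arithmetic): R + W > N means every read quorum overlaps every write quorum, so a read contacts at least one replica that saw the latest write.

```python
def overlaps(n: int, r: int, w: int) -> bool:
    """R + W > N guarantees a read quorum intersects the latest write quorum."""
    return r + w > n

for n, r, w in [(3, 2, 2), (3, 3, 1), (3, 1, 3)]:
    print((n, r, w), "read sees latest write:", overlaps(n, r, w))
# (3, 2, 2) True  -- Dynamo's common default
# (3, 3, 1) True  -- fast writes, slow reads, a single-ack write is not very durable
# (3, 1, 3) True  -- fast reads, slow writes, durable
```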
Write(put) process
coordinator node generates a vector clock
saves the data locally
sends write requests to the next N-1 nodes from the preference list
returns success after receiving W-1 confirmations (see the sketch after the read process)
Read(get) process
coordinator node requests the data from the next N-1 nodes from the preference list
waits for R-1 replies
handles data versions (then writes the reconciled version back to the nodes, aka read repair)
returns all relevant data versions (which may conflict) to the caller
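A coordinator-side sketch of the put/get flows above, using (N, R, W) = (3, 2, 2) (the `Coordinator`/`Replica` classes and in-process calls stand in for real RPCs; failure handling and conflict pruning are omitted):

```python
N, R, W = 3, 2, 2

class Replica:
    """Stand-in for a remote replica node; these calls would be RPCs in reality."""
    def __init__(self):
        self.store = {}
    def write(self, key, clock, value) -> bool:
        self.store[key] = (clock, value)
        return True                      # ack
    def read(self, key):
        return self.store.get(key)

class Coordinator:
    def __init__(self, node_id, replicas):
        self.node_id = node_id
        self.replicas = replicas         # the other N-1 nodes on the preference list
        self.store = {}                  # local storage: key -> (vector_clock, value)

    def put(self, key, value, context_clock) -> bool:
        clock = dict(context_clock)
        clock[self.node_id] = clock.get(self.node_id, 0) + 1   # generate vector clock
        self.store[key] = (clock, value)                       # save locally
        acks = sum(r.write(key, clock, value) for r in self.replicas)
        return acks >= W - 1             # success after W-1 confirmations

    def get(self, key):
        versions, replies = [self.store.get(key)], 0
        for r in self.replicas:
            versions.append(r.read(key))
            replies += 1
            if replies >= R - 1:         # wait for R-1 replies
                break
        # a fuller version would drop causally dominated versions via vector clocks
        # and write the winner back (read repair); conflicts go back to the caller
        return [v for v in versions if v is not None]

coord = Coordinator("s1", [Replica(), Replica()])
print(coord.put("K", "v1", context_clock={}))   # True once W-1 replicas ack
print(coord.get("K"))                           # [({'s1': 1}, 'v1'), ({'s1': 1}, 'v1')]
```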
State machine
Each client request results in creating a state machine on the node. (like request context?)
Coordinator node selection
Any of the top N nodes from the preference list
A usual request pattern is read-then-write, so the node that replied fastest to the previous read operation is chosen to handle the write. This info is stored in the request context. This increases the chances of getting “read-your-writes” consistency.
Read repair: use vector clocks to resolve conflicts. But it's too slow if a replica falls too far behind.
Merkle trees are used to resolve conflicts in the background.
Workflow
Data is split into smaller parts. A Merkle tree is a binary tree of hashes: leaves hash the data parts, parents hash their children.
Compare 2 trees from root to leaves. When a difference is encountered, resolve conflicts only within that small subtree. Thus data transmission and workload are minimized.
Drawbacks: have to recalculate the tree when data changes.
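A toy Merkle-tree sketch (block layout, hash choice, and the diff walk are illustrative; it assumes both replicas hold the same number of blocks): only subtrees whose hashes differ are descended into, so two mostly-identical replicas exchange very little data.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(blocks):
    """Return the tree as a list of levels, from leaf hashes up to the root."""
    level = [h(block) for block in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                    # pad odd levels by repeating the last hash
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff_blocks(blocks_a, blocks_b):
    """Indices of leaf blocks that differ, found by descending only mismatched subtrees."""
    tree_a, tree_b = build_tree(blocks_a), build_tree(blocks_b)
    if tree_a[-1] == tree_b[-1]:
        return []                             # roots match: replicas are in sync
    suspects = [0]                            # node indices at the current level
    for depth in range(len(tree_a) - 2, -1, -1):
        next_suspects = []
        for idx in suspects:
            for child in (2 * idx, 2 * idx + 1):
                if child < len(tree_a[depth]) and tree_a[depth][child] != tree_b[depth][child]:
                    next_suspects.append(child)
        suspects = next_suspects
    return suspects

replica_a = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
replica_b = [b"k1=v1", b"k2=XX", b"k3=v3", b"k4=v4"]
print(diff_blocks(replica_a, replica_b))      # [1]: only block 1 needs syncing
```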
Peer-to-peer communication
Each node periodically exchanges info (basically a copy of the hash ring) with random nodes.
Seed nodes are used to bootstrap. Further reading: Uber's Ringpop.
Each Dynamo node contains the entire Dynamo routing table. This is likely to affect scalability.
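A toy gossip round (the membership-view dict and version counters are made up for illustration, not Dynamo's wire format): each node merges its view of the ring membership with a random peer's, so membership information spreads epidemically from the seed nodes outward.

```python
import random

class Node:
    """Toy gossip participant; the view maps node name -> version (heartbeat)."""
    def __init__(self, name, seeds=()):
        self.name = name
        self.view = {name: 0, **{s: 0 for s in seeds}}

    def gossip_with(self, peer: "Node") -> None:
        # both sides end up with the union of their views, keeping the max version per node
        for name in set(self.view) | set(peer.view):
            version = max(self.view.get(name, -1), peer.view.get(name, -1))
            self.view[name] = peer.view[name] = version

nodes = [Node("s1"), Node("s2", seeds=["s1"]), Node("s3", seeds=["s1"])]
for _ in range(5):                   # a few gossip rounds
    for node in nodes:
        node.gossip_with(random.choice(nodes))
print(sorted(nodes[0].view))         # typically ['s1', 's2', 's3'] after a few rounds
```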
Security concerns, e.g. no ACL.
Dynamo’s design can be described as a “leaky abstraction,” where client applications are often asked to manage inconsistency.