Bigtable
Paper: https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
Why Bigtable?
Non-relational
flexible schema, unlimited columns
Highly scalable
low latency: ~10 ms for in-memory reads, ~100 ms for disk reads
high throughput: on the order of 10k QPS
Not good for
relational data
transactions (Bigtable only supports single-row transactions)
small datasets
Data model
Flexible schema (Semi-structured/Sparse)
The number of ColumnFamilies is limited, but the number of columns is unlimited
One can store data in column names
Sparse: if a row doesn't have a value (cell) for a column, that cell is not stored at all.
RowKey + ColumnFamily:ColumnName + VersionTimestamp identify a value:
(row:string, column:string, time:int64) → string
A Bigtable is mapped onto SSTables, keyed by (row, column, timestamp)
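The one-line map above is easy to picture as code. A toy Python sketch of the model (class and method names are made up, not Bigtable's API); note how sparseness falls out naturally, since absent cells simply never appear in the map:

```python
import time

class BigtableModel:
    """Toy model of the map (row:string, column:string, time:int64) -> string."""
    def __init__(self):
        # (row, column) -> {timestamp: value}; absent cells take no space (sparse)
        self.cells = {}

    def put(self, row, column, value, timestamp=None):
        ts = timestamp if timestamp is not None else int(time.time() * 1e6)
        self.cells.setdefault((row, column), {})[ts] = value

    def get(self, row, column, at=None):
        """Newest version, or the newest version at/before timestamp `at`."""
        versions = self.cells.get((row, column), {})
        eligible = [ts for ts in versions if at is None or ts <= at]
        return versions[max(eligible)] if eligible else None

t = BigtableModel()
t.put("com.cnn.www", "contents:", "<html>v1", timestamp=1)
t.put("com.cnn.www", "contents:", "<html>v2", timestamp=2)
print(t.get("com.cnn.www", "contents:"))        # <html>v2 (latest version)
print(t.get("com.cnn.www", "contents:", at=1))  # <html>v1 (as of t=1)
```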
Row ordering
key in an SSTable:
RowKey - ColumnFamily - ColumnName - Timestamp
1) RowKey: sorted lexicographically in ascending order (A -> Z)
2) ColumnFamily: unsorted
3) ColumnName: sorted ascending within a ColumnFamily
4) Timestamp: newest -> oldest (see the sketch after this list)
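One common way to realize "newest first" inside an ascending-sorted file is to invert the timestamp in the sort key. A sketch under that assumption (the tuple encoding is illustrative, not the actual on-disk format, which per the note below encodes family names as numeric IDs):

```python
# Sketch of the SSTable key order: row keys and column qualifiers ascending,
# timestamps descending. Inverting the timestamp lets one ascending sort put
# the newest version of each cell first.
MAX_TS = 2**63 - 1  # int64 max

def sstable_sort_key(row, family, qualifier, ts):
    return (row, family, qualifier, MAX_TS - ts)

cells = [  # (row, family, qualifier, timestamp) -- examples from the paper
    ("com.cnn.www", "anchor", "my.look.ca", 8),
    ("com.cnn.www", "contents", "", 5),
    ("com.cnn.www", "anchor", "cnnsi.com", 9),
    ("com.cnn.www", "contents", "", 6),
]
for row, fam, col, ts in sorted(cells, key=lambda c: sstable_sort_key(*c)):
    print(f"{row} {fam}:{col} t={ts}")
# anchor:cnnsi.com t=9, anchor:my.look.ca t=8, then contents: t=6 before t=5
```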
Column
ColumnName format: FamilyName:ColumnName. The column name (qualifier) is a string of 0 or more characters; when its length is 0 it is the default column, e.g.
FamilyName:
FamilyName is stored as a numeric ID, so it adds little storage overhead; the column name is stored verbatim, so long names do impact storage.
ColumnFamily: unique within a table, limited to ~100
the smallest granularity for: data type restrictions, replication, ACLs, garbage collection
Some technical details
Row data size limits
RowKey: 64 KB limit; row keys are also used as the index (in the METADATA table)
row data: around 200 MB
single version of a cell: 100 MB
Tablet
a tablet holds a contiguous range of row keys, and at least 1 row
consists of 0 or more SSTables
each locality group in a tablet is stored together in its own SSTable
a table starts with 1 tablet and is split up (by the tablet server) as data grows
Architecture
Chubby:
bookkeeping of live tablet servers: each live server holds a lock on a small Chubby file
bootstrap for tablet location lookup: stores the location of the root tablet
detecting tablet server death: a server that loses its Chubby lock is considered dead
stores schema information (the column families of each table) and access control lists
master election: ensures there is at most one active master in a Bigtable cell
Master server
health-checks tablet servers
assigns tablets to servers
handles schema changes (e.g. table creation) and garbage collection of files in GFS
clients rarely interact with the master, so it stays lightly loaded
Tablet server
serves multiple tablets, typically between ten and a thousand per server
splits a tablet when it grows past the size threshold (100-200 MB per tablet in the paper; later deployments use larger tablets, e.g. ~1 GB)
if a server becomes too hot, the master shifts some of its tablets to other servers to distribute the load
in memory
block cache: recently read SSTable blocks
memtable: holds recent mutations
bloom filters: configured per locality group, one per SSTable; if a bloom filter returns false, the data is definitely not in that SSTable (see the sketch after this list)
in-memory locality groups: their SSTables are held in memory to speed up reads
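A minimal Bloom filter illustrating the "false means definitely absent" guarantee the read path relies on; the sizing and hashing here are arbitrary toy choices, not Bigtable's:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: no false negatives, a tunable rate of false positives."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set packed into one Python int

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False => key was never added; True => key is *probably* present.
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("row42/contents:")
print(bf.might_contain("row42/contents:"))  # True
print(bf.might_contain("row99/anchor:"))    # False (overwhelmingly likely):
                                            # safe to skip reading the SSTable
```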
GFS: the file system that stores the actual data (SSTables and commit logs). Tablet servers do not store data on local disk, other than what they cache in memory.
Tablet
essentially a slice of a table; consists of multiple SSTables (stored in GFS)
the unit of load balancing across servers
Client: talks to tablet servers directly, after resolving tablet locations through Chubby and the METADATA table
Implementation details
Tablet location lookup
3 levels:
1) Chubby file: stores the location of the root tablet (METADATA 0); the tablet data itself lives in GFS
2) Root tablet (METADATA 0): stores the locations of all other METADATA tablets; it never splits, so the index stays at exactly 3 levels
3) METADATA 1 tablets: store the locations of the user tablets
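A quick sanity check of the hierarchy's capacity, using the paper's numbers (~1 KB of metadata per row, METADATA tablets capped at 128 MB):

```python
# Capacity of the 3-level location hierarchy (numbers from the paper):
# each METADATA row is ~1 KB, and METADATA tablets are limited to 128 MB.
rows_per_metadata_tablet = (128 * 2**20) // (1 * 2**10)  # 2**17
addressable_tablets = rows_per_metadata_tablet ** 2       # root -> MD1 -> user
print(f"{rows_per_metadata_tablet = :,}")  # 131,072 tablets per METADATA tablet
print(f"{addressable_tablets = :,}")       # 17,179,869,184 = 2**34 user tablets
```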
Tablet Assignment
Each tablet is assigned to one tablet server at a time.
The (single) master keeps track of tablet servers and assigns tablets to them
Each tablet server registers itself by acquiring a lock on a file in a Chubby directory
If a server fails, it loses its Chubby lock, and the master reassigns its tablets.
Master workflow
1) acquire the master lock in Chubby
2) scan Chubby's servers directory to find all live tablet servers
3) communicate with each server to learn its existing tablet assignments
4) scan the METADATA table to find, and assign, unassigned tablets
Tablet serving
Tablet discovery
The client asks Chubby for the location of METADATA 0 (the root tablet, whose data is in GFS)
The client reads MD0 from GFS -> reads MD1 -> finds the address of the tablet server
Clients cache tablet locations in memory. Optimization: a client prefetches more tablet locations than it immediately needs. (Toy lookup sketch below.)
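A toy version of the 3-hop lookup plus client cache. The dicts stand in for Chubby, the root tablet, and the METADATA 1 tablets; real clients cache per tablet key range (and prefetch), not per key as this sketch does:

```python
# Toy 3-hop tablet location lookup with client-side caching.
CHUBBY = {"root_tablet": "md0"}                               # hop 1
MD0 = {"md0": [("users:m", "md1a"), ("users:~", "md1b")]}     # hop 2: root -> MD1
MD1 = {"md1a": [("users:f", "ts-7"), ("users:m", "ts-3")],    # hop 3: MD1 -> server
       "md1b": [("users:~", "ts-9")]}

cache = {}

def find(ranges, key):
    """Return the address of the first range whose end key is >= key."""
    return next(addr for end, addr in ranges if key <= end)

def locate(row_key):
    if row_key in cache:
        return cache[row_key]                 # cache hit: zero network reads
    md1_tablet = find(MD0[CHUBBY["root_tablet"]], row_key)
    server = find(MD1[md1_tablet], row_key)
    cache[row_key] = server
    return server

print(locate("users:alice"))  # ts-7 (three reads on the cache miss)
print(locate("users:alice"))  # ts-7 (served from the cache)
```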
Write path
a mutation request arrives at the tablet server
the server validates the request format and authorization (checking the ACL in Chubby, or the client's cached copy)
the mutation is persisted to the commit log
the mutation is applied to the memtable (see the sketch after this list)
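A minimal sketch of the log-then-memtable ordering, with validation, ACL checks, and group commit omitted; the class and file layout are illustrative stand-ins, not the real server:

```python
import os, tempfile

class WritePath:
    """Toy tablet-server write path: commit log first, then memtable."""
    def __init__(self, log_path):
        self.log = open(log_path, "a", encoding="utf-8")
        self.memtable = {}  # (row, column) -> value

    def apply_mutation(self, row, column, value):
        # 1) Durability first: append to the commit log and flush
        #    (the real server group-commits batches of log records).
        self.log.write(f"{row}\t{column}\t{value}\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # 2) Only then apply to the memtable, making it visible to reads.
        self.memtable[(row, column)] = value

server = WritePath(os.path.join(tempfile.gettempdir(), "commit.log"))
server.apply_mutation("com.cnn.www", "contents:", "<html>...")
print(server.memtable)
```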
Read path
the tablet server validates the request format and authorization
it queries the memtable + SSTable files and merges the results
since the SSTables and the memtable are both lexicographically sorted data structures, the merged view can be formed efficiently (merge sketch below)
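Because every source is already sorted, a read is a standard k-way merge over sorted iterators; in Python, `heapq.merge` does exactly this lazily. The inverted-timestamp key (as in the row-ordering sketch above) makes the newest version of each cell come out first:

```python
import heapq

# Each source is sorted by (row, column, inverted timestamp), newest first.
MAX_TS = 2**63 - 1
memtable = [("row1", "a:", 9, "v-new"), ("row3", "a:", 7, "v3")]
sstable1 = [("row1", "a:", 2, "v-old"), ("row2", "a:", 5, "v2")]
sstable2 = [("row4", "a:", 1, "v4")]

merged = heapq.merge(memtable, sstable1, sstable2,
                     key=lambda c: (c[0], c[1], MAX_TS - c[2]))
for row, col, ts, val in merged:
    print(row, col, f"t={ts}", val)
# row1 a: t=9 v-new   <- newest version first; a point read can stop here
# row1 a: t=2 v-old
# row2 a: t=5 v2
# row3 a: t=7 v3
# row4 a: t=1 v4
```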
Compactions
3 kinds of compaction
Minor compaction: freezes the memtable and writes it out as an SSTable
Merging compaction: merges a few recent SSTables and the memtable into one new SSTable
Major compaction: rewrites all SSTables into exactly one SSTable
deleted data is physically removed only at this stage; earlier compactions must keep tombstones, since older SSTables outside the compaction may still contain the deleted values (see the sketch below)
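A sketch of a major compaction, assuming cells carry descending-timestamp keys and deletions are tombstone markers: keep only the newest version of each cell, and drop cells whose newest version is a tombstone:

```python
import heapq

TOMBSTONE = object()  # deletion marker written by delete operations

def major_compact(*sstables):
    """Each input is a list of (key, timestamp, value), sorted by (key, -ts)."""
    out, last_key = [], None
    for key, ts, value in heapq.merge(*sstables, key=lambda e: (e[0], -e[1])):
        if key == last_key:
            continue                     # older version: superseded, drop it
        last_key = key
        if value is not TOMBSTONE:       # deletions physically removed here
            out.append((key, ts, value))
    return out

old = [("rowA", 1, "a1"), ("rowB", 1, "b1")]
new = [("rowA", 5, TOMBSTONE), ("rowB", 5, "b5")]
print(major_compact(old, new))  # [('rowB', 5, 'b5')] -- rowA is gone for good
```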
Refinement
Locality groups
Bundle the column families you frequently access together into a single locality group for faster access. If not specified, all column families are in the default locality group.
Bigtable also has in-memory locality groups that hold their data in memory for fast access.
Bundling similar data together also helps compression.
Compression
SSTables are compressed block by block, so a read only decompresses the blocks it needs
Better compression: cluster similar data by designing row keys so that related rows sort adjacently (example below)
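The paper's own example of this: storing web pages under reversed hostnames so pages from the same domain sort next to each other:

```python
# Row key design from the paper: pages keyed by reversed domain name, so
# pages of one site become neighbors, improving locality and compression.
urls = ["www.cnn.com/index.html", "money.cnn.com/markets", "www.bbc.co.uk/news"]

def row_key(url):
    host, _, path = url.partition("/")
    return ".".join(reversed(host.split("."))) + "/" + path

for key in sorted(map(row_key, urls)):
    print(key)
# com.cnn.money/markets
# com.cnn.www/index.html   <- all cnn.com pages sort together
# uk.co.bbc.www/news
```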
Read performance improvement
Caching
scan cache: key-value pairs, for clients that read the same data repeatedly
block cache: SSTable blocks, for clients that read nearby data, e.g. sequential reads, or different columns of a hot locality group
Bloom filters: one per locality group, held in tablet server memory
Commit log
a single commit log per tablet server, with mutations for all of its tablets co-mingled in one file
improves write performance: one physical log enables group commit of many small mutations
optimization for failure recovery
during recovery, the shared log is first sorted by the keys <table, row name, log sequence number>, so each tablet's mutations become contiguous and can be read with one sequential scan (toy illustration after this list)
coping with GFS latency spikes
each tablet server actually has two log-writing threads, each writing to its own log file; only one of the two is actively in use at a time
when the active thread performs poorly, the server switches to the other
log sequence numbers let the recovery process elide duplicate entries produced by such switches
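A toy illustration of the recovery-time sort (the entry layout and contents are made up):

```python
# One shared commit log holds mutations for many tablets; sorting by
# <table, row name, log sequence number> groups each tablet's mutations
# into one contiguous run.
log = [  # (table, row, seq, mutation)
    ("users", "bob",         3, "put name=Bob"),
    ("pages", "com.cnn.www", 1, "put contents=<html>"),
    ("users", "alice",       2, "put name=Alice"),
    ("pages", "com.cnn.www", 4, "put anchor:cnnsi.com=CNN"),
]
for entry in sorted(log, key=lambda e: (e[0], e[1], e[2])):
    print(entry)
# All "pages" mutations (seq 1 then 4) now sit together, so a recovering
# server reads each tablet's entries with one sequential scan instead of
# scanning the entire shared log.
```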
Speeding up tablet recovery
when the master moves a tablet to another server
the source tablet server performs a minor compaction (the paper actually does two: one while still serving, then a very fast second one after it stops) to persist the memtable to an SSTable, so the target has little or no commit log to replay
Benefits of immutability
SSTables are immutable
concurrency control becomes easy: reads of SSTables need no synchronization
deleting data: write a tombstone marker; physically removing deleted data turns into garbage-collecting obsolete SSTables
splitting a tablet: the child tablets share the SSTables of the parent tablet (sketch after this list)
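A sketch of why a split can be a metadata-only operation when SSTables are immutable (the dict layout is illustrative): the children keep the parent's SSTable list plus a narrower key range, and no data is rewritten at split time:

```python
# Tablet split under immutable SSTables: children *share* the parent's
# SSTables; later compactions localize each child's data.
parent = {"range": ("a", "z"), "sstables": ["sst1", "sst2"]}

def split(tablet, split_key):
    left  = {"range": (tablet["range"][0], split_key), "sstables": tablet["sstables"]}
    right = {"range": (split_key, tablet["range"][1]), "sstables": tablet["sstables"]}
    return left, right

left, right = split(parent, "m")
print(left)   # {'range': ('a', 'm'), 'sstables': ['sst1', 'sst2']}
print(right)  # {'range': ('m', 'z'), 'sstables': ['sst1', 'sst2']}
```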
Memtable
each row is copy-on-write, so reads and writes can proceed in parallel
Advanced features in the real world
Transaction implementation: read locks and write locks taken on specific values stored in the Bigtable itself (this is how Percolator layers multi-row transactions on top of Bigtable)