Bigtable
Paper: https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
Why Bigtable?
Non-relational
flexible schema, unlimited columns
Highly scalable
low latency: ~10 ms for in-memory reads, ~100 ms for disk reads
high throughput: on the order of 10k QPS
Not good for
relational data
transactions (Bigtable only supports single-row transactions)
small datasets
Data model
Flexible schema (Semi-structured/Sparse)
The number of ColumnFamilies is limited, but the number of columns is unlimited
One can store data in column names
Sparse: if a row doesn't have a value (cell) for a column, that cell is not stored at all.
RowKey + ColumnFamily:ColumnName + VersionTimestamp identify a value:
(row:string, column:string, time:int64) → string
A Bigtable is mapped onto SSTables, keyed by (row, column, timestamp)
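The one-line map above is easy to picture as code. A toy Python sketch of the model (class and method names are made up, not Bigtable's API); note how sparseness falls out naturally, since absent cells simply never appear in the map:

```python
import time

class BigtableModel:
    """Toy model of the map (row:string, column:string, time:int64) -> string."""
    def __init__(self):
        # (row, column) -> {timestamp: value}; absent cells take no space (sparse)
        self.cells = {}

    def put(self, row, column, value, timestamp=None):
        ts = timestamp if timestamp is not None else int(time.time() * 1e6)
        self.cells.setdefault((row, column), {})[ts] = value

    def get(self, row, column, at=None):
        """Newest version, or the newest version at/before timestamp `at`."""
        versions = self.cells.get((row, column), {})
        eligible = [ts for ts in versions if at is None or ts <= at]
        return versions[max(eligible)] if eligible else None

t = BigtableModel()
t.put("com.cnn.www", "contents:", "<html>v1", timestamp=1)
t.put("com.cnn.www", "contents:", "<html>v2", timestamp=2)
print(t.get("com.cnn.www", "contents:"))        # <html>v2 (latest version)
print(t.get("com.cnn.www", "contents:", at=1))  # <html>v1 (as of t=1)
```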
Row ordering
key in an SSTable:
RowKey - ColumnFamily - ColumnName - Timestamp
1) RowKey: sorted lexicographically in ascending order (A -> Z)
2) ColumnFamily: unsorted
3) ColumnName: sorted ascending within a ColumnFamily
4) Timestamp: newest -> oldest (see the sketch after this list)
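One common way to realize "newest first" inside an ascending-sorted file is to invert the timestamp in the sort key. A sketch under that assumption (the tuple encoding is illustrative, not the actual on-disk format, which per the note below encodes family names as numeric IDs):

```python
# Sketch of the SSTable key order: row keys and column qualifiers ascending,
# timestamps descending. Inverting the timestamp lets one ascending sort put
# the newest version of each cell first.
MAX_TS = 2**63 - 1  # int64 max

def sstable_sort_key(row, family, qualifier, ts):
    return (row, family, qualifier, MAX_TS - ts)

cells = [  # (row, family, qualifier, timestamp) -- examples from the paper
    ("com.cnn.www", "anchor", "my.look.ca", 8),
    ("com.cnn.www", "contents", "", 5),
    ("com.cnn.www", "anchor", "cnnsi.com", 9),
    ("com.cnn.www", "contents", "", 6),
]
for row, fam, col, ts in sorted(cells, key=lambda c: sstable_sort_key(*c)):
    print(f"{row} {fam}:{col} t={ts}")
# anchor:cnnsi.com t=9, anchor:my.look.ca t=8, then contents: t=6 before t=5
```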
Column
ColumnName format: FamilyName:ColumnName. The column name (qualifier) is a string of 0 or more characters; when its length is 0 it is the default column, e.g.
FamilyName:
FamilyName is stored as a numeric ID, so it adds little storage overhead; the column name is stored verbatim, so long names do impact storage.
ColumnFamily: unique within a table, limited to ~100
the smallest granularity for: data type restrictions, replication, ACLs, garbage collection
Some technical details
Row data size limits
RowKey: 64 KB limit; row keys are also used as the index (in the METADATA table)
row data: around 200 MB
single version of a cell: 100 MB
Tablet
a tablet holds a contiguous range of row keys, and at least 1 row
consists of 0 or more SSTables
each locality group in a tablet is stored together in its own SSTable
a table starts with 1 tablet and is split up (by the tablet server) as data grows
Architecture
Chubby:
bookkeeping of live tablet servers: each live server holds a lock on a small Chubby file
bootstrap for tablet location lookup: stores the location of the root tablet
detecting tablet server death: a server that loses its Chubby lock is considered dead
stores schema information (the column families of each table) and access control lists
master election: ensures there is at most one active master in a Bigtable cell
Master server
health-checks tablet servers
assigns tablets to servers
handles schema changes (e.g. table creation) and garbage collection of files in GFS
clients rarely interact with the master, so it stays lightly loaded
Tablet server
serves multiple tablets, typically between ten and a thousand per server
splits a tablet when it grows past the size threshold (100-200 MB per tablet in the paper; later deployments use larger tablets, e.g. ~1 GB)
if a server becomes too hot, the master shifts some of its tablets to other servers to distribute the load
in memory
block cache: recently read SSTable blocks
memtable: holds recent mutations
bloom filters: configured per locality group, one per SSTable; if a bloom filter returns false, the data is definitely not in that SSTable (see the sketch after this list)
in-memory locality groups: their SSTables are held in memory to speed up reads
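A minimal Bloom filter illustrating the "false means definitely absent" guarantee the read path relies on; the sizing and hashing here are arbitrary toy choices, not Bigtable's:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: no false negatives, a tunable rate of false positives."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set packed into one Python int

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False => key was never added; True => key is *probably* present.
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("row42/contents:")
print(bf.might_contain("row42/contents:"))  # True
print(bf.might_contain("row99/anchor:"))    # False (overwhelmingly likely):
                                            # safe to skip reading the SSTable
```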
GFS: the file system that stores the actual data (SSTables and commit logs). Tablet servers do not store data on local disk, other than what they cache in memory.
Tablet
essentially a slice of a table; consists of multiple SSTables (stored in GFS)
the unit of load balancing across servers
Client: talks to tablet servers directly, after resolving tablet locations through Chubby and the METADATA table
Implementation details
Tablet location lookup
3 levels:
1) Chubby file: stores the location of the root tablet (METADATA 0); the tablet data itself lives in GFS
2) Root tablet (METADATA 0): stores the locations of all other METADATA tablets; it never splits, so the index stays at exactly 3 levels
3) METADATA 1 tablets: store the locations of the user tablets
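A quick sanity check of the hierarchy's capacity, using the paper's numbers (~1 KB of metadata per row, METADATA tablets capped at 128 MB):

```python
# Capacity of the 3-level location hierarchy (numbers from the paper):
# each METADATA row is ~1 KB, and METADATA tablets are limited to 128 MB.
rows_per_metadata_tablet = (128 * 2**20) // (1 * 2**10)  # 2**17
addressable_tablets = rows_per_metadata_tablet ** 2       # root -> MD1 -> user
print(f"{rows_per_metadata_tablet = :,}")  # 131,072 tablets per METADATA tablet
print(f"{addressable_tablets = :,}")       # 17,179,869,184 = 2**34 user tablets
```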
Tablet Assignment
Each tablet is assigned to one tablet server at a time.
The (single) master keeps track of tablet servers and assigns tablets to them
Each tablet server registers itself by acquiring a lock on a file in a Chubby directory
If a server fails, it loses its Chubby lock, and the master reassigns its tablets.
Master workflow
1) acquire the master lock in Chubby
2) scan Chubby's servers directory to find all live tablet servers
3) communicate with each server to learn its existing tablet assignments
4) scan the METADATA table to find, and assign, unassigned tablets
Tablet serving
Tablet discovery
The client asks Chubby for the location of METADATA 0 (the root tablet, whose data is in GFS)
The client reads MD0 from GFS -> reads MD1 -> finds the address of the tablet server
Clients cache tablet locations in memory. Optimization: a client prefetches more tablet locations than it immediately needs. (Toy lookup sketch below.)
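A toy version of the 3-hop lookup plus client cache. The dicts stand in for Chubby, the root tablet, and the METADATA 1 tablets; real clients cache per tablet key range (and prefetch), not per key as this sketch does:

```python
# Toy 3-hop tablet location lookup with client-side caching.
CHUBBY = {"root_tablet": "md0"}                               # hop 1
MD0 = {"md0": [("users:m", "md1a"), ("users:~", "md1b")]}     # hop 2: root -> MD1
MD1 = {"md1a": [("users:f", "ts-7"), ("users:m", "ts-3")],    # hop 3: MD1 -> server
       "md1b": [("users:~", "ts-9")]}

cache = {}

def find(ranges, key):
    """Return the address of the first range whose end key is >= key."""
    return next(addr for end, addr in ranges if key <= end)

def locate(row_key):
    if row_key in cache:
        return cache[row_key]                 # cache hit: zero network reads
    md1_tablet = find(MD0[CHUBBY["root_tablet"]], row_key)
    server = find(MD1[md1_tablet], row_key)
    cache[row_key] = server
    return server

print(locate("users:alice"))  # ts-7 (three reads on the cache miss)
print(locate("users:alice"))  # ts-7 (served from the cache)
```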
Write path
a mutation request arrives at the tablet server
the server validates the request format and authorization (checking the ACL in Chubby, or the client's cached copy)
the mutation is persisted to the commit log
the mutation is applied to the memtable (see the sketch after this list)
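A minimal sketch of the log-then-memtable ordering, with validation, ACL checks, and group commit omitted; the class and file layout are illustrative stand-ins, not the real server:

```python
import os, tempfile

class WritePath:
    """Toy tablet-server write path: commit log first, then memtable."""
    def __init__(self, log_path):
        self.log = open(log_path, "a", encoding="utf-8")
        self.memtable = {}  # (row, column) -> value

    def apply_mutation(self, row, column, value):
        # 1) Durability first: append to the commit log and flush
        #    (the real server group-commits batches of log records).
        self.log.write(f"{row}\t{column}\t{value}\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # 2) Only then apply to the memtable, making it visible to reads.
        self.memtable[(row, column)] = value

server = WritePath(os.path.join(tempfile.gettempdir(), "commit.log"))
server.apply_mutation("com.cnn.www", "contents:", "<html>...")
print(server.memtable)
```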
Read path
the tablet server validates the request format and authorization
it queries the memtable + SSTable files and merges the results
since the SSTables and the memtable are both lexicographically sorted data structures, the merged view can be formed efficiently (merge sketch below)
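Because every source is already sorted, a read is a standard k-way merge over sorted iterators; in Python, `heapq.merge` does exactly this lazily. The inverted-timestamp key (as in the row-ordering sketch above) makes the newest version of each cell come out first:

```python
import heapq

# Each source is sorted by (row, column, inverted timestamp), newest first.
MAX_TS = 2**63 - 1
memtable = [("row1", "a:", 9, "v-new"), ("row3", "a:", 7, "v3")]
sstable1 = [("row1", "a:", 2, "v-old"), ("row2", "a:", 5, "v2")]
sstable2 = [("row4", "a:", 1, "v4")]

merged = heapq.merge(memtable, sstable1, sstable2,
                     key=lambda c: (c[0], c[1], MAX_TS - c[2]))
for row, col, ts, val in merged:
    print(row, col, f"t={ts}", val)
# row1 a: t=9 v-new   <- newest version first; a point read can stop here
# row1 a: t=2 v-old
# row2 a: t=5 v2
# row3 a: t=7 v3
# row4 a: t=1 v4
```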
Compactions
3 kinds of compaction
Minor compaction: freezes the memtable and writes it out as an SSTable
Merging compaction: merges a few recent SSTables and the memtable into one new SSTable
Major compaction: rewrites all SSTables into exactly one SSTable
deleted data is physically removed only at this stage; earlier compactions must keep tombstones, since older SSTables outside the compaction may still contain the deleted values (see the sketch below)
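A sketch of a major compaction, assuming cells carry descending-timestamp keys and deletions are tombstone markers: keep only the newest version of each cell, and drop cells whose newest version is a tombstone:

```python
import heapq

TOMBSTONE = object()  # deletion marker written by delete operations

def major_compact(*sstables):
    """Each input is a list of (key, timestamp, value), sorted by (key, -ts)."""
    out, last_key = [], None
    for key, ts, value in heapq.merge(*sstables, key=lambda e: (e[0], -e[1])):
        if key == last_key:
            continue                     # older version: superseded, drop it
        last_key = key
        if value is not TOMBSTONE:       # deletions physically removed here
            out.append((key, ts, value))
    return out

old = [("rowA", 1, "a1"), ("rowB", 1, "b1")]
new = [("rowA", 5, TOMBSTONE), ("rowB", 5, "b5")]
print(major_compact(old, new))  # [('rowB', 5, 'b5')] -- rowA is gone for good
```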
Refinement
Locality groups
Bundle the column families you frequently access together into a single locality group for faster access. If not specified, all column families are in the default locality group.
Bigtable also has in-memory locality groups that hold their data in memory for fast access.
Bundling similar data together also helps compression.
Compression
SSTables are compressed block by block, so a read only decompresses the blocks it needs
Better compression: cluster similar data by designing row keys so that related rows sort adjacently (example below)
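The paper's own example of this: storing web pages under reversed hostnames so pages from the same domain sort next to each other:

```python
# Row key design from the paper: pages keyed by reversed domain name, so
# pages of one site become neighbors, improving locality and compression.
urls = ["www.cnn.com/index.html", "money.cnn.com/markets", "www.bbc.co.uk/news"]

def row_key(url):
    host, _, path = url.partition("/")
    return ".".join(reversed(host.split("."))) + "/" + path

for key in sorted(map(row_key, urls)):
    print(key)
# com.cnn.money/markets
# com.cnn.www/index.html   <- all cnn.com pages sort together
# uk.co.bbc.www/news
```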
Read performance improvement
Caching
scan cache: key-value pairs, for clients that read the same data repeatedly
block cache: SSTable blocks, for clients that read nearby data, e.g. sequential reads, or different columns of a hot locality group
Bloom filters: one per locality group, held in tablet server memory
Commit log
a single commit log per tablet server, with mutations for all of its tablets co-mingled in one file
improves write performance: one physical log enables group commit of many small mutations
optimization for failure recovery
during recovery, the shared log is first sorted by the keys <table, row name, log sequence number>, so each tablet's mutations become contiguous and can be read with one sequential scan (toy illustration after this list)
coping with GFS latency spikes
each tablet server actually has two log-writing threads, each writing to its own log file; only one of the two is actively in use at a time
when the active thread performs poorly, the server switches to the other
log sequence numbers let the recovery process elide duplicate entries produced by such switches
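A toy illustration of the recovery-time sort (the entry layout and contents are made up):

```python
# One shared commit log holds mutations for many tablets; sorting by
# <table, row name, log sequence number> groups each tablet's mutations
# into one contiguous run.
log = [  # (table, row, seq, mutation)
    ("users", "bob",         3, "put name=Bob"),
    ("pages", "com.cnn.www", 1, "put contents=<html>"),
    ("users", "alice",       2, "put name=Alice"),
    ("pages", "com.cnn.www", 4, "put anchor:cnnsi.com=CNN"),
]
for entry in sorted(log, key=lambda e: (e[0], e[1], e[2])):
    print(entry)
# All "pages" mutations (seq 1 then 4) now sit together, so a recovering
# server reads each tablet's entries with one sequential scan instead of
# scanning the entire shared log.
```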
Speeding up tablet recovery
when the master moves a tablet to another server
the source tablet server performs a minor compaction (the paper actually does two: one while still serving, then a very fast second one after it stops) to persist the memtable to an SSTable, so the target has little or no commit log to replay
Benefits of immutability
SSTables are immutable
concurrency control becomes easy: reads of SSTables need no synchronization
deleting data: write a tombstone marker; physically removing deleted data turns into garbage-collecting obsolete SSTables
splitting a tablet: the child tablets share the SSTables of the parent tablet (sketch after this list)
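A sketch of why a split can be a metadata-only operation when SSTables are immutable (the dict layout is illustrative): the children keep the parent's SSTable list plus a narrower key range, and no data is rewritten at split time:

```python
# Tablet split under immutable SSTables: children *share* the parent's
# SSTables; later compactions localize each child's data.
parent = {"range": ("a", "z"), "sstables": ["sst1", "sst2"]}

def split(tablet, split_key):
    left  = {"range": (tablet["range"][0], split_key), "sstables": tablet["sstables"]}
    right = {"range": (split_key, tablet["range"][1]), "sstables": tablet["sstables"]}
    return left, right

left, right = split(parent, "m")
print(left)   # {'range': ('a', 'm'), 'sstables': ['sst1', 'sst2']}
print(right)  # {'range': ('m', 'z'), 'sstables': ['sst1', 'sst2']}
```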
Memtable
each row is copy-on-write, so reads and writes can proceed in parallel
Advanced features in the real world
Transaction implementation: read locks and write locks taken on specific values stored in the Bigtable itself (this is how Percolator layers multi-row transactions on top of Bigtable)