Bigtable

Paper: https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf

Why Bigtable?

  • Non-relational

    • flexible schema, unlimited columns

  • Highly scalable

    • low latency: ~10 ms for in-memory reads, ~100 ms for disk reads

    • high throughput: 10k qps

  • Not good for

    • relational data

    • transactions (only single-row transactions are supported, no general cross-row transactions)

    • small datasets

Data model

  • Flexible schema (Semi-structured/Sparse)

    • The number of column families is limited, but the number of columns is unlimited

    • One can store data in column names

    • Sparse: if a row doesn't have a value (cell) for a column, that cell is not stored at all.

  • RowKey + ColumnFamily:ColumnName + VersionTimestamp identify a value: (row:string, column:string, time:int64) → string

    • The Bigtable is mapped onto SSTables with a key that consists of (row, column, timestamp)

  • Row ordering

    • key on SSTable: RowKey - ColumnFamily - ColumnName - Timestamp (see the sketch at the end of this section)

    • 1) RowKey: sorted lexicographically in ascending order (A -> Z)

    • 2) ColumnFamily: unsorted

    • 3) Columns: sorted in ascending order within a ColumnFamily

    • 4) Timestamp: latest -> oldest (newest version first)

  • Column

    • Full column name format: FamilyName:ColumnName, where the column name (qualifier) is a string of 0 or more characters. When its length is 0, it is the default column, e.g. FamilyName:

    • The family name is stored as a numeric ID, so it doesn't impact storage much, but the column name does impact storage.

    • ColumnFamily: names are unique within a table, limited to about 100 families per table

      • the smallest granularity for: data type restrictions, replication, access control (ACLs), and garbage collection

  • Some technical details

    • Row data size limit

      • RowKey: 64 KB limit; also used as the index key (METADATA table)

      • row data: around 200MB

      • single version of a cell: 100MB

  • Tablet

    • a tablet holds a contiguous range of row keys (at least 1 row)

    • consists of 0 or more SSTables

      • each locality group in the tablet is stored together in its own SSTable

    • a table starts with 1 tablet, which is split up (by the tablet server) as the data grows
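
To make the sort order concrete, here is a minimal Python sketch (my own illustration, not Bigtable's actual storage format): the map is keyed by (row, column, timestamp), rows sort ascending, versions of a cell sort newest-first, and absent cells simply have no entry. For simplicity the sketch sorts all columns lexicographically, which real Bigtable does not guarantee across column families.

```python
# Toy (row, column, timestamp) -> value map; NOT Bigtable's on-disk format.
from bisect import insort


class TinyTable:
    def __init__(self):
        # each entry: ((row, "family:qualifier", -timestamp), value)
        # negating the timestamp makes newer versions sort first
        self._cells = []

    def put(self, row, column, timestamp, value):
        # sparse: a cell that is never written simply has no entry here
        insort(self._cells, ((row, column, -timestamp), value))

    def read(self, row, column, versions=1):
        """Return up to `versions` most recent (timestamp, value) pairs."""
        out = []
        for (r, c, neg_ts), v in self._cells:
            if (r, c) == (row, column):
                out.append((-neg_ts, v))
                if len(out) == versions:
                    break
        return out


t = TinyTable()
t.put("com.cnn.www", "anchor:cnnsi.com", timestamp=9, value="CNN")
t.put("com.cnn.www", "contents:", timestamp=4, value="<html>v4</html>")
t.put("com.cnn.www", "contents:", timestamp=6, value="<html>v6</html>")
print(t.read("com.cnn.www", "contents:", versions=2))
# [(6, '<html>v6</html>'), (4, '<html>v4</html>')]  -- newest first
```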

Architecture

  • Chubby:

    • bookkeeping of live tablet servers (each server holds a lock on a small file in a Chubby directory)

      • initial tablet server lookup

      • detecting when a tablet server starts or dies

    • stores metadata, e.g. schema (column family) information and ACLs used by create/update table operations

    • master election: ensures there is at most one active master in a Bigtable cell at any time

  • Master server

    • health-checks tablet servers

    • assigns tablets to tablet servers

    • handles garbage collection of files in GFS

    • clients rarely interact with the master

  • Tablet server

    • serves multiple tablets, typically between ten and a thousand per server

    • splits a tablet when its data grows over the size limit (1 GB by default)

    • if a server becomes too hot, some of its tablets are shifted to other servers to distribute the load

    • in memory

      • block cache: recently read SSTable blocks

      • memtable: holds recent mutations

      • bloom filter: one per locality group (built for each SSTable); if the bloom filter returns false, the data is definitely not in that SSTable

      • in-memory locality group: speed up read

  • GFS: the distributed file system that stores the actual data (SSTables and commit logs). Tablet servers do not store data on local disk, other than caching in memory.

  • Tablet

    • essentially a slice of a table, consisting of multiple SSTables (stored in GFS)

    • the unit of load balancing across servers

  • Client: talks to tablet servers directly, after looking up tablet locations (starting from Chubby)

Implementation details

Tablet locations lookup

3 levels:

  • Root tablet (METADATA 0): its location is stored in a file in Chubby; the actual tablet data (SSTables) is stored in GFS. It holds the locations of all other METADATA tablets.

    • The root tablet is never split, so the lookup hierarchy never exceeds 3 levels.

  • METADATA 1: the remaining METADATA tablets store the locations of user tablets.
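
The following Python sketch walks the three levels for a single row key. The dict structures, key prefixes, and server addresses are made up for illustration; real Bigtable issues RPCs to Chubby and to the tablet servers serving the METADATA tablets.

```python
import bisect

# Level 1: a Chubby file holds the location of the root tablet (MD0).
CHUBBY = {"/bigtable/root-tablet-location": "root-tablet"}

# Level 2: the root tablet maps METADATA row ranges -> METADATA (MD1) tablets.
ROOT_TABLETS = {
    "root-tablet": [("md-row-m", "md1-a"), ("md-row-\xff", "md1-b")],
}

# Level 3: each METADATA tablet maps user-tablet end rows -> tablet server address.
METADATA_TABLETS = {
    "md1-a": [("usertable:row-f", "tabletserver-17:2200"),
              ("usertable:row-m", "tabletserver-03:2200")],
    "md1-b": [("usertable:row-\xff", "tabletserver-42:2200")],
}


def _range_lookup(ranges, key):
    """ranges is sorted by end row; return the value of the first range covering key."""
    idx = bisect.bisect_left([end for end, _ in ranges], key)
    return ranges[idx][1]


def locate(user_row_key):
    root = CHUBBY["/bigtable/root-tablet-location"]                       # level 1
    md1 = _range_lookup(ROOT_TABLETS[root], "md-" + user_row_key)         # level 2
    return _range_lookup(METADATA_TABLETS[md1], "usertable:" + user_row_key)  # level 3


print(locate("row-c"))   # tabletserver-17:2200
print(locate("row-x"))   # tabletserver-42:2200
```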

Tablet Assignment

  • Each tablet is assigned to one tablet server at a time.

  • The master (a single process) keeps track of tablet servers and assigns tablets to them

    • Each tablet server registers by acquiring a lock on a file in Chubby

    • If a server fails, it loses its Chubby lock and the master re-assigns its tablets

  • Master workflow

    • 1) acquire the unique master lock in Chubby

    • 2) scan Chubby's servers directory to find all live tablet servers

    • 3) communicate with each live server to learn which tablets are already assigned

    • 4) scan the METADATA table to find unassigned tablets and assign them
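
A self-contained toy sketch of that startup sequence, with plain dicts standing in for Chubby, the tablet servers, and the METADATA table. The least-loaded-server choice in step 4 is my own simplification, not the paper's assignment policy.

```python
# Toy master startup; dicts stand in for Chubby, servers, and METADATA.
chubby = {
    "master_lock_holder": None,
    "servers_dir": ["ts-01", "ts-02"],               # live tablet servers
}
# tablets each live server already reports serving (learned via RPC in reality)
server_tablets = {"ts-01": {"tabletA", "tabletB"}, "ts-02": {"tabletC"}}
# every tablet listed in the METADATA table
metadata_tablets = {"tabletA", "tabletB", "tabletC", "tabletD", "tabletE"}


def master_startup():
    # 1) acquire the unique master lock in Chubby
    assert chubby["master_lock_holder"] is None
    chubby["master_lock_holder"] = "master-1"

    # 2) scan Chubby's servers directory to find live tablet servers
    live = chubby["servers_dir"]

    # 3) ask each live server which tablets it already serves
    assigned = set().union(*(server_tablets[s] for s in live))

    # 4) scan METADATA; any tablet no server reports is unassigned
    for tablet in sorted(metadata_tablets - assigned):
        # simplistic placement: pick the currently least-loaded server
        target = min(live, key=lambda s: len(server_tablets[s]))
        server_tablets[target].add(tablet)
        print(f"assigned {tablet} -> {target}")


master_startup()
```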

Tablet serving

Tablet discovery

  • The client asks Chubby for the location of the root tablet (METADATA 0), whose data is stored in GFS

  • The client reads MD0 -> reads MD1 -> finds the address of the tablet server that holds the target tablet

  • The client caches tablet locations in memory. Optimization: it prefetches the locations of more tablets than it immediately needs

Write path

  • a mutation request arrives at the tablet server

  • the server validates the request format and authorization (checks the ACL stored in a Chubby file, or a cached copy of it)

  • the mutation is persisted to the commit log

  • the mutation is applied to the memtable
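
A minimal sketch of that sequence in Python; the log format, file path, and ACL check are illustrative only.

```python
import json
import time


class WritePath:
    def __init__(self, log_path="commit.log"):
        self.log = open(log_path, "a")      # stands in for the GFS commit log
        self.memtable = {}                  # (row, column) -> [(timestamp, value), ...]

    def apply_mutation(self, row, column, value, caller="alice", acl=("alice",)):
        # 1) validate well-formedness and authorization (ACL from Chubby / cache)
        if caller not in acl:
            raise PermissionError(caller)

        ts = time.time_ns()
        record = {"row": row, "column": column, "ts": ts, "value": value}

        # 2) persist the mutation to the commit log before acknowledging
        self.log.write(json.dumps(record) + "\n")
        self.log.flush()

        # 3) apply the mutation to the in-memory memtable (newest version first)
        self.memtable.setdefault((row, column), []).insert(0, (ts, value))


srv = WritePath()
srv.apply_mutation("com.cnn.www", "contents:", "<html>...</html>")
print(srv.memtable)
```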

Read path

  • the tablet server validates the request format and authorization

  • it queries the memtable + SSTable files and merges the results

    • since SSTables and the memtable are lexicographically sorted data structures, the merged view can be formed efficiently
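
A small sketch of the merge, assuming each source is already sorted by (row, column) and sources are listed newest-first so that the newest version of a duplicated key wins.

```python
import heapq

# each source is sorted by (row, column); values are (timestamp, payload)
memtable = [(("row1", "cf:a"), (9, "newest")), (("row2", "cf:b"), (9, "x"))]
sstable_new = [(("row1", "cf:a"), (5, "older")), (("row3", "cf:c"), (5, "y"))]
sstable_old = [(("row4", "cf:d"), (2, "z"))]


def merged_view(*sorted_sources):
    """Yield (key, value) in key order; the first source containing a key wins."""
    # tag every entry with its source index so equal keys sort newest-source-first
    tagged = []
    for idx, source in enumerate(sorted_sources):
        tagged.append([((key, idx), value) for key, value in source])
    last_key = None
    for (key, _src), value in heapq.merge(*tagged):
        if key != last_key:                  # skip shadowed (older) versions
            yield key, value
            last_key = key


for key, value in merged_view(memtable, sstable_new, sstable_old):
    print(key, value)
# (('row1', 'cf:a'), (9, 'newest')) ... then row2, row3, row4 in sorted order
```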

Compactions

3 levels of compaction

  • Minor compaction: freezes the memtable and writes it out as a new SSTable

  • Merging compaction: merges a few recent SSTables (and the memtable) into one new SSTable

  • Major compaction: rewrites all SSTables into exactly one SSTable

    • deleted data is physically removed at this stage (deletion markers do not survive)
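
An illustrative sketch of the three compaction types, with SSTables modeled as sorted lists and a tombstone marker standing in for Bigtable's deletion entries. Only the major compaction may drop tombstones, since older SSTables that the deletion shadows might otherwise still resurface the data.

```python
# Simplified compactions; SSTables are sorted lists of (key, value).
TOMBSTONE = object()


def merge_tables(tables, drop_deletes):
    """Merge SSTables listed newest-first into one; optionally drop deletions."""
    merged = {}
    for table in reversed(tables):            # oldest first, newer overwrites
        merged.update(dict(table))
    if drop_deletes:                          # only safe in a major compaction
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
    return sorted(merged.items())


def minor_compaction(memtable, sstables):
    """Freeze the memtable and write it out as a new SSTable."""
    sstables.insert(0, sorted(memtable.items()))
    memtable.clear()


def merging_compaction(sstables, n):
    """Replace the n most recent SSTables with one (deletions must be kept)."""
    sstables[:n] = [merge_tables(sstables[:n], drop_deletes=False)]


def major_compaction(sstables):
    """Rewrite all SSTables into exactly one, discarding deleted data."""
    sstables[:] = [merge_tables(sstables, drop_deletes=True)]


memtable = {"row2": "v2-new", "row3": TOMBSTONE}
sstables = [sorted({"row1": "v1", "row3": "v3-old"}.items())]

minor_compaction(memtable, sstables)      # 2 SSTables now
merging_compaction(sstables, n=2)         # 1 SSTable, tombstone kept
major_compaction(sstables)                # tombstone and deleted data dropped
print(sstables)                           # [[('row1', 'v1'), ('row2', 'v2-new')]]
```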

Refinement

Locality

  • Bundle column families that are frequently accessed together into a single locality group for faster access. If not specified, all column families go into a default locality group

  • Bigtable also has in-memory locality groups, which keep the data in memory for fast access

  • Bundling similar data together also helps compression
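
A toy sketch of the idea: column families are mapped to locality groups, and each group is written to its own (here drastically simplified) SSTable, so scanning one group never reads another group's data. The group names and families are made up.

```python
# Illustrative locality groups; one simplified "SSTable" (dict) per group.
locality_groups = {
    "webpage-meta": ["language", "checksum"],   # small, frequently-read families
    "webpage-body": ["contents"],               # large, rarely-scanned family
}
family_to_group = {fam: grp for grp, fams in locality_groups.items() for fam in fams}

sstables = {grp: {} for grp in locality_groups}


def write(row, family, qualifier, value):
    group = family_to_group[family]
    sstables[group][(row, f"{family}:{qualifier}")] = value


def scan_group(group):
    """A scan over one locality group reads only that group's SSTable."""
    return sorted(sstables[group].items())


write("com.cnn.www", "language", "", "EN")
write("com.cnn.www", "contents", "", "<html>...</html>")
print(scan_group("webpage-meta"))   # never touches the large 'contents' data
```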

Compression

  • SSTable blocks are compressed individually, so reads only decompress the blocks they touch

  • Better compression: cluster similar data together by designing the row key accordingly
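
A sketch of block-level compression, using zlib as a stand-in for Bigtable's custom two-pass scheme; the point is that each block is compressed independently, so a read only decompresses the blocks it needs.

```python
import zlib

BLOCK_SIZE = 64 * 1024   # 64 KB, the default SSTable block size in the paper


def compress_sstable(data, block_size=BLOCK_SIZE):
    """Split data into blocks and compress each block on its own."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return [zlib.compress(b) for b in blocks]


def read_block(compressed_blocks, block_index):
    """Only the requested block is decompressed."""
    return zlib.decompress(compressed_blocks[block_index])


# clustering similar data (e.g. pages from the same site share a row-key
# prefix) makes neighbouring blocks highly compressible
data = b"com.cnn.www/page-1 " * 10_000
blocks = compress_sstable(data)
print(len(data), "->", sum(len(b) for b in blocks), "bytes")
print(read_block(blocks, 0)[:20])
```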

Read performance improvement

  • Caching

    • scan cache: key-value pairs, for clients that read the same data repeatedly

    • block cache: SSTable blocks, for clients that read data near what they recently read, e.g. sequential reads, or different columns within a hot locality group

  • Bloom filter: one per locality group (built for each SSTable), kept in tablet server memory; a negative answer avoids a disk read
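
A tiny Bloom-filter sketch (parameters are arbitrary): a negative answer proves the SSTable holds nothing for that row/column, so the disk read can be skipped, while a positive answer may still be a false positive.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # derive num_hashes bit positions from salted hashes of the key
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))


# one filter per SSTable (configured per locality group), kept in memory
bf = BloomFilter()
bf.add("com.cnn.www/anchor:cnnsi.com")
print(bf.might_contain("com.cnn.www/anchor:cnnsi.com"))  # True
print(bf.might_contain("org.example/anchor:foo"))        # False -> skip the SSTable
```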

Commit log

  • a single commit log per tablet server

    • improve write performance

  • optimization for failure recovery

    • the commit log is sorted by the keys <table, row name, log sequence number> during recovery, so the mutations for each tablet are contiguous and can be read sequentially

  • cope with GFS latency spikes

    • each tablet server actually has two log-writing threads, each writing to its own log file; only one of the two threads is actively in use at a time

    • when the active log file performs poorly, the server switches to the other thread/file

    • log sequence numbers let the recovery process eliminate duplicate entries produced by this switching
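
A simplified sketch of the dual-log idea, with in-memory lists standing in for the two GFS log files; the point is the switch on a latency spike and the sequence numbers that let recovery drop duplicated entries.

```python
class DualCommitLog:
    def __init__(self):
        self.logs = [[], []]          # two log "files"
        self.active = 0
        self.sequence = 0

    def append(self, mutation):
        self.sequence += 1
        self.logs[self.active].append((self.sequence, mutation))

    def switch(self):
        """Called when writes to the active log file become slow."""
        self.active = 1 - self.active

    def recover(self):
        """Merge both logs; sequence numbers elide duplicated entries."""
        seen, mutations = set(), []
        entries = sorted(self.logs[0] + self.logs[1], key=lambda e: e[0])
        for seq, mutation in entries:
            if seq not in seen:
                seen.add(seq)
                mutations.append(mutation)
        return mutations


log = DualCommitLog()
log.append({"row": "r1", "value": "a"})
log.switch()                                  # e.g. a GFS latency spike
log.append({"row": "r2", "value": "b"})
# a slow write from around the switch may land in the old log too -> duplicate
log.logs[0].append((2, {"row": "r2", "value": "b"}))
print(log.recover())                          # two mutations, duplicate elided
```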

Speeding up tablet recovery

  • Master moves a tablet to another server

  • the source tablet server performs a minor compaction to persist the memtable to an SSTable, which reduces the amount of commit log the destination server must replay

Benefits of immutability

  • SSTable is immutable

    • concurrency control is easy, i.e. no synchronization is needed when reading SSTables

    • deleting data: write a tombstone; permanently removing deleted data becomes garbage-collecting obsolete SSTables

    • splitting a tablet: child tablets share the SSTables of the parent tablet

  • Memtable

    • each memtable row is copy-on-write, so reads and writes can proceed in parallel
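
A minimal sketch of row-level copy-on-write, assuming a single writer (real Bigtable also needs an atomic pointer swap and memory reclamation): a writer copies the row, mutates the copy, and swaps it in, so an in-progress read keeps a consistent view without locks.

```python
# Row-level copy-on-write in a toy memtable.
memtable = {"com.cnn.www": {"contents:": "<html>v1</html>"}}


def read_row(row_key):
    # the reader just holds a reference to the current row snapshot
    return memtable[row_key]


def write_cell(row_key, column, value):
    old_row = memtable.get(row_key, {})
    new_row = dict(old_row)          # copy-on-write: never mutate old_row
    new_row[column] = value
    memtable[row_key] = new_row      # swap in the new version


snapshot = read_row("com.cnn.www")                          # a read in progress
write_cell("com.cnn.www", "contents:", "<html>v2</html>")   # concurrent write
print(snapshot)                      # still sees <html>v1</html>
print(read_row("com.cnn.www"))       # new readers see <html>v2</html>
```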

Advanced features in the real world

  • Transactions implementation: effectively takes read locks or write locks on specific values in a Bigtable
