Design Job Scheduler (Async task)
Scenario
Functional requirements
Schedule tasks to execute at a specified time, either immediately or after a delay.
Track task progress.
Users can set priorities for tasks. (Redundant if resources are sufficient, and it conflicts with the "scheduled task" promise.)
System requirements
High availability
Tasks should be executed at the scheduled time, within an acceptable margin, e.g. seconds.
The system is 99.9% available to accept tasks.
Failure recovery
If a task fails before completion, it should be retried.
Requirement for clients: tasks should be idempotent, i.e. a retry should produce the same result.
No duplicate/concurrent task execution.
Estimation
1k task insert requests per second.
3k task poll requests.
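To get a feel for what the insert rate means for storage, here is a rough back-of-envelope calculation. The 1 KB average row size is purely an assumption for illustration and does not come from the estimate above.

```python
INSERTS_PER_SECOND = 1_000        # from the estimate above
SECONDS_PER_DAY = 24 * 60 * 60
ASSUMED_ROW_BYTES = 1_000         # hypothetical average size of one task row

# Number of new task rows written per day at a steady insert rate.
tasks_per_day = INSERTS_PER_SECOND * SECONDS_PER_DAY
# Raw storage growth per day, before replication or indexes.
bytes_per_day = tasks_per_day * ASSUMED_ROW_BYTES

print(tasks_per_day)        # 86400000
print(bytes_per_day / 1e9)  # 86.4 (GB per day under the assumed row size)
```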
Service
User-facing API
scheduleTask(taskInfo, taskOptions), returns taskId.
pollTask(taskId)
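A minimal sketch of the two user-facing calls, assuming an in-memory dict in place of the real task table. The class name, the `run_at` option, and the `"SCHEDULED"` status are hypothetical names chosen for illustration.

```python
import time
import uuid
from typing import Any, Dict, Optional


class InMemoryTaskService:
    """Toy stand-in for TaskService; a dict replaces the task table."""

    def __init__(self) -> None:
        self._tasks: Dict[str, Dict[str, Any]] = {}

    def schedule_task(self, task_info: Dict[str, Any],
                      task_options: Optional[Dict[str, Any]] = None) -> str:
        options = dict(task_options or {})
        # Immediate execution is modelled as "run_at = now" (assumption).
        options.setdefault("run_at", time.time())
        task_id = uuid.uuid4().hex
        self._tasks[task_id] = {
            "task_info": dict(task_info),
            "task_options": options,
            "status": "SCHEDULED",  # hypothetical initial status
        }
        return task_id

    def poll_task(self, task_id: str) -> Optional[str]:
        # Returns the current status, or None for an unknown taskId.
        task = self._tasks.get(task_id)
        return task["status"] if task else None
```

A caller would keep the returned id and poll with it, e.g. `svc.poll_task(svc.schedule_task({"job": "resize"}))`.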
TaskService
TaskLibrary: interacts with the Task system
TriggerService: monitors tasks and triggers their execution
TaskWorker: claims and executes tasks, updates status
Option 2: TaskUpdaterService, a separate internal service that handles updates from workers instead of routing them through TaskService.
pros: separation of concerns; worker updates don't interfere with user interaction. Dropbox uses this option, whereas some Google internal frameworks use option 1.
cons: higher maintenance cost. Some common logic is duplicated, e.g. auth, rate limiting, logging, etc.
Key designs
Life of a task
A user schedules a task by calling the TaskService API.
TaskService performs the necessary validation (auth, task info, etc.), then inserts the task into storage and returns the taskId to the caller.
TriggerService periodically polls the DB for tasks that have reached their scheduled time, or that are retriable, sends them to the Message Queue, and updates their status in the DB.
A TaskWorker listens to the Message Queue, claims a task (acquires a lease) via the TaskService API, then starts execution. It periodically updates the task status in TaskService (extending the lease) as a heartbeat.
The user can query task progress with the taskId, which is also handled by TaskService.
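The TriggerService step in the flow above can be sketched as a single polling pass. This is an illustration under assumptions: the table is a plain dict, the Message Queue is a `queue.Queue`, and the status names (`SCHEDULED`, `RETRIABLE`, `QUEUED`) are hypothetical.

```python
import queue
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Task:
    task_id: str
    run_at: float              # scheduled execution time (epoch seconds)
    status: str = "SCHEDULED"  # hypothetical status name


DISPATCHABLE = {"SCHEDULED", "RETRIABLE"}  # hypothetical status names


def trigger_once(table: Dict[str, Task], mq: "queue.Queue[str]",
                 now: float) -> List[str]:
    """One polling pass: enqueue due tasks and mark them QUEUED."""
    dispatched = []
    for task in table.values():
        if task.status in DISPATCHABLE and task.run_at <= now:
            mq.put(task.task_id)
            # Updating the status keeps the next pass from re-sending it.
            task.status = "QUEUED"
            dispatched.append(task.task_id)
    return dispatched
```

Running the pass twice with the same `now` dispatches nothing the second time, because every dispatched task has already moved out of the dispatchable statuses.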
How to guarantee?
Exactly-once execution
Retry failed tasks, unless they are marked unretriable.
No concurrent execution
TriggerService only handles tasks in a relevant status. It takes an exclusive lock on the row when updating task status.
Option 1: Workers terminate themselves if they fail to update status in time, i.e. before the lease ends. Scenario: a task receives no update from its worker before the lease ends -> we start a retry execution -> the previous worker is still executing and merely failed to send its status update.
Option 2: When a worker claims a task, check whether another worker holds a valid lease on it; if so, reject the claim.
pros: preferred. Avoids abandoning work that is already completed.
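Both options rely on the same lease check, sketched below. The lease length, field names, and function names are assumptions for illustration; in a real store the read-and-write inside `claim` would have to be a single atomic conditional update (or run under the row lock mentioned above).

```python
from dataclasses import dataclass
from typing import Optional

LEASE_SECONDS = 30.0  # hypothetical lease length


@dataclass
class TaskLease:
    holder: Optional[str] = None  # worker id currently holding the lease
    expires_at: float = 0.0       # epoch seconds


def claim(lease: TaskLease, worker_id: str, now: float) -> bool:
    """Option 2: reject the claim if another worker holds a valid lease."""
    held_by_other = (lease.holder not in (None, worker_id)
                     and lease.expires_at > now)
    if held_by_other:
        return False
    lease.holder = worker_id
    lease.expires_at = now + LEASE_SECONDS
    return True


def heartbeat(lease: TaskLease, worker_id: str, now: float) -> bool:
    """Extend the lease. False means the worker lost it and must stop (option 1)."""
    if lease.holder != worker_id or lease.expires_at <= now:
        return False
    lease.expires_at = now + LEASE_SECONDS
    return True
```

A worker that gets `False` from `heartbeat` should stop executing, since a retry may already have been handed to another worker.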
Storage
Task table
schema:
id, owner, task_info, task_option, creation_timestamp, status
NoSQL is a good fit, since there is no transaction requirement.
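As a sketch, one row of the task table could be modelled as a document keyed by `id`. The field names follow the schema above; the status values and the JSON encoding are assumptions, not something the schema specifies.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Any, Dict


class Status(str, Enum):  # hypothetical status values
    SCHEDULED = "SCHEDULED"
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


@dataclass
class TaskRow:
    id: str
    owner: str
    task_info: Dict[str, Any]
    task_option: Dict[str, Any]
    creation_timestamp: float = field(default_factory=time.time)
    status: Status = Status.SCHEDULED

    def to_document(self) -> str:
        """Serialize the row to a JSON document for a key-value store."""
        return json.dumps(asdict(self), sort_keys=True)
```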
Scalability
TaskService
Replicas for TaskService + Load Balancer
Cache for read path
Task table
Replicas
Partition on taskId?
What if TriggerService is down?
Option 1: Hot backup with a redundant server. A daemon (or cluster management, ZooKeeper?) detects the failed server and switches to the backup.
cons: wastes resources; those backups could be serving traffic as well.
Option 2: Replicas of TriggerService, each responsible for only a subset of tasks: 1) shard the Tasks table on taskId; 2) have each TriggerService query specific users/taskTypes.
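For option 2 with sharding on taskId, each TriggerService replica needs a deterministic way to decide which tasks it owns. Below is a minimal sketch using a hash modulo the shard count. The function names are hypothetical, and this simple scheme reshuffles most tasks whenever the shard count changes.

```python
import hashlib
from typing import Iterable, List


def shard_of(task_id: str, num_shards: int) -> int:
    """Map a taskId to a shard index in [0, num_shards)."""
    # A stable hash is used because Python's built-in hash() of strings
    # is randomized per process.
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def tasks_for_replica(task_ids: Iterable[str], replica_index: int,
                      num_shards: int) -> List[str]:
    """Return the tasks that the given TriggerService replica should poll."""
    return [t for t in task_ids if shard_of(t, num_shards) == replica_index]
```

Every task lands on exactly one replica, so two replicas never dispatch the same task as long as they agree on `num_shards`.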
Resources
PagerDuty: https://www.youtube.com/watch?v=s3GfXTnzG_Y
focus on: data center failure recovery, task ordering
A mock interview: https://www.youtube.com/watch?v=Zs4O-Oo5aTc