Design a Job Scheduler (Async Tasks)

Scenario

Functional requirements

  • Schedule tasks to execute at a specified time, either immediately or after a delay.

  • Track task progress.

  • Users can set priorities for tasks. (Redundant if resources are sufficient, and it conflicts with the "scheduled task" promise.)

System requirements

  • High availability

    • Tasks should be executed at their scheduled time, within an acceptable margin, e.g. seconds.

    • The system is 99.9% available to accept tasks.

  • Failure recovery

    • If a task fails before completion, it should be retried.

    • Requirement for clients: tasks should be idempotent, i.e. a retry should produce the same result.

  • No duplicate/concurrent task execution.

Estimation

  • 1k task insert requests per second.

  • 3k task poll requests per second.
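
As a rough back-of-envelope check, the rates above can be turned into daily volumes. This sketch assumes the poll figure is also per second and that load is steady around the clock; the 1 KB average row size is purely an illustrative assumption and does not come from these notes.

```python
# Back-of-envelope sizing from the estimates above.
SECONDS_PER_DAY = 24 * 60 * 60

insert_qps = 1_000       # task inserts per second (from the estimate)
poll_qps = 3_000         # task polls per second (ASSUMED to be per second too)
avg_task_bytes = 1_024   # ASSUMPTION: illustrative average size of one task row


def daily_volume(qps: int) -> int:
    """Number of requests per day at a steady per-second rate."""
    return qps * SECONDS_PER_DAY


inserts_per_day = daily_volume(insert_qps)
polls_per_day = daily_volume(poll_qps)
storage_per_day_gb = inserts_per_day * avg_task_bytes / 1e9

print(f"inserts/day:  {inserts_per_day:,}")
print(f"polls/day:    {polls_per_day:,}")
print(f"new data/day: {storage_per_day_gb:.1f} GB")
```

Under these assumptions the read path sees about three times the write volume, which is the motivation for caching reads.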

Service

  • User-facing API

    • scheduleTask(taskInfo, taskOptions): returns a taskId.

    • pollTask(taskId): returns the task's current status/progress.

  • TaskService

  • TaskLibrary: interacts with the task system

  • TriggerService: monitors tasks and triggers their execution when due

  • TaskWorker: claims and executes tasks, updates their status

  • Option 2: TaskUpdaterService: an internal service that handles updates from workers (option 1 is for workers to report to TaskService directly).

    • pros: separation of concerns; worker traffic does not interfere with user interaction. Dropbox uses this option, whereas some Google internal frameworks use option 1.

    • cons: higher maintenance cost; some common logic is duplicated, e.g. auth, rate limiting, logging.
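
The user-facing API above could look roughly like the following. This is a minimal in-memory sketch, not a definitive design: the TaskOptions fields, the TaskStatus values and the validation rule are all assumptions made for illustration, and a real TaskService would authenticate the caller and persist to the task table.

```python
import dataclasses
import time
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class TaskStatus(Enum):
    # ASSUMPTION: the notes do not enumerate statuses; these are illustrative.
    SCHEDULED = "scheduled"
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class TaskOptions:
    # ASSUMPTION: hypothetical option fields.
    run_at: Optional[float] = None  # epoch seconds; None means "run immediately"
    priority: int = 0
    retriable: bool = True


class TaskService:
    """In-memory stand-in for the user-facing part of TaskService."""

    def __init__(self) -> None:
        self._tasks: Dict[str, dict] = {}

    def schedule_task(self, task_info: dict, options: Optional[TaskOptions] = None) -> str:
        if not task_info:
            raise ValueError("task_info must not be empty")
        # Copy the options so the caller's object is never mutated.
        opts = dataclasses.replace(options) if options else TaskOptions()
        if opts.run_at is None:
            opts.run_at = time.time()
        task_id = uuid.uuid4().hex
        self._tasks[task_id] = {
            "task_info": task_info,
            "options": opts,
            "status": TaskStatus.SCHEDULED,
        }
        return task_id

    def poll_task(self, task_id: str) -> TaskStatus:
        # Raises KeyError for an unknown task id.
        return self._tasks[task_id]["status"]


service = TaskService()
task_id = service.schedule_task({"kind": "send_email"}, TaskOptions(run_at=time.time() + 60))
print(task_id, service.poll_task(task_id))
```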

Key designs

Life of a task

  1. The user schedules a task by calling the TaskService API.

  2. TaskService performs the necessary validation (auth, task info, etc.), then inserts the task into storage and returns the taskId to the caller.

  3. TriggerService periodically polls the DB for tasks that have reached their scheduled time, or that are retriable, sends them to the message queue, and updates their status in the DB.

  4. A TaskWorker listens to the message queue, claims a task (acquires a lease) via the TaskService API, then starts execution. It periodically reports task status to TaskService (extending the lease) as a heartbeat.

  5. The user can query task progress by taskId, which is also handled by TaskService.
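
Step 3 is the core of the scheduler, so here is a minimal sketch of a single TriggerService polling pass. It assumes a dict standing in for the task table and queue.Queue standing in for the message queue; the status names and the TaskRow type are hypothetical.

```python
import queue
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskRow:
    task_id: str
    run_at: float              # scheduled time, epoch seconds
    status: str = "SCHEDULED"  # ASSUMED flow: SCHEDULED -> QUEUED -> RUNNING -> DONE / RETRIABLE


def trigger_once(db: Dict[str, TaskRow], mq: "queue.Queue[str]", now: float) -> List[str]:
    """One polling pass: enqueue tasks that are due or retriable, return their ids."""
    triggered = []
    for row in db.values():
        due = row.status == "SCHEDULED" and row.run_at <= now
        if due or row.status == "RETRIABLE":
            row.status = "QUEUED"  # record the hand-off so the next pass skips this task
            mq.put(row.task_id)
            triggered.append(row.task_id)
    return triggered


db = {
    "a": TaskRow("a", run_at=100.0),
    "b": TaskRow("b", run_at=200.0),
    "c": TaskRow("c", run_at=50.0, status="RETRIABLE"),
}
mq: "queue.Queue[str]" = queue.Queue()
print(trigger_once(db, mq, now=150.0))  # "a" is due and "c" is retriable; "b" is not due yet
print(trigger_once(db, mq, now=150.0))  # nothing left to trigger at the same instant
```

In a real deployment the status update and the enqueue are not atomic, so a crash between them can leave a task stuck or enqueued twice; the lease check on the worker side is what keeps a duplicate message from turning into a duplicate execution.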

How to guarantee?

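  • Exactly-once execution

    • Retry failed tasks unless they are marked non-retriable. Because tasks are required to be idempotent, a retry produces the same result as a single successful run.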

  • No concurrent execution

    • TriggerService only handles tasks in the relevant statuses, and takes an exclusive row lock when updating a task's status.

    • Option 1: Workers terminate themselves if they fail to update status in time, i.e. before the lease ends. Scenario: no update arrives from the worker before the lease ends -> we start a retry execution -> the previous worker is still executing and merely failed to send its status update.

    • Option 2: When a worker claims a task, check whether another worker holds a valid lease on it; if so, reject the claim.

      • pros: preferred, since it avoids abandoning work that is already completed.
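
The lease rules above can be sketched as a small in-process table. This is only an illustration under stated assumptions: the 30-second lease length, the LeaseTable class and its method names are hypothetical, and the threading.Lock stands in for whatever atomic check-and-set (row lock or conditional write) the real store provides.

```python
import threading
from dataclasses import dataclass
from typing import Dict, Optional

LEASE_SECONDS = 30.0  # ASSUMPTION: illustrative lease length


@dataclass
class Lease:
    holder: Optional[str] = None
    expires_at: float = 0.0


class LeaseTable:
    def __init__(self) -> None:
        self._leases: Dict[str, Lease] = {}
        self._lock = threading.Lock()  # stands in for the DB's atomic update

    def claim(self, task_id: str, worker_id: str, now: float) -> bool:
        """Option 2: reject the claim if another worker holds a valid lease."""
        with self._lock:
            lease = self._leases.setdefault(task_id, Lease())
            held_by_other = lease.holder not in (None, worker_id)
            if held_by_other and lease.expires_at > now:
                return False
            lease.holder = worker_id
            lease.expires_at = now + LEASE_SECONDS
            return True

    def extend(self, task_id: str, worker_id: str, now: float) -> bool:
        """Heartbeat. A False result means the lease is lost and the worker should stop (option 1)."""
        with self._lock:
            lease = self._leases.get(task_id)
            if lease is None or lease.holder != worker_id or lease.expires_at <= now:
                return False
            lease.expires_at = now + LEASE_SECONDS
            return True


table = LeaseTable()
print(table.claim("t1", "w1", now=0.0))   # first claim succeeds
print(table.claim("t1", "w2", now=10.0))  # rejected while w1's lease is valid
print(table.claim("t1", "w2", now=45.0))  # succeeds once w1's lease has expired
```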

Storage

  • Task table

    • schema: id, owner, task_info, task_option, creation_timestamp, status

    • NoSQL is a good fit, since there is no transaction requirement.
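
To make the schema concrete, one task row can be modelled as a single self-contained document keyed by id, which is the shape a key-value or document store expects. The field names follow the schema above; the JSON encoding and the TaskRecord class are assumptions for illustration only.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TaskRecord:
    id: str
    owner: str
    task_info: dict
    task_option: dict
    creation_timestamp: float
    status: str

    def to_document(self) -> str:
        """Serialize the whole row as one JSON document."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_document(cls, doc: str) -> "TaskRecord":
        """Rebuild a row from its JSON document."""
        return cls(**json.loads(doc))


record = TaskRecord(
    id="task-1",
    owner="alice",
    task_info={"kind": "send_email"},
    task_option={"run_at": 1700000000.0, "priority": 0},
    creation_timestamp=1699999000.0,
    status="SCHEDULED",
)
doc = record.to_document()
print(doc)
```

Every read and write touches exactly one document, so nothing in this access pattern needs a transaction that spans rows.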

Scalability

  • TaskService

    • Replicas for TaskService + Load Balancer

    • Cache for read path

  • Task table

    • Replicas

    • Partition on taskId?

  • What if TriggerService is down?

    • Option 1: hot standby, i.e. a redundant server. A daemon (or a cluster manager, ZooKeeper?) detects the failed server and switches to the backup server.

      • cons: wastes resources, since the backups could be serving traffic as well.

    • Option 2: replicas of TriggerService, each responsible for only a subset of tasks: 1) shard the Task table on taskId; or 2) have each TriggerService query specific users/taskTypes.
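
For option 2 with sharding on taskId, one simple way to split the work is to hash the taskId and take it modulo the number of TriggerService replicas. The sketch below assumes a fixed replica count and hypothetical function names; plain modulo sharding reshuffles most tasks when the replica count changes, so a real system might prefer consistent hashing.

```python
import hashlib
from typing import Iterable, List


def shard_for(task_id: str, num_shards: int) -> int:
    """Map a taskId to a shard with a stable hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def tasks_for_replica(task_ids: Iterable[str], replica_index: int, num_replicas: int) -> List[str]:
    """The subset of tasks that one TriggerService replica is responsible for polling."""
    return [t for t in task_ids if shard_for(t, num_replicas) == replica_index]


task_ids = [f"task-{i}" for i in range(10)]
for replica in range(3):
    print(replica, tasks_for_replica(task_ids, replica, num_replicas=3))
```

Each task lands on exactly one replica, so no two TriggerService instances poll the same rows.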
