Introduction to Cloud Tasks
Introduction
Cloud Tasks (contained in the rms-cloud-tasks package) is a framework for running
independent tasks on cloud providers with automatic compute instance and task queue
management. It is specifically designed for running the same code multiple times in a
batch environment to process a series of different inputs. For example, the program could
be an image processing program that takes the image filename as an argument, downloads the
image from the cloud, performs some manipulations, and writes the result to a cloud-based
location. It is very important that the tasks are completely independent; no communication
between them is supported. Also, the processing happens entirely in a batch mode: a
certain number of compute instances are created, they all process tasks in parallel, and
then the compute instances are destroyed.
rms-cloud-tasks is a product of the PDS Ring-Moon Systems Node.
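The independence requirement described above can be illustrated with a minimal sketch. It does not use the Cloud Tasks API; process_task and the in-memory "images" are hypothetical stand-ins for a real program that would download an image, manipulate it, and upload the result. The point is only that each task derives its output from its own input, so the order of execution does not matter.

```python
def process_task(task):
    """Hypothetical task body: the output depends only on this task's input."""
    # Invert 8-bit pixel values as a stand-in for real image manipulation.
    inverted = [255 - p for p in task["pixels"]]
    return {"output": f"processed/{task['image']}", "pixels": inverted}


# Made-up inputs standing in for images stored in the cloud.
tasks = [
    {"image": "a.png", "pixels": [0, 128, 255]},
    {"image": "b.png", "pixels": [10, 20, 30]},
]

# Because no task reads state written by another, any order gives the same results.
forward = {t["image"]: process_task(t) for t in tasks}
backward = {t["image"]: process_task(t) for t in reversed(tasks)}
```

A task written this way can be handed to any number of compute instances without coordination, which is the property the batch model relies on.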
Features
Cloud Tasks is extremely easy to use with a simple command line interface and straightforward configuration file. It supports AWS and GCP compute instances and queues along with the ability to run jobs on a local workstation, all using a provider-independent API. Although each cloud provider has implemented similar functionality as part of their offering (e.g. GCP’s Cloud Batch), Cloud Tasks is unique in that it unifies all supported providers into a single, simple, universal system that does not require learning the often-complicated details of the official full-featured services.
Cloud Tasks consists of four primary components:
- A Python module to make parallel execution simple
  - Allows conversion of an existing Python program to a parallel task with only a few lines of code
  - Supports both cloud compute instance and local machine environments
  - Executes each task in its own process for complete isolation
  - Reads task information from a cloud-based task queue or directly from a local file
  - Monitors the state of spot instances to notify tasks of upcoming preemption
- A command line interface to manage the task queue system, which allows
  - Loading of tasks from a JSON or YAML file
  - Checking the status of a queue
  - Purging a queue of remaining tasks
  - Deleting a queue entirely
- A command line interface to query the cloud about available resources, given certain constraints
  - Types of compute instances available, including price (both on-demand and spot instances)
  - VM boot images available
  - Regions and zones
- A command line interface to manage a pool of compute instances optimized for price, given certain constraints
  - Automatically finds the optimal compute instance type given pricing and other constraints
  - Automatically determines the number of simultaneous instances to use
  - Creates new instances and runs a specified startup script to execute the task manager
  - Monitors instances for failure or preemption and creates new instances as needed to keep the compute pool full
  - Detects when all jobs are complete and terminates the instances
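The idea of executing each task in its own process can be sketched with the Python standard library alone. This is not the Cloud Tasks implementation or its API; TASK_CODE and run_task_isolated are hypothetical names, and the task body is a toy calculation. The sketch only shows why isolation is useful: a task that crashes does not take the other tasks down with it.

```python
import subprocess
import sys

# Hypothetical task body, executed in a fresh interpreter for every task.
# An input of 0 raises ZeroDivisionError, which simulates a crashing task.
TASK_CODE = """
import sys
value = int(sys.argv[1])
print(100 // value)
"""


def run_task_isolated(value, timeout=30):
    """Run one task in its own process and return (succeeded, output)."""
    result = subprocess.run(
        [sys.executable, "-c", TASK_CODE, str(value)],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0, result.stdout.strip()


# The failure of the middle task leaves the other two unaffected.
results = {value: run_task_isolated(value) for value in [5, 0, 20]}
```

Here the failed task is reported through its exit code rather than as an exception in the parent, so the parent can decide whether to retry it or move on.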
Installation
cloud_tasks consists of a command line interface (called cloud_tasks) and a Python
module (also called cloud_tasks). They are available via the rms-cloud-tasks package
on PyPI and can be installed with:
pip install rms-cloud-tasks
Note that this will install cloud_tasks into your current system Python, or into your
currently activated virtual environment (venv), if any.
If you already have the rms-cloud-tasks package installed but wish to upgrade to a
more recent version, you can use:
pip install --upgrade rms-cloud-tasks
You may also install cloud_tasks using pipx, which will isolate the installation from
your system Python without requiring the creation of a virtual environment. To install
pipx, please see the pipx installation instructions. Once pipx is available, you
may install cloud_tasks with:
pipx install rms-cloud-tasks
If you already have the rms-cloud-tasks package installed with pipx, you may
upgrade to a more recent version with:
pipx upgrade rms-cloud-tasks
Using pipx is useful only if you want the command line interface and do not need to
access the Python module; in exchange, you do not have to worry about the Python
version, setting up a virtual environment, etc.
Basic Examples
The cloud_tasks command line program supports many useful commands that control the task
queue and the compute instance pool and retrieve general information about the cloud in a
provider-independent manner. A few examples are given below.
To get a list of available commands:
cloud_tasks --help
To get help on a particular command:
cloud_tasks load_queue --help
To list all ARM64-based compute instance types that have 2 to 4 vCPUs and at most 4 GB memory per vCPU:
cloud_tasks list_instance_types \
--provider gcp --region us-central1 \
--min-cpu 2 --max-cpu 4 --arch ARM64 --max-memory-per-cpu 4
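The constraints in the command above can be read as a simple filter over a catalog of instance types. The sketch below is only an illustration of that reading, not the tool's code: InstanceType, CATALOG, and filter_instance_types are hypothetical names, and the instance names and prices are made up. Ranking by price per vCPU at the end is one plausible way to compare the matches, assumed here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    arch: str
    vcpus: int
    memory_gb: float
    price_per_hour: float  # made-up prices, for illustration only


# Hypothetical catalog; a real one would come from the cloud provider.
CATALOG = [
    InstanceType("small-arm", "ARM64", 2, 8.0, 0.07),
    InstanceType("medium-arm", "ARM64", 4, 16.0, 0.12),
    InstanceType("highmem-arm", "ARM64", 4, 32.0, 0.20),
    InstanceType("large-arm", "ARM64", 8, 32.0, 0.28),
    InstanceType("medium-x86", "X86_64", 4, 16.0, 0.13),
]


def filter_instance_types(catalog, arch, min_cpu, max_cpu, max_memory_per_cpu):
    """Keep types matching the architecture, vCPU range, and memory-per-vCPU cap."""
    return [
        it
        for it in catalog
        if it.arch == arch
        and min_cpu <= it.vcpus <= max_cpu
        and it.memory_gb / it.vcpus <= max_memory_per_cpu
    ]


# Mirrors: --min-cpu 2 --max-cpu 4 --arch ARM64 --max-memory-per-cpu 4
matches = filter_instance_types(CATALOG, "ARM64", 2, 4, 4)

# Assumed ranking: lowest price per vCPU among the matching types.
cheapest = min(matches, key=lambda it: it.price_per_hour / it.vcpus)
```

Note that the memory limit is applied per vCPU, so a 4-vCPU type with 16 GB passes while one with 32 GB does not.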
To load a JSON file containing task descriptions into the task queue:
cloud_tasks load_queue \
--provider gcp --region us-central1 --project-id my-project \
--job-id my-job --task-file mytasks.json
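A task file such as mytasks.json can be generated by a short script. The sketch below assumes that each task is a JSON object with a task_id string and a data dictionary; that schema is an assumption not stated on this page, so check the task file step of the Quick Start Guide before relying on it. The function names and image filenames are hypothetical.

```python
import json
from pathlib import Path


def build_tasks(filenames):
    """Build one task per input file (assumed schema: task_id plus data)."""
    return [
        {"task_id": f"task-{i:04d}", "data": {"image": name}}
        for i, name in enumerate(filenames, start=1)
    ]


def tasks_to_json(tasks):
    """Serialize the task list as an indented JSON array."""
    return json.dumps(tasks, indent=2) + "\n"


def write_task_file(tasks, path):
    """Write the task list to a file that could be given to --task-file."""
    Path(path).write_text(tasks_to_json(tasks), encoding="utf-8")


# Hypothetical inputs; real filenames would come from your own data set.
tasks = build_tasks(["img_a.png", "img_b.png", "img_c.png"])
```

Calling write_task_file(tasks, "mytasks.json") would then produce a file with the name used in the load_queue example above, provided the schema assumption holds.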
To start automatic creation and management of a compute instance pool:
cloud_tasks manage_pool --provider gcp --config myconfig.yaml
Contributing
Information on contributing to this package can be found in the Contributing Guide.
Licensing
This code is licensed under the Apache License v2.0.
Contents:
- Introduction to Cloud Tasks
- Quick Start Guide
  - Step 1: Install the Cloud Tasks CLI and Python Module
  - Step 2: Modify Your Code to be a Worker
  - Step 3: Create a Task File
  - Interlude - Running Tasks Locally
  - Step 4: Create a Startup Script
  - Step 5: Create a Configuration File
  - Step 6: Load the Task Queue and Run the Worker
  - Step 7: Monitor the Job
  - Step 8: Stop the Job
  - Step 9: Purge the Task Queue
- Configuration and Instance Selection
- Command Line Interface Reference
- Provider-Specific Documentation
- Examples
- Writing a Worker Task
- The Worker API
- Principles for Greatest Success (and Least Frustration)