.. _config:

Configuration and Instance Selection
====================================

Cloud Tasks has a flexible configuration system. Options can be specified in a YAML-format configuration file or on the command line, and very few options are required for basic use. The configuration file supports global options for system configuration, options for selecting compute instances and running jobs, and provider-specific options including authentication and job options that can override the other options.

A configuration file has the following structure (all sections are optional):

.. code-block:: yaml

    [Global options]
    run:
      [Run options]
    aws:
      [AWS-specific options]
      [AWS-specific run options]
    gcp:
      [GCP-specific options]
      [GCP-specific run options]

Global Options
--------------

The available global options are:

* ``provider``: The cloud provider to use (select one of ``aws`` or ``gcp``)

The ``provider`` option must be specified either in the configuration file or on the command line. In addition to determining which cloud provider to contact, it is used to determine which provider-specific options in the configuration file are relevant.

Run Options
-----------

Run options come in several flavors:

* :ref:`config_compute_instance_options`
* :ref:`config_number_of_instances_options`
* :ref:`config_vm_options`
* :ref:`config_boot_options`
* :ref:`config_worker_and_manage_pool_options`

.. _config_compute_instance_options:

Options to select a compute instance type
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Generally speaking, within the constraints provided, the system will attempt to use the instance type with the lowest cost per vCPU with the maximum number of vCPUs per instance.
This results in needing the fewest instances to get the job done, since each instance can do maximal work; this may or may not be an appropriate choice for your workload (for example, having a large number of vCPUs, and thus simultaneously running tasks, may result in the tasks being throttled by the network or disk bandwidth).

With no constraints, the system will tend to choose the cheapest (and probably worst-performing) instance type with the least memory, the least disk space, and the slowest disk type. *Thus, while no constraints are required, it is recommended to specify at least some minimal constraints to avoid selecting the worst possible instance type.*

If you need specific performance, specify the instance types you are willing to accept as a regular expression. For example, to allow all GCP "N2" instances, specify ``instance_types: "^n2-.*"``. This will still give the system freedom to choose the best instance type within that family given the other constraints. Alternatively, you can specify ``cpu_family``, ``min_cpu_rank``, or ``max_cpu_rank`` if you don't want to look up the specific instance types that are relevant to your needs. For example, ``min_cpu_rank: 21`` will specify a fast processor (Intel Sapphire Rapids or better). Note that it is quite possible to over-constrain the system such that no instance types meet the requirements.

Many attributes can be specified in multiple ways. For example, the minimum amount of memory can be specified using ``min_total_memory``, ``min_memory_per_cpu``, or ``min_memory_per_task``. Multiple constraints can be specified for the same attribute and the system will use the most-constraining value.

To get a list of the available instance types and their attributes, including number of vCPUs, amount of memory, CPU family and performance rank, price, etc., you can use the :ref:`cli_list_instance_types` command. Include the ``--detail`` option to see all available attributes.

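
The "most-constraining value" rule for memory can be illustrated with a small Python sketch. This is not Cloud Tasks code: the helper name ``effective_min_memory`` and its signature are hypothetical. It assumes only what this section states, namely that per-CPU and per-task minimums scale with the instance's vCPU count and that the largest resulting minimum wins.

```python
from typing import Optional


def effective_min_memory(
    vcpus: int,
    min_total_memory: Optional[float] = None,
    min_memory_per_cpu: Optional[float] = None,
    min_memory_per_task: Optional[float] = None,
    cpus_per_task: int = 1,
) -> float:
    """Return the tightest lower bound on instance memory in GB (hypothetical helper)."""
    candidates = [0.0]  # no constraint at all means any amount of memory is acceptable
    if min_total_memory is not None:
        candidates.append(min_total_memory)
    if min_memory_per_cpu is not None:
        candidates.append(min_memory_per_cpu * vcpus)
    if min_memory_per_task is not None:
        # Per-task values convert to per-CPU values via cpus_per_task.
        candidates.append(min_memory_per_task / cpus_per_task * vcpus)
    # The most-constraining minimum is the largest one.
    return max(candidates)


# A 32-vCPU instance with 4 vCPUs per task and 32 GB per task needs 8 * 32 GB.
print(effective_min_memory(32, min_total_memory=64,
                           min_memory_per_task=32, cpus_per_task=4))  # 256.0
```

Under these assumptions, the per-task requirement (256 GB) is more constraining than the 64 GB total minimum on a 32-vCPU instance, so only instance types with at least 256 GB would qualify.
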

General Constraints
+++++++++++++++++++

* ``architecture``: The architecture to use; valid values are ``X86_64`` and ``ARM64`` (defaults to ``X86_64``)
* ``cpu_family``: The CPU family to use, for example ``Intel Cascade Lake`` or ``AMD Genoa``.
* ``min_cpu_rank``: The minimum CPU performance rank to use (0 is the slowest)
* ``max_cpu_rank``: The maximum CPU performance rank to use (0 is the slowest)
* ``instance_types``: A single instance type or list of instance types to use; instance types are specified using Python-style regular expressions (if no anchor character like ``^`` or ``$`` is specified, the given string will match any part of the instance type name)

CPU
+++

* ``min_cpu``: The minimum number of vCPUs per instance
* ``max_cpu``: The maximum number of vCPUs per instance
* Derived from instance task information (the number of CPUs = ``cpus_per_task`` * ``tasks_per_instance``)

  * ``cpus_per_task``: The number of vCPUs per task (defaults to 1)
  * ``min_tasks_per_instance``: The minimum number of tasks per instance
  * ``max_tasks_per_instance``: The maximum number of tasks per instance

Memory
++++++

* ``min_total_memory``: The minimum amount of memory in GB per instance
* ``max_total_memory``: The maximum amount of memory in GB per instance
* Per-CPU constraints

  * ``min_memory_per_cpu``: The minimum amount of memory in GB per vCPU
  * ``max_memory_per_cpu``: The maximum amount of memory in GB per vCPU

* Per-task constraints (these are the same as the per-CPU constraints and simply use the ``cpus_per_task`` value as a conversion factor)

  * ``cpus_per_task``: The number of vCPUs per task (defaults to 1)
  * ``min_memory_per_task``: The minimum amount of memory in GB per task
  * ``max_memory_per_task``: The maximum amount of memory in GB per task

SSD Storage
+++++++++++

Some instance types have additional local SSD storage in addition to whatever volume is mounted as the boot disk and these constraints apply to them.
By specifying a minimum SSD size you are also constraining the instance type to those that have an extra SSD attached.

* ``min_local_ssd``: The minimum amount of local extra SSD storage in GB per instance
* ``max_local_ssd``: The maximum amount of local extra SSD storage in GB per instance
* Per-CPU constraints: the total amount of storage will be the sum of the base size and the product of the number of vCPUs and the per-CPU amount; the base size is optional and defaults to 0

  * ``local_ssd_base_size``: The amount of local extra SSD storage in GB present before allocating additional space per vCPU
  * ``min_local_ssd_per_cpu``: The minimum amount of local extra SSD storage in GB per vCPU
  * ``max_local_ssd_per_cpu``: The maximum amount of local extra SSD storage in GB per vCPU

* Per-task constraints (these are the same as the per-CPU constraints and simply use the ``cpus_per_task`` value as a conversion factor)

  * ``cpus_per_task``: The number of vCPUs per task (defaults to 1)
  * ``local_ssd_base_size``: The amount of local extra SSD storage in GB present before allocating additional space per task
  * ``min_local_ssd_per_task``: The minimum amount of local extra SSD storage in GB per task
  * ``max_local_ssd_per_task``: The maximum amount of local extra SSD storage in GB per task

Boot Disk
+++++++++

The boot disk size and type are configurable at instance creation time and are not intrinsic properties of a provider's instance type. As such, there are no "constraints" on the boot disk size. Instead, there are simply ways to specify the size and type of the boot disk you want.

The boot disk size can either be a single absolute value:

* ``total_boot_disk_size``: The size of the boot disk in GB (defaults to 10 GB)

or a per-CPU value:

* ``boot_disk_base_size``: The amount of boot disk in GB present before allocating additional space per vCPU (defaults to 0)
* ``boot_disk_per_cpu``: The amount of boot disk in GB per vCPU (defaults to 0)

or a per-task value:

* ``cpus_per_task``: The number of vCPUs per task (defaults to 1)
* ``boot_disk_base_size``: The amount of boot disk in GB present before allocating additional space per task (defaults to 0)
* ``boot_disk_per_task``: The amount of boot disk in GB per task (defaults to 0)

If more than one size is specified, the maximum of the values will be used. If no values are specified, a default appropriate to the provider will be used.

The boot disk type is provider-specific and can be a single type or a list of types:

* ``boot_disk_types``: The type(s) of the boot disk to allow (defaults to all available types for the provider)

Finally, some boot disk types require additional configuration:

* ``boot_disk_iops``: For any boot disk type that supports it, the number of provisioned IOPS to request; this is an absolute value and is not scaled by the number of vCPUs or tasks
* ``boot_disk_throughput``: For any boot disk type that supports it, the amount of provisioned throughput in MB/s to request; this is an absolute value and is not scaled by the number of vCPUs or tasks

.. _config_number_of_instances_options:

Options to constrain the number of instances
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Generally speaking, the system will attempt to use the maximum number of instances allowed based on the various ``max_`` constraints, and then will verify that the ``min_`` constraints have not been violated. Note that it is quite possible to over-constrain the system such that no number of instances meets the requirements.
As with the instance type constraints, no constraints are required, but it is recommended to specify at least some minimal constraints so that you can maintain control over the size of your instance pool and the resulting costs. By default, the maximum number of instances is set to 10 to avoid excessive instance pool sizes, and the maximum price is set to $10 per hour to avoid runaway costs, but these can be overridden by specifying different values. Note that depending on the provider and your account setup, you may have quotas for the creation of specific instance types, and Cloud Tasks may attempt to violate these quotas if you do not give it sufficient constraints.

* ``min_instances``: The minimum number of instances to use (defaults to 1)
* ``max_instances``: The maximum number of instances to use (defaults to 10)
* ``min_total_cpus``: The minimum total number of vCPUs to use
* ``max_total_cpus``: The maximum total number of vCPUs to use
* ``cpus_per_task``: The number of vCPUs per task (defaults to 1); this is also used to configure the worker process to limit the number of tasks that can be run simultaneously on a single instance
* ``min_tasks_per_instance``: The minimum number of tasks per instance
* ``max_tasks_per_instance``: The maximum number of tasks per instance
* ``min_simultaneous_tasks``: The minimum number of tasks to run simultaneously
* ``max_simultaneous_tasks``: The maximum number of tasks to run simultaneously
* ``min_total_price_per_hour``: The minimum total price per hour to use
* ``max_total_price_per_hour``: The maximum total price per hour to use (defaults to 10)

.. _config_vm_options:

Options to specify the type of VM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* ``use_spot``: Use spot instances instead of on-demand instances; spot instances are cheaper but may be terminated by the cloud provider with little notice and should only be used for fault-tolerant jobs

.. _config_boot_options:

Options to specify the boot process
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* A startup script must be specified when creating new instances. It can be specified either directly inline in the configuration file, or by providing a path to a file containing the startup script. Either one can be used, but not both.

  * ``startup_script``: The startup script to use (this cannot be overridden from the command line because it is assumed that any startup script would be too long to pass as a command line argument)
  * ``startup_script_file``: The path to a file containing the startup script

* ``image``: The image to use for the VM. If no image is specified, the default image for the provider will be used. This is most commonly the latest release of Ubuntu 24.04 LTS.

.. _config_worker_and_manage_pool_options:

Options to specify the worker and manage_pool processes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* ``scaling_check_interval``: The interval in seconds to check for scaling opportunities (defaults to 60)
* ``instance_termination_delay``: The delay in seconds to wait before terminating instances once the task queue is empty (defaults to 60); this should be set to a value much greater than ``max_runtime`` to avoid terminating instances that are still working on tasks.
* ``max_runtime``: The maximum runtime for a task in seconds (defaults to 60); this is used to set the retry timeout in the task queue, so any task that takes longer than this is assumed to have had an internal error. It should be set to a value significantly greater than the longest runtime expected for a task
* ``retry_on_exit``: If True, tasks will be retried if the worker exits prematurely, e.g. due to a crash
* ``retry_on_exception``: If True, tasks will be retried if the user function raises an unhandled exception
* ``retry_on_timeout``: If True, tasks will be retried if they exceed the maximum runtime specified by ``max_runtime``

.. _config_provider_specific_options:

Provider-Specific Options
-------------------------

The available provider-specific options are:

* All providers

  * ``job_id``: The ID of the job to run; required for all queue and job-related operations
  * ``queue_name``: The name of the task queue to use, derived from job ID if not specified; only use this in special circumstances
  * ``region``: The region to use, required for most operations; will be derived from the zone if not specified
  * ``zone``: The zone to use; if not specified, all zones in the region will be used
  * ``exactly_once_queue``: If True, task messages and events are guaranteed to be delivered exactly once to any recipient. If False (the default), messages will be delivered at least once, but could be delivered multiple times. The specific implications of this flag are provider-specific.

* AWS

  * ``access_key``: The access key to use
  * ``secret_key``: The secret key to use

* GCP

  * ``project_id``: The ID of the project to use; required for most operations
  * ``credentials_file``: The path to a file containing the credentials to use; if not specified, the default credentials will be used
  * ``service_account``: The service account to use; required for worker processes on cloud-based instances to have access to system resources

In addition, all run options can be specified in a provider-specific section, in which case they will override the global run options, if any.

Command Line Overrides
----------------------

You can specify or override any configuration value from the command line unless otherwise noted. Simply replace any ``_`` character with ``-``:

.. code-block:: bash

    # Each flag specifies or overrides the setting of the same name
    # (for example, --min-cpu sets min_cpu). The --instance-types values
    # restrict selection to the t3 and m5 instance families.
    python -m cloud_tasks run \
      --config config.yaml \
      --task-file tasks.json \
      --provider aws \
      --min-cpu 8 \
      --min-memory-per-cpu 16 \
      --total-boot-disk-size 100 \
      --image ami-0123456789abcdef0 \
      --job-id my-processing-job \
      --instance-types t3- m5-

.. note::

   The priority of settings is: Command Line > Provider-Specific Config > Global Run Config > System Defaults

You will be notified when overrides occur. For example:

.. code-block:: text

    run:
      min_cpu: 2
    gcp:
      min_cpu: 8

    2025-06-03 14:04:55.668 - cloud_tasks.common.config - WARNING - Overriding run.min_cpu=2 with gcp.min_cpu=8

or:

.. code-block:: text

    $ cloud_tasks manage_pool --config config.yml --min-cpu 16
    2025-06-03 14:04:33.848 - cloud_tasks.common.config - WARNING - Overloading run.min_cpu=2 with CLI=16

Examples
--------

The Simplest Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~

For GCP, the simplest configuration usable for all functions consists of a provider name, a job ID, a project ID, a region, and a startup script.

.. code-block:: yaml

    provider: gcp
    gcp:
      job_id: my-processing-job
      project_id: my-project-id
      region: us-central1
      startup_script: |
        #!/bin/bash
        echo "Hello, world!"

.. code-block:: bash

    $ cloud_tasks manage_pool --config config.yaml

Given the lack of :ref:`configuration options to constrain the instance type <config_compute_instance_options>`, the system will select the ``e2-highcpu-32`` instance type. This is the lowest-memory version of GCP's most economical instance type, costing $0.02475/vCPU/hour as of this writing.
It selects the 32-vCPU version, which is the maximum number of vCPUs available in a single instance for the ``e2`` family, because the cost of the boot disk (which is per-instance instead of per-vCPU) is amortized over the greatest number of vCPUs. However, the lack of :ref:`configuration options to constrain the number of instances <config_number_of_instances_options>` means the system will create the default maximum number of instances, 10, which will result in the creation of 320 vCPUs and a burn rate of $7.92/hour, which may be more than required depending on the actual workload. Note that in addition to the default maximum number of instances being 10, the default maximum total price per hour is $10.00, which is designed to limit the user's exposure to a high burn rate unless they explicitly ask for it.

With the exception of the startup script, this could also be specified entirely on the command line:

.. code-block:: yaml

    gcp:
      startup_script: |
        #!/bin/bash
        echo "Hello, world!"

.. code-block:: bash

    $ cloud_tasks manage_pool \
        --config config.yaml \
        --provider gcp \
        --job-id my-processing-job \
        --project-id my-project-id \
        --region us-central1

If the startup script were stored in a file, no configuration file would be needed at all:

.. code-block:: bash

    $ cloud_tasks manage_pool \
        --provider gcp \
        --job-id my-processing-job \
        --project-id my-project-id \
        --region us-central1 \
        --startup-script-file startup.sh

Constraining the Instance Type and Containing Costs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This example uses more sophisticated constraints to limit the instance types and number of instances to use. First, we want to use slightly higher-performance processors and choose the ``n`` series using a balanced persistent boot disk. We want to limit instance types to those that have at least 8 but not more than 40 vCPUs; we might choose these numbers to balance parallelism with the network and disk bandwidth available on a single instance.
At the same time, we know that our tasks are themselves parallel internally, and require 4 vCPUs per task for optimal performance. They also require memory of at least 32 GB per task. Finally, since we have a large number of tasks to process but our task code is still experimental, we are concerned about starting too many instances at once and thus having a high burn rate in case something goes wrong and we want to stop the job in the middle when we detect a problem. We set limits of 20 instances total, 100 simultaneous tasks, and a burn rate of $15.00 per hour. Whichever of these is most constraining will determine the total number of instances that will be started.

.. code-block:: yaml

    provider: gcp
    gcp:
      job_id: my-processing-job
      project_id: rfrench
      region: us-central1
      instance_types: ["^n2-.*", "^n3-.*", "^n4-.*"]
      min_cpu: 8
      max_cpu: 40
      cpus_per_task: 4
      min_memory_per_task: 32
      max_instances: 20
      max_simultaneous_tasks: 100
      max_total_price_per_hour: 15.00
      boot_disk_types: pd-balanced
      startup_script: |
        #!/bin/bash
        echo "Hello, world!"

In this case, the system starts by looking at all available ``n2-``, ``n3-``, and ``n4-`` instance types that meet our vCPU and memory constraints while minimizing price per vCPU. This results in the selection of ``n4-highmem-32`` as the optimal instance type with the lowest cost of $0.0622/vCPU/hour while supporting the most vCPUs in a single instance.

For the number of instances, the system starts with the maximum allowed, 20. However, each 32-vCPU instance runs 32 / 4 = 8 tasks at a time, so the limit of 100 simultaneous tasks reduces this to 12 instances (96 simultaneous tasks). Finally, at a cost of $1.99/hour for each instance, the price limit of $15.00 per hour sets the final number of instances to 7 for a total cost of $13.93/hour.
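
The instance-count arithmetic in this example can be reproduced with a short Python sketch. It is an illustration only, not the algorithm Cloud Tasks implements: the function ``instance_count`` is hypothetical, and it assumes the per-instance price is simply the vCPU count times the per-vCPU price (ignoring boot disk and any other charges), which matches the $1.99/hour figure quoted above.

```python
import math


def instance_count(
    vcpus_per_instance: int,
    price_per_vcpu_hour: float,
    cpus_per_task: int,
    max_instances: int,
    max_simultaneous_tasks: int,
    max_total_price_per_hour: float,
) -> int:
    """Return the largest instance count allowed by all three limits (hypothetical helper)."""
    # How many tasks fit on one instance at a time.
    tasks_per_instance = vcpus_per_instance // cpus_per_task
    # Limit implied by the cap on simultaneous tasks.
    by_tasks = max_simultaneous_tasks // tasks_per_instance
    # Limit implied by the hourly price cap (assumes price scales with vCPUs only).
    price_per_instance = vcpus_per_instance * price_per_vcpu_hour
    by_price = math.floor(max_total_price_per_hour / price_per_instance)
    # The most-constraining limit wins.
    return min(max_instances, by_tasks, by_price)


n = instance_count(32, 0.0622, 4, 20, 100, 15.00)
print(n, round(n * 32 * 0.0622, 2))  # 7 13.93
```

With these inputs the task limit allows 12 instances and the price limit allows 7, so the price limit is the binding constraint.
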