Principles for Greatest Success (and Least Frustration)
Use these principles to ensure your worker code will function as expected, is observable, and can be debugged:
Keep the function stateless; there should be no communication between tasks.
Handle all exceptions at the top level; if the program has to exit, be sure to return the appropriate status.
Return informative result messages to ease observation and debugging.
Clean up temporary files.
Release any system resources that will not automatically be released on process termination.
Log errors with sufficient context, such as complete tracebacks.
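The principles on exception handling, result messages, cleanup, and logging can be combined in one small wrapper. The following is a minimal sketch, not the cloud_tasks API: the function names process_task and do_work, the (success, message) return shape, and the task_data layout are all hypothetical and chosen only for illustration.

```python
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("worker")


def do_work(task_data, scratch_dir):
    """Hypothetical task body: sums task_data['values'] using a scratch file."""
    total = sum(task_data["values"])
    (Path(scratch_dir) / "partial.txt").write_text(str(total))
    return total


def process_task(task_id, task_data):
    """Run one task without letting any exception escape.

    Returns a hypothetical (success, message) pair; adapt the shape to
    whatever your worker framework expects.
    """
    try:
        # TemporaryDirectory deletes the scratch files even if do_work raises.
        with tempfile.TemporaryDirectory(prefix=f"task-{task_id}-") as scratch_dir:
            total = do_work(task_data, scratch_dir)
        count = len(task_data["values"])
        return True, f"task {task_id}: summed {count} values, total={total}"
    except Exception as exc:
        # logger.exception records the message plus the complete traceback.
        logger.exception("task %s failed", task_id)
        return False, f"task {task_id}: {type(exc).__name__}: {exc}"
```

Because the scratch directory is managed by a context manager, cleanup does not depend on the task succeeding, and the returned message identifies both the task and the failure type.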
Test your worker code in a local environment before attempting to run it in the cloud. To do this, run the worker code directly with python <my_worker.py>, using command-line options to specify the needed parameters, including a local task file:
python my_worker.py --task-file my_tasks.json
You can also simulate a spot termination notice and subsequent forced shutdown of the compute instance.
When first testing in the cloud, use a single, cheap instance that runs only a single task at a time. Specifying a small number of CPUs will automatically choose the cheapest option:
cloud_tasks run --task-file my_tasks.json --max-cpu 1 --max-instances 1
When developing the startup script, use the console logging system for the given provider to watch the commands being executed to make sure the script is executing as expected.
If you are copying data to the instance to be shared among all tasks, either do it during the startup script or allow your tasks to do it on-demand in a multi-processor-safe manner.
If you want to preload large amounts of data, you can instead create a custom boot image that already has the data loaded on the boot disk.
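One way to copy shared data on demand, so that concurrent task processes never see a partially written file, is to copy to a temporary name and then rename atomically. This is a sketch under stated assumptions: ensure_shared_copy is a hypothetical helper, the source is a local file, and the temporary file lives on the same filesystem as the destination (which os.replace needs in order to be atomic).

```python
import os
import shutil
import tempfile
from pathlib import Path


def ensure_shared_copy(source, dest):
    """Make sure dest exists as a complete copy of source.

    Hypothetical helper. Several processes may call this at once; each
    copies to its own temporary file, and os.replace publishes the result
    atomically, so readers see either no file or a complete file.
    """
    source, dest = Path(source), Path(dest)
    if dest.exists():
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    # A unique temporary name in the destination directory keeps the
    # final rename on a single filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, prefix=dest.name + ".", suffix=".tmp")
    os.close(fd)
    try:
        shutil.copyfile(source, tmp_name)
        os.replace(tmp_name, dest)
    finally:
        # Only present if the copy or rename failed.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
    return dest
```

The trade-off in this design is that two processes arriving at the same moment may both perform the copy; the result is still correct, but for very large files a lock or a copy in the startup script avoids the duplicated work.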
When writing results to a file, be sure to separate them in a way that makes them specific to the current task. The FileCache package <https://rms-filecache.readthedocs.io/> is a good way to do this, using the unique Task ID as the name of the cache.
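The idea of keeping each task's output separate can be illustrated with the standard library alone. This sketch does not use or reproduce the FileCache API; result_path and write_result are hypothetical helpers, and it assumes the task ID is safe to use as a directory name.

```python
import json
from pathlib import Path


def result_path(output_root, task_id, filename="result.json"):
    """Build an output path that is unique to one task (hypothetical layout)."""
    return Path(output_root) / str(task_id) / filename


def write_result(output_root, task_id, result):
    """Write a JSON-serializable result under the task's own directory."""
    path = result_path(output_root, task_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return path
```

With this layout, two tasks running at the same time on one instance write to different directories, so neither can overwrite the other's output.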
Always specify one or more boot disk types that you are willing to use. If you don’t, the cheapest type will be chosen, which may be a slow HDD.