To follow along with this tutorial, you should have working knowledge of the Unix command line and Bash. If you are not familiar with these, we have included links to several free tutorials below, listed in ascending order of time investment.
- Unix Basics for Bioinformatics
- Just Enough Unix for Bioinformatics
- Command Line Bootcamp
- Advanced Bash-Scripting Guide
You should also be somewhat familiar with Amazon Web Services (AWS), and have an AWS account. If you do not feel comfortable with using AWS, we recommend that you consider reading through this tutorial. An Azure account may also be used. You can learn more about using Azure here.
To run Aether, you will need a copy of Python (Version 2.7). If you have not already installed Python, we recommend installing Anaconda, which comes with Python and several core packages. Instructions for installing Anaconda are available here.
If you have Anaconda, dependencies for Aether are handled automatically during the installation process.
If you do not use Anaconda, you will have to install dependencies manually. This is not recommended. These dependencies are:
If you would like to build the documentation, you will also have to install MkDocs.
Configuration and Setup¶
In this section, you will find a detailed explanation of how to install or uninstall Aether. We also describe how to build the documentation page on your own machine.
To install Aether, you first need to acquire a copy of it from GitHub. To do this, you may either clone the git repository, or download an archive. To clone the git repository, run the following in the command line:
git clone git@github.com:kosticlab/aether.git
Alternatively, you can download an archive of Aether here. To unpack the archive, run the following:
tar -xvf aether-master.tar.gz
mv aether-master aether
Once you have acquired a copy of Aether, open the directory for Aether. We herein refer to this directory, or folder, as the "Aether directory". If you have Anaconda installed, simply run:
If you do not have Anaconda installed, instead run:
If you would like to build the documentation for Aether on your machine, run the following commands while in the Aether directory:
Note that some additional instructions for viewing the docs will be printed to your console when you run this command.
If you would like to uninstall Aether from your machine, simply run the following commands while in the Aether directory:
Once the Aether package has been uninstalled, simply delete the Aether directory that you downloaded from GitHub.
This section contains a detailed overview of Aether and how it functions. It also contains explanations of the various data used by Aether.
Execution Modes for Aether¶
Aether can be run in several different modes. The three primary modes are interactive mode, non-interactive mode, and "dry run" mode. The interactive mode will prompt the user for the various inputs, whereas the non-interactive mode will pull this information from the command line arguments. Lastly, the "dry run" mode will show the user what resources Aether would have bid on, as well as their cost. However, "dry run" mode will not actually run anything on the cloud, and thus will not result in any cloud resources being used. Information about running Aether with these modes is located below.
When Aether is run in interactive mode, it will prompt you for the various program parameters. Because Aether's optimization method requires a number of parameters, the interactive mode is highly recommended for new users. To run interactive mode, simply run the following in the Aether directory:
aether --interactive [ARGS]
aether -I [ARGS]
The non-interactive mode will not prompt the user for any input information. Non-interactive mode is generally not recommended for new users. To run Aether in non-interactive mode, simply run the following in the Aether directory:
The dry-run mode will show the user what bids Aether suggests, but will not use any cloud resources. To run Aether in dry-run mode, simply run the following in the Aether directory:
aether --dry-run [ARGS]
Note that you can run dry-run mode interactively by adding the -I argument to this command.
Command Line Arguments¶
As always, additional information about command line arguments for Aether may be found by running:
This will output the following:
    Usage: aether [OPTIONS]

      The Aether Command Line Interface

    Options:
      -I, --interactive             Enables interactive mode.
      --dry-run                     Runs Aether in dry-run mode. This shows what
                                    cloud computing resources Aether would use,
                                    but does not actually use them or perform
                                    any computation.
      -A, --input-file TEXT         The name of a text file, wherein each line
                                    corresponds to an argument passed to one of
                                    the distributed batch jobs.
      -L, --provisioning-file TEXT  Filename of the provisioning file.
      -P, --processors INTEGER      The number of cores that each batch job
                                    requires
      -M, --memory INTEGER          The amount of memory, in Gigabytes, that
                                    each batch job will require.
      -N, --name TEXT               The name of the project. This should be
                                    unique, as an S3 bucket is created on
                                    Amazon for this project, and they must have
                                    unique names.
      -E, --key-ID TEXT             Cloud CLI Access Key ID.
      -K, --key TEXT                Cloud CLI Access Key.
      -R, --region TEXT             The region/datacenter that the pipeline
                                    should be run in (e.g. "us-east-1").
      -B, --bin-dir TEXT            The directory with applications runnable on
                                    the cloud image that are dependencies for
                                    your batch jobs. Paths in your scripts must
                                    be reachable from the top level of this
                                    directory.
      -S, --script TEXT             The script to be run for every line in
                                    input-file and distributed across the
                                    cluster.
      -D, --data TEXT               The directory of any data that the job
                                    script will need to access.
      --help                        Show this message and exit.
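As a rough illustration of how these options combine, the sketch below assembles a non-interactive dry-run invocation. Only the flag names come from the help text above. Every value (file names, project name, region, resource sizes) is a hypothetical placeholder, and whether a given run needs every option is an assumption. The credentials are read from the standard AWS CLI environment variables, which is also an assumption about your setup, and the final command is left commented out so that the snippet does not contact the cloud.

```shell
#!/bin/bash
# All values below are hypothetical placeholders; only the flag names
# are taken from the help output.
aether_args=(
    --dry-run                   # show planned resources without using them
    --input-file args.txt       # one batch-job argument per line
    --script job.sh             # script run once per line of the input file
    --bin-dir ./bin             # dependencies uploaded for the jobs
    --data ./data               # data that the job script needs
    --processors 2              # cores per batch job
    --memory 4                  # gigabytes of memory per batch job
    --name my-unique-project    # also used as the S3 bucket name
    --region us-east-1
)

# Print the assembled command for review (credentials are not printed).
printf 'aether'
printf ' %q' "${aether_args[@]}"
printf '\n'

# Uncomment to run it, supplying the credentials at call time:
# aether "${aether_args[@]}" \
#     --key-ID "$AWS_ACCESS_KEY_ID" --key "$AWS_SECRET_ACCESS_KEY"
```

Keeping the options in an array makes it easy to drop --dry-run once the proposed resources look reasonable.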
Finding Account Information Required to Run Aether¶
Not all of the data that Aether needs to run can be securely accessed automatically. In particular, we do not access your private AWS account information, and instead require the user to input this information. We provide details on how to find this data in the sections below.
Access Key Information¶
Instructions for locating your AWS Access Key ID and Access Key can be found here.
Instance Limits Information¶
When you run Aether, you will be prompted for some information on account limits, as AWS does not allow them to be programmatically retrieved.
In the .gif below, we show where to access this information on the AWS website.
These account limits are automatically saved in the instances.p file, and may be entered into the bidder on subsequent runs to save time.
Once you have entered account limits into Aether, it will begin solving the multi-objective optimization problem of selecting the optimal bidding strategy. This is a computationally intensive process, during which Aether is iteratively performing a number of high-dimensional convex optimizations. On a lightweight computer (e.g., an older laptop), this may impact performance of other programs that are running.
Cloud Resource Provisioning Information¶
After running Aether, you will find that it has generated a new file, named prov.psv. This file contains the provisioning information for the batch processing pipeline, which is the second component of Aether.
We turn now to the details of prov.psv. Each line in prov.psv is a list delimited by the vertical bar character. In order, the columns represent TYPE, PROCESSORS, RAM, STORAGE, and BIDDABLE, which we explain in the table below.
| Column | Description |
|---|---|
| TYPE | The type of instance that is being requested. This is a name defined by the provider. |
| PROCESSORS | The number of processors that are available to this type of instance. |
| RAM | The amount of Random Access Memory (RAM) that is available to this type of instance. |
| STORAGE | Whether there is ephemeral storage available to this instance during use. |
| BIDDABLE | Whether this instance should be bid upon or purchased at its standard market price. |
As an example, if prov.psv contains the following:
Then Aether would instruct the batch processing pipeline to spin up two c4.large instances, purchasing one at market cost and bidding on the second one.
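To make the column layout concrete, here is a minimal sketch that summarizes a provisioning file with awk. The sample rows are hypothetical: the processor and memory figures follow the published c4.large specification, but the exact spelling of the STORAGE and BIDDABLE values (written here as true and false) is an assumption, so check a file generated by Aether before relying on it. The function name summarize_prov is likewise hypothetical.

```shell
#!/bin/bash
# summarize_prov [FILE]
# Reads rows of the form TYPE|PROCESSORS|RAM|STORAGE|BIDDABLE from FILE
# (or from standard input) and prints instance, core, and bid counts.
summarize_prov() {
    awk -F'|' '
        NF == 5 {
            n++
            cores += $2
            if ($5 == "true") bid++    # assumed spelling of a biddable row
        }
        END {
            printf "instances=%d cores=%d bid=%d market=%d\n", n, cores, bid, n - bid
        }
    ' "$@"
}

# Hypothetical sample: one instance at market price, one bid upon.
summarize_prov <<'EOF'
c4.large|2|3.75|false|false
c4.large|2|3.75|false|true
EOF
```

To inspect a real file, call the function with its path, for example summarize_prov prov.psv.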
As an aside, it is possible to run the batch processing pipeline without the bidder if you already have a prov.psv file (e.g., one that you generated manually or are reusing).
To do so, supply it through the -L (--provisioning-file) option, using the name of your own .psv file (e.g., YourFileHere.psv) that contains the provisioning information.
Job Run-Time Environment and Resources Available by Default¶
Each time you submit a batch of jobs to Aether, it provides some information about the environment to the job that is being executed.
An example of this is seen below.
A copy of this template is also located in
The bin folder that was uploaded to the cloud should be available to the job script at bin/.
Note that this is not the global /bin, but is instead relative to the job script's execution location.
    #!/bin/bash
    PASSEDARGUMENT=$1
    PROCS=$2
    MEMFRAC=$3
    PRIMARYHOST=$4
    OUTPUTFOLDER=$5
    LOCALIP=$6

    # Code goes here (likely includes uploading of results to s3 bucket)
    # Access uploaded scripts, programs, and binaries e.g. "python
    # bin/uploaded_script_name.py" or "bin/upload_script_name.sh"

    # This script should always conclude with the line below in order to
    # tell the master node to assign a new job.
    bin/terminate_job.sh $6 $1 $4
Details about these parameters are shown in the table below.
| Parameter | Description |
|---|---|
| PASSEDARGUMENT | The string representing an argument for an individual batch job. |
| PROCS | The number of processors that this task can safely utilize. |
| MEMFRAC | The memory fraction of the current node that this task can safely utilize. |
| PRIMARYHOST | The static IP address of the node controlling batch processing. Will be notified upon task completion. |
| OUTPUTFOLDER | The location in cloud file system to store the job outputs. |
| LOCALIP | The local IP of this machine. |
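As one way to use PROCS and MEMFRAC inside a job script, the sketch below turns the memory fraction into a whole-gigabyte budget. It assumes a Linux node, where /proc/meminfo reports MemTotal in kB, and it assumes that MEMFRAC is a decimal fraction such as 0.25. The helper names mem_budget_gb and node_total_kb, and the tool in the commented usage, are hypothetical.

```shell
#!/bin/bash
# mem_budget_gb FRACTION TOTAL_KB
# Prints FRACTION of TOTAL_KB, rounded down to whole gigabytes (minimum 1).
mem_budget_gb() {
    awk -v frac="$1" -v total_kb="$2" 'BEGIN {
        gb = int(frac * total_kb / 1024 / 1024)
        if (gb < 1) gb = 1
        print gb
    }'
}

# Total memory of this node in kB (Linux only).
node_total_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

# Inside a job script, where PROCS=$2 and MEMFRAC=$3:
# MEM_GB=$(mem_budget_gb "$MEMFRAC" "$(node_total_kb)")
# some_tool --threads "$PROCS" --memory-gb "$MEM_GB"   # some_tool is a placeholder
```

Rounding down keeps concurrent tasks on the same node from jointly exceeding the memory that is physically available.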
Finally, note that any item in the DATA folder may be downloaded from the cloud file system (e.g., S3) during job run-time using the AWS or Azure Command Line Interface. Because the S3 bucket has the same name as the project name you chose, these locations can be written into the input arguments ahead of time.
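Since the bucket name equals the project name, the S3 location of a data item can be computed before any job runs. The sketch below assumes that uploaded data lives under a data/ prefix, as in the s3://nameofproject/data/ paths used in the metagenome assembly example of this document. The helper name and the sample file name are hypothetical, and the download, which uses the AWS CLI's aws s3 cp command, is left commented out.

```shell
#!/bin/bash
# data_uri PROJECT FILE
# Prints the S3 URI of FILE in the project's bucket.
# The data/ prefix is an assumption based on the example paths in this document.
data_uri() {
    printf 's3://%s/data/%s\n' "$1" "$2"
}

data_uri nameofproject samplexpe1.fastq.gz

# In a job script, fetch the item before processing it:
# aws s3 cp "$(data_uri nameofproject samplexpe1.fastq.gz)" .
```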
Accessing Data Storage During Job Run-Time¶
To add storage to a batch node for jobs that require large amounts of disk space, run:
bin/add_ebs_storage.sh SIZE DEVMNT
Details about these parameters are shown in the table below.
| Parameter | Description |
|---|---|
| SIZE | The number representing the size in GB of the SSD to attach. |
| DEVMNT | The location in /dev to be created and used to mount the disk. |
This utility should be run from within the batch job processing script from the previous section, likely including a check to see whether or not the desired disk has been previously mounted.
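The mount check mentioned above could look like the following sketch. It assumes a Linux node, where /proc/mounts lists one mounted filesystem per line with the device in the first field, and it assumes that DEVMNT is passed as a device path such as /dev/xvdf. Both the device name and the 100 GB size are hypothetical, and the exact form of DEVMNT that add_ebs_storage.sh expects should be checked against the script itself.

```shell
#!/bin/bash
# is_mounted DEVICE [MOUNTS_FILE]
# Succeeds if DEVICE appears as the first field of a line in MOUNTS_FILE
# (default: /proc/mounts on Linux).
is_mounted() {
    local device=$1 mounts=${2:-/proc/mounts}
    awk -v dev="$device" '$1 == dev { found = 1 } END { exit !found }' "$mounts"
}

# Inside the job script (hypothetical size and device):
# if ! is_mounted /dev/xvdf; then
#     bin/add_ebs_storage.sh 100 /dev/xvdf
# fi
```

Because several tasks may run on the same node, guarding the call this way avoids requesting a second disk when an earlier task has already attached one.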
We have included several examples of using Aether below. Because distributed systems are inherently complex and nondeterministic, we strongly recommend reading through these examples before running Aether on your own.
Example: Basic Use¶
In examples/basic there exist args.txt and test.sh. If the batch processing script is run in interactive mode with args.txt as the input file argument and test.sh as the script argument, the bidder will be run. On the distributed compute that is spun up, each replica node will, upon receipt of a task, write the passed line from args.txt to a file, wait for 30 seconds, upload the file to S3, and then communicate to the primary node that the job is complete and that another job can begin. If there are no new jobs, the instances terminate themselves.
Output can be accessed in an S3 bucket that bears the same name as the project passed in via the name argument, e.g. s3://projectname. Logs that are automatically generated are also placed in this bucket.
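The behaviour described above can be sketched as a job script like the one below. This is not the contents of test.sh itself, only an illustration that follows the job template shown earlier in this document. It assumes that OUTPUTFOLDER is an S3 URI, that the AWS CLI is installed on the node, and that the passed line is safe to use as a file name. The output file name and the DRY_RUN switch are hypothetical additions that let you inspect the cloud-facing commands without running them.

```shell
#!/bin/bash
# run_cmd prints the command when DRY_RUN=1, otherwise executes it.
run_cmd() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# basic_job takes the same six positional arguments as the job template.
basic_job() {
    local passed=$1 primary=$4 outfolder=$5 localip=$6
    local outfile="${passed}.txt"            # hypothetical file name

    echo "$passed" > "$outfile"              # write the passed line to a file
    run_cmd sleep 30                         # wait for 30 seconds
    run_cmd aws s3 cp "$outfile" "$outfolder/$outfile"            # upload to S3
    run_cmd bin/terminate_job.sh "$localip" "$passed" "$primary"  # request next job
}

# In a real job script the last line would be:
# basic_job "$@"
```

Running the function with DRY_RUN=1 on your own machine is a cheap way to confirm the argument order before any instances are started.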
Example: Assembling Metagenomes (in paper)¶
This is the example use of the pipeline presented in the paper, and it demonstrates more complex uses. This example is located in examples/metagenome_assembly. To run this example, place assemble_sample.py in a folder with built versions of megahit, and pass the location of this folder as the bin-dir argument. Pass batch_script.sh, which is essentially just a wrapper that runs the Python script in bin, as the script argument. Pass the location of a folder containing fastq files of paired-end metagenomic reads as the data argument. Finally, for the input file argument, pass a text file in which each line contains the S3 locations (parameterized by the name given to the project) of two matching paired-end read files separated by a comma, e.g. s3://nameofproject/data/samplexpe1.fastq.gz,s3://nameofproject/data/samplexpe2.fastq.gz. For the analysis in the paper, 1572 samples were assembled, but this script can be used on any number of samples.
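Writing this input file by hand for many samples is error-prone, so one option is to generate it from the local data folder. The sketch below assumes that the two files of a pair share a prefix and end in pe1.fastq.gz and pe2.fastq.gz, and that uploaded data appears under s3://PROJECT/data/. Both assumptions are inferred from the example path above, and the function name is hypothetical.

```shell
#!/bin/bash
# make_input_lines PROJECT DATA_DIR
# Prints one "s3://.../Xpe1.fastq.gz,s3://.../Xpe2.fastq.gz" line per sample
# whose two read files are both present in DATA_DIR.
make_input_lines() {
    local project=$1 data_dir=$2 first second
    for first in "$data_dir"/*pe1.fastq.gz; do
        [ -e "$first" ] || continue          # no matches: the glob stays literal
        second="${first%pe1.fastq.gz}pe2.fastq.gz"
        if [ -e "$second" ]; then
            printf 's3://%s/data/%s,s3://%s/data/%s\n' \
                "$project" "$(basename "$first")" \
                "$project" "$(basename "$second")"
        else
            echo "skipping $(basename "$first"): no matching pe2 file" >&2
        fi
    done
}

# Example: make_input_lines nameofproject ./data > input.txt
```

Samples with a missing mate are reported on standard error and left out, so the resulting file only contains complete pairs.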
This section discusses several of Aether's useful but advanced features. Many of them are less straightforward or more involved than those discussed thus far, so we recommend that you first read through the documentation above.
Add A Microsoft Azure Machine To A Running Batch Processing Pipeline¶
To add a Microsoft Azure node to a currently executing Aether pipeline, run bin/az.sh [args], where [args] are the following 21 arguments in sequential order:
| Position | Argument |
|---|---|
| 3 | Azure Subscription ID |
| 4 | Azure Resource Group Name |
| 6 | Azure Public IP Identifier |
| 7 | Azure DNS Identifier |
| 8 | Azure Virtual Network Name |
| 9 | Azure Subnet Name |
| 10 | Azure NIC Name |
| 11 | Azure Virtual Machine Name |
| 12 | AWS CLI Key ID |
| 13 | AWS CLI Key |
| 15 | Batch Jobs Azure VM Can Handle Concurrently |
| 16 | Processors Needed Per Batch Job |
| 17 | Fraction Of Node Memory That One Task Will Utilize |
| 18 | S3 Location Where Output Should Be Uploaded To |
| 19 | Static IP Of Primary Node |
| 20 | For Compatibility With Scripts For AWS Always Make This Argument "false" |
| 21 | Type Of Azure VM To Spin Up |
Add Your Local Machine To A Running Batch Processing Pipeline¶
Please note that it is a requirement that your hardware has a static IP address and is not being operated under any sort of scheduler.
To add your own hardware to a currently executing Aether pipeline, run bin/local.sh [args], where [args] are the following 9 arguments in sequential order:
| Position | Argument |
|---|---|
| 1 | AWS CLI Key ID |
| 2 | AWS CLI Key |
| 4 | Batch Jobs Your Hardware Can Handle Concurrently |
| 5 | Processors Needed Per Batch Job |
| 6 | Fraction Of Node Memory That One Task Will Utilize |
| 7 | S3 Location Where Output Should Be Uploaded To |
| 8 | Static IP Of Primary Node |
| 9 | For Compatibility With Scripts For AWS Always Make This Argument "false" |