Capsule vs Pipeline Differences
Although Capsules generally function the same when run in a Pipeline, there are a few important differences which are outlined below.
When running a Capsule, Data Assets are mounted directly into the Capsule's data folder. In a Pipeline, by contrast, Internal Data Assets are mounted to a /tmp directory and made available in the Capsule's data folder via a symbolic link, as in the following line from a Pipeline's main.nf:
ln -s "/tmp/data/reads/$path1" "capsule/data/$path1"
This means that certain commands that work in a Capsule may not work in a Pipeline if they do not follow symbolic links. For example, a find command that lists the full path to every file in the data folder returns the expected output in a Capsule but returns nothing when the same Capsule runs in a Pipeline. In this case the -L flag must be used, which tells find to follow symbolic links.
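A minimal sketch (the /data path is one of the options described below; adjust it to however your code references the data folder):
find /data -type f      # lists every file in a Capsule, but returns nothing in a Pipeline
find -L /data -type f   # -L follows the symbolic links, so it works in both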
Selecting an AWS IAM role is also mandatory when using External Data in a Pipeline.
In a Capsule there are a variety of ways to reference the data folder, i.e. /data, ../data from the code folder, or /root/capsule/data. In a Pipeline, /root does not exist, so /data or ../data must be used.
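For example (a sketch; the output depends on the attached Data Assets):
ls ../data              # works in both a Capsule and a Pipeline
ls /data                # works in both a Capsule and a Pipeline
ls /root/capsule/data   # works in a Capsule only; /root does not exist in a Pipeline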
The Map Paths menu can be used to improve Pipeline performance by ensuring only necessary data is copied. In this example, only the 494.54 MB image_volumes folder will be copied to the Capsule instance instead of the entire 450.21 GB abc_atlas Data Asset.
In contrast, Internal Data Assets are optimized to provide improved performance. The underlying EFS storage is mounted directly to the AWS Batch machine(s) running the computation, allowing Internal Data Assets to be used without any data copying. As such, it’s a best practice to use Internal Data Assets for data that remains unchanged across Pipeline runs.
When accessing AWS resources in a Capsule, you can attach AWS Cloud Credential secrets. In a Pipeline, AWS secrets attached to the Capsule are ignored; instead, you must select an IAM role for the Pipeline. This is because only one set of AWS credentials can be used in a Pipeline, and it must have access to the Code Ocean managed AWS Batch resources as well as the AWS resources the Capsule is attempting to access.
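For example (a sketch; the bucket name is hypothetical), the same command can run in both contexts, using the attached AWS secret in a Capsule and the selected IAM role in a Pipeline, provided the credentials in use grant access to the bucket:
aws s3 ls s3://example-bucket/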
In a Capsule, an External Data Asset is mounted to the data folder using s3fs. This allows you to interact with the underlying S3 bucket as if it were a local filesystem, without any data duplication. Nextflow does not yet support the use of s3fs, so in a Pipeline, the contents of the External Data Asset are copied at runtime to the AWS Batch machine(s) running the computation.
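In both cases the asset's contents appear under the data folder, so the same code works unchanged (a sketch; the asset and file names are hypothetical):
head ../data/my_external_asset/sample.csv   # read via s3fs in a Capsule, from the runtime copy in a Pipeline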