Map Paths
The Map Paths menu provides control over the flow of data from Data Asset to Capsule and between Capsules. It can be opened by clicking the gear icon ⚙️ on any connection in the Pipeline.
From the Map Paths menu, the source and destination paths can be changed and a Connection Type can be selected. Configuring this menu properly will ensure that each Capsule receives the necessary data and that the Pipeline is optimized for parallelization.
This section covers source and destination paths, the available Connection Types (Default, Collect, and Flatten), and related options such as Generate indexed folders and flattening directory structure with **.
Users can customize the flow of data in a Pipeline by specifying source and destination paths. Source paths define which files should be transferred to the destination Capsule, while destination paths specify where these files should be stored in the destination Capsule. Multiple mappings can be used to provide additional configuration.
For example, Capsule A generates many different types of files in its /results folder. Using the following source and destination mappings, all files with the extension .zip will be sent to a folder called /zip_files inside Capsule B's /data folder, and all files with the extension .html will be sent to a folder called /html_files inside Capsule B's /data folder. Any file without a .zip or .html extension will be ignored by Capsule B.
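The routing behavior described above can be sketched with shell-style glob matching. This is a conceptual illustration only; route_file and MAPPINGS are hypothetical names, not part of the platform:

```python
from fnmatch import fnmatch

# Hypothetical source -> destination mappings mirroring the example above:
# files matching each pattern in Capsule A's /results are routed to the
# corresponding subfolder of Capsule B's /data; unmatched files are ignored.
MAPPINGS = [
    ("*.zip", "/data/zip_files"),
    ("*.html", "/data/html_files"),
]

def route_file(filename):
    """Return the destination folder for a result file, or None if ignored."""
    for pattern, dest in MAPPINGS:
        if fnmatch(filename, pattern):
            return dest
    return None

print(route_file("report.zip"))    # -> /data/zip_files
print(route_file("summary.html"))  # -> /data/html_files
print(route_file("notes.txt"))     # -> None (ignored by Capsule B)
```

Adding further (pattern, destination) pairs to the mapping list corresponds to adding more rows in the Map Paths menu.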
Default
Data Asset to Capsule: each item will be distributed to a parallel instance of the Capsule.
Capsule to Capsule: a destination Capsule instance will be executed for every instance of the source Capsule.
Items may be passed in a different order than they appear in the Data Asset.
Collect
Data Asset to Capsule: the entire Data Asset will be available to all parallel instances of the destination Capsule.
Capsule to Capsule: all of the source data will be available to all parallel instances of the destination Capsule.
If the destination Capsule has only one input (Data Asset or Capsule) and the Connection Type is Collect, there will be only one instance of the destination Capsule.
If the source data consists of a single item that is needed by all instances of the destination Capsule, use Collect; otherwise only one instance of the destination Capsule will receive the source data.
Collect was formerly called Global.
Flatten
Capsule to Capsule: the source data will be split so that every item is passed separately into a parallel instance of the destination Capsule.
Items may be passed in a different order than they appear in the Data Asset.
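The three Connection Types can be sketched as simple distribution functions. These are conceptual illustrations only (folders are modeled as lists of files), and the function names are not platform APIs:

```python
def default_instances(items):
    """Default: one destination instance per source item."""
    return [[item] for item in items]

def collect_instances(items, n_instances=1):
    """Collect: every destination instance sees all of the source data."""
    return [list(items) for _ in range(n_instances)]

def flatten_instances(items):
    """Flatten: folders (modeled here as lists) are split so every file
    gets its own destination instance."""
    files = []
    for item in items:
        files.extend(item if isinstance(item, list) else [item])
    return [[f] for f in files]

items = ["a.txt", "b.txt", ["folder_file1.txt", "folder_file2.txt"]]
print(default_instances(items))   # 3 instances, one item each (folder intact)
print(collect_instances(items))   # 1 instance with all items
print(flatten_instances(items))   # 4 instances, one file each
```

Note that Default keeps a folder together as one item, while Flatten breaks it apart into individual files.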
This example shows how items from two Data Assets will be distributed across parallel instances of a Capsule.
The left side shows the Pipeline schematic, where two Data Assets are connected to a single Capsule. Each Data Asset contains 3 items: nums contains 3 files, and alpha contains 2 files and 1 folder.
On the right is a table showing the distribution of input items across parallel Capsule instances for each connection type combination.
When the connection type is set to Default, items within the Data Asset are distributed to parallel instances of the Capsule regardless of the Data Asset type. When the connection type is set to Collect, items in internal and external Data Assets are distributed differently:
Internal: the entire Data Asset is passed to the Capsule's data folder. In this example, Capsule A will receive internal/file1.txt in its data folder.
External: only the items in the Data Asset are passed to the Capsule's data folder. In this example, Capsule B will receive just file2.txt in its data folder.
This example shows how results generated by a source Capsule (Capsule A) affect the execution of a destination Capsule (Capsules B-F) when different connection types are used. It also shows how source mappings can be used in combination with connection types to further customize the Pipeline execution.
In this example, there are 3 parallel instances of Capsule A, which each produce the same output. The number of Capsule icons represents the number of parallel instances.
When many Capsules are connected to the Results Bucket, it's helpful to write each Capsule's output to a uniquely named folder. This can be achieved by opening the Map Paths menu and adding a folder name to the destination path.
For example, Capsule B's results will be written to a folder called Capsule_B:
If there are multiple instances of Capsule B each producing results with the same name, the Pipeline will fail unless Generate indexed folders is used.
The Map Paths menu between a Capsule and the Results Bucket has a Generate indexed folders switch. By default, files will be written to the Results Bucket in the same way they're written to the Capsule's /results folder. Turning on Generate indexed folders will write files to a unique folder for each instance of the source Capsule.
For example, if a Data Asset containing four folders is passed to the Capsule with the Default connection type, and the Capsule's results are passed to the Results Bucket with Generate indexed folders on, results from each instance will be written into separate folders named 1, 2, 3, and 4.
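The resulting Results Bucket layout can be sketched as follows; indexed_destination is an illustrative helper, not a platform function:

```python
# Sketch of the layout produced with "Generate indexed folders": each
# parallel instance writes into its own numbered folder instead of the
# shared root, so identically named results no longer collide.
def indexed_destination(instance_index, filename):
    """Illustrative only: indexed folders are named 1, 2, 3, ..."""
    return f"{instance_index}/{filename}"

for i in range(1, 5):
    print(indexed_destination(i, "output.csv"))
# 1/output.csv, 2/output.csv, 3/output.csv, 4/output.csv
```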
Files from folders and subfolders of the source (Data Asset or Capsule results) can be passed to the destination Capsule without preserving directory structure by adding ** to the source path. This is particularly useful when combined with the Flatten Connection Type (see Capsule F in the example above).
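The effect of adding ** can be sketched with Python's recursive glob; this is an analogy to the path mapping, not the platform's implementation:

```python
from pathlib import Path
import tempfile

# Build a small source tree with nested folders, then show how a "**" source
# path gathers files from every subfolder while discarding their hierarchy.
src = Path(tempfile.mkdtemp())
(src / "sub1").mkdir()
(src / "sub2" / "deep").mkdir(parents=True)
(src / "a.txt").write_text("a")
(src / "sub1" / "b.txt").write_text("b")
(src / "sub2" / "deep" / "c.txt").write_text("c")

# "**" matches files at any depth; keeping only the file name flattens the
# structure, as if every file were mapped directly into the destination
# Capsule's /data folder.
flattened = sorted(p.name for p in src.glob("**/*") if p.is_file())
print(flattened)  # ['a.txt', 'b.txt', 'c.txt']
```

Without the ** source path, b.txt and c.txt would arrive inside their original sub1 and sub2/deep folders.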