Map Paths
The Map Paths menu controls how data flows from a data asset to a capsule and between capsules. To open it, click the gear icon ⚙️ on any connection in the pipeline.
From the Map Paths menu, the source and destination paths can be changed and a Connection Type can be selected. Configuring this menu properly will ensure that each capsule receives the necessary data and that the pipeline is optimized for parallelization.
Users can customize the flow of data in a pipeline by specifying source and destination paths. Source paths define which files should be transferred to the destination capsule, while destination paths specify where these files should be stored in the destination capsule. Multiple mappings can be used to provide additional configuration.
For example, Capsule A generates many different types of files in its results folder. Using the following source and destination mappings, all files with the extension .zip and .html will be sent to Capsule B's data folder in folders called zip_files and html_files, respectively. Any file without a .zip or .html extension will be ignored by Capsule B.
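The routing rule in this example can be sketched in Python. This is only an illustration of the behavior described above, not the platform's implementation: the `route_files` function, the `*.zip` and `*.html` patterns, the first-match rule, and the sample filenames are all assumptions made for the sketch.

```python
from fnmatch import fnmatchcase

# Hypothetical mappings for illustration: (source pattern, destination folder).
MAPPINGS = [
    ("*.zip", "zip_files"),
    ("*.html", "html_files"),
]


def route_files(filenames, mappings):
    """Map each matching source file to its path in the destination data folder."""
    routed = {}
    for name in filenames:
        for pattern, dest_folder in mappings:
            if fnmatchcase(name, pattern):
                routed[name] = f"{dest_folder}/{name}"
                break  # assumption: the first matching mapping is used
    return routed


# Hypothetical contents of Capsule A's results folder.
capsule_a_results = ["archive.zip", "report.html", "log.txt"]
routed = route_files(capsule_a_results, MAPPINGS)
for source, destination in routed.items():
    print(f"{source} -> {destination}")
```

In this sketch, `log.txt` matches no mapping and so never appears in `routed`, mirroring how Capsule B ignores files without a matching extension.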
Default
Data asset to capsule: each item will be distributed to a parallel instance of the capsule.
Capsule to capsule: a destination capsule instance will be executed for every instance of the source capsule.
Collect
Data asset to capsule: the entire data asset will be made available to all parallel instances of the destination capsule.
Capsule to capsule: the source data will be made available as a whole to all parallel instances of the destination capsule.
If there is only one input (data asset or capsule) to the destination capsule and the Connection Type is Collect, there will only be one instance of the destination capsule.
If the source data consists of a single item that is needed by all instances of the destination capsule, Collect should be used. Otherwise, only one instance of the destination capsule will receive the source data.
Collect was formerly called Global.
Flatten
Data asset to capsule: same as Default.
Capsule to capsule: the source data will be split such that every item is passed separately into each parallel instance of the destination capsule.
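The capsule-to-capsule behavior of the Default, Collect, and Flatten connection types can be compared with a small Python model. It is a sketch under stated assumptions rather than a description of the scheduler: it assumes the source capsule is the only input to the destination capsule, and the `destination_inputs` function and the file names are hypothetical.

```python
def destination_inputs(source_outputs, connection_type):
    """Return one list of input items per destination capsule instance.

    source_outputs holds one list of result items per source capsule instance.
    Assumption: the source capsule is the destination's only input.
    """
    if connection_type == "Default":
        # One destination instance per source instance.
        return [list(items) for items in source_outputs]
    if connection_type == "Collect":
        # With a single input, one destination instance receives everything.
        return [[item for items in source_outputs for item in items]]
    if connection_type == "Flatten":
        # One destination instance per individual item.
        return [[item] for items in source_outputs for item in items]
    raise ValueError(f"unknown connection type: {connection_type}")


# Hypothetical results from 3 parallel instances of a source capsule.
source_outputs = [["a1.csv", "a2.csv"], ["b1.csv", "b2.csv"], ["c1.csv", "c2.csv"]]
instance_counts = {
    ctype: len(destination_inputs(source_outputs, ctype))
    for ctype in ("Default", "Collect", "Flatten")
}
print(instance_counts)  # {'Default': 3, 'Collect': 1, 'Flatten': 6}
```

With three source instances producing two items each, the model yields three destination instances for Default, one for Collect, and six for Flatten.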
This example shows how items from two data assets will be distributed across parallel instances of a capsule.
The left side shows the pipeline schematic where two data assets are connected to a single capsule. Each data asset contains 3 items: nums contains 3 files, and alpha contains 2 files and 1 folder.
On the right is a table showing the distribution of input items across parallel capsule instances for each connection type combination.
Parallel capsule instances will be created based on the number of items in the data asset with fewer items. Items from the other data asset will be randomly distributed to parallel instances, with extra items being left out of the computation.
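The pairing rule above can be sketched in Python for two data assets of unequal size. The `pair_items` function, the `letters` asset with four items, and the file names are hypothetical, and the use of a seeded random generator is only a stand-in for whatever random distribution the platform performs.

```python
import random


def pair_items(asset_a, asset_b, seed=None):
    """Pair items from two data assets, one pair per parallel capsule instance."""
    rng = random.Random(seed)
    shorter, longer = sorted((asset_a, asset_b), key=len)
    # The asset with fewer items sets the number of instances; a random
    # subset of the larger asset is used and the extra items are left out.
    chosen = rng.sample(longer, len(shorter))
    return list(zip(shorter, chosen))


nums = ["1.txt", "2.txt", "3.txt"]
letters = ["a.txt", "b.txt", "c.txt", "d.txt"]  # hypothetical larger asset
instances = pair_items(nums, letters, seed=0)
for num_item, letter_item in instances:
    print(num_item, letter_item)
```

Three instances are created because `nums` has three items, and one item from `letters` is left out of the computation.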
This example shows how results generated by a source capsule (Capsule A) affect the execution of a destination capsule (Capsules B-F) when different connection types are used. It also shows how source mappings can be used in combination with connection types to further customize the pipeline execution.
In this scenario, there are 3 parallel instances of Capsule A each producing the same output. The number of capsule icons represents the number of parallel instances.
When many capsules are connected to the Results Bucket, it's helpful to write each capsule's output to a uniquely named folder. This can be achieved by opening the Map Paths menu and adding a folder name to the destination path.
For example, Capsule B's results will be written to a folder called Capsule_B.
If there are multiple instances of Capsule B each producing results with the same name, they will overwrite each other in the Results Bucket unless Generate indexed folders is used.
The Map Paths menu between a capsule and the Results Bucket has a Generate indexed folders switch. By default, files are written to the Results Bucket in the same way they are written to the capsule's results folder. Turning on Generate indexed folders writes the files from each instance of the source capsule to its own folder.
For example, if there are three instances of the source capsule and Generate indexed folders is on, results from each instance will be written into separate folders named 1, 2, and 3.
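The difference between the two settings can be modeled with a dictionary standing in for the Results Bucket. This is a minimal sketch, not the platform's storage logic: the `write_to_results_bucket` function and the `summary.csv` file name are hypothetical, and which instance's file survives an overwrite is an artifact of the sketch.

```python
def write_to_results_bucket(instance_results, indexed_folders=False):
    """Model the Results Bucket as a dict of path -> content."""
    bucket = {}
    for index, files in enumerate(instance_results, start=1):
        for name, content in files.items():
            path = f"{index}/{name}" if indexed_folders else name
            bucket[path] = content  # an existing path is overwritten
    return bucket


# Three hypothetical instances that each produce a file with the same name.
instance_results = [
    {"summary.csv": "run 1"},
    {"summary.csv": "run 2"},
    {"summary.csv": "run 3"},
]
flat_bucket = write_to_results_bucket(instance_results)
indexed_bucket = write_to_results_bucket(instance_results, indexed_folders=True)
print(sorted(flat_bucket))     # ['summary.csv']
print(sorted(indexed_bucket))  # ['1/summary.csv', '2/summary.csv', '3/summary.csv']
```

Without indexed folders only one `summary.csv` remains; with them, all three results are kept.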
Files from folders and subfolders of the source (data asset or capsule results) can be passed to the destination capsule without preserving directory structure by adding ** to the source path. This is particularly useful when combined with the Flatten Connection Type (see Capsule F in the capsule-to-capsule example above).