datalad-foreach-dataset - Man Page

run a command or Python code on the dataset and/or each of its sub-datasets.

Synopsis

datalad foreach-dataset [-h] [--cmd-type {auto|external|exec|eval}] [-d DATASET] [--state {present|absent|any}] [-r] [-R LEVELS] [--contains PATH] [--bottomup] [-s] [--output-streams {capture|pass-through|relpath}] [--chpwd {ds|pwd}] [--safe-to-consume {auto|all-subds-done|superds-done|always}] [-J NJOBS] [--version] ...

Description

This command provides a convenience for cases where no dedicated DataLad command is provided to operate across the hierarchy of datasets. It is very similar to the `git submodule foreach` command, with the following major differences:

- by default (unless --subdatasets-only is given) it also operates on the original dataset,
- subdatasets can be traversed in bottom-up order,
- commands can be executed in parallel (see the --jobs option) while still respecting the order, e.g. in bottom-up order the command is executed in a super-dataset only after it has been executed in all of its subdatasets.
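The difference between top-down and bottom-up order can be illustrated on a toy hierarchy. This is a minimal sketch in plain Python, not DataLad's implementation; the `tree` mapping and the `walk` helper are hypothetical names introduced only for illustration.

```python
# Toy dataset hierarchy: each key maps to its direct subdatasets.
# (Hypothetical layout, used only to illustrate ordering.)
tree = {
    "super": ["super/a", "super/b"],
    "super/a": ["super/a/x"],
    "super/a/x": [],
    "super/b": [],
}


def walk(ds, bottomup=False):
    """Yield dataset names top-down (default) or bottom-up."""
    if not bottomup:
        yield ds  # visit the super-dataset before its subdatasets
    for sub in tree[ds]:
        yield from walk(sub, bottomup=bottomup)
    if bottomup:
        yield ds  # visit the super-dataset only after all its subdatasets


top_down = list(walk("super"))
bottom_up = list(walk("super", bottomup=True))
print(top_down)
print(bottom_up)
```

In the bottom-up listing every dataset appears after all of the datasets nested below it, which is the ordering constraint that parallel execution is described as respecting.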

Additional notes:

- for execution of "external" commands we use the environment used to execute external git and git-annex commands.

Command format

--cmd-type external: A few placeholders are supported in the command via Python format specification:

- "{pwd}" will be replaced with the full path of the current working directory.
- "{ds}" and "{refds}" will provide instances of the dataset currently operated on and the reference "context" dataset which was provided via the `dataset` argument. Therefore, {ds.path} returns the path to the submodule and {ds.id} produces the submodule's datalad ID. The placeholder "refds" is used in the same way. See the API documentation of "Dataset" for more options.
- "{tmpdir}" will be replaced with the full path of a temporary directory.
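How attribute placeholders such as {ds.path} expand can be seen with Python's own `str.format`. The following is a minimal sketch under stated assumptions: `FakeDataset` is a hypothetical stand-in for the real "Dataset" class, and all paths and IDs are made-up values.

```python
from dataclasses import dataclass


@dataclass
class FakeDataset:
    """Hypothetical stand-in for a DataLad Dataset; illustration only."""
    path: str
    id: str


# Made-up values for the dataset being operated on and the reference dataset.
ds = FakeDataset(path="/data/super/sub1", id="0000-aaaa")
refds = FakeDataset(path="/data/super", id="0000-bbbb")

# A command template using the placeholders described above.
template = "echo {ds.path} {ds.id} under {refds.path} from {pwd}"

# Python format syntax resolves attribute access inside the braces;
# keyword arguments that the template does not use are ignored.
expanded = template.format(
    ds=ds, refds=refds, pwd="/data/super", tmpdir="/tmp/example"
)
print(expanded)
```

Under Python's format rules, literal braces in a template are written doubled ("{{" and "}}"); whether a given command needs this depends on how it is passed, so treat that as something to verify rather than a guarantee.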

Examples

Aggressively git clean all datasets, running 5 parallel jobs:

% datalad foreach-dataset -r -J 5 git clean -dfx

Options

COMMAND: command for execution. A leading '--' can be used to disambiguate this command from the preceding options to DataLad. For --cmd-type exec or eval only a single command argument (Python code) is supported.
-h, --help, --help-np: show this help message. --help-np forcefully disables the use of a pager for displaying the help message
--cmd-type {auto|external|exec|eval}: type of the command. 'external': to be run in a child process using the dataset's runner; 'exec': Python source code to execute using 'exec()', no value returned; 'eval': Python source code to evaluate using 'eval()', return value is placed into the 'result' field; 'auto': if used via the Python API and `cmd` is a Python function, it will use 'eval', and otherwise it assumes 'external'. Constraints: value must be one of ('auto', 'external', 'exec', 'eval') [Default: 'auto']
-d DATASET, --dataset DATASET: specify the dataset to operate on. If no dataset is given, an attempt is made to identify the dataset based on the input and/or the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE
--state {present|absent|any}: indicate which (sub)datasets to consider: either only locally present, absent, or any of those two kinds. Constraints: value must be one of ('present', 'absent', 'any') [Default: 'present']
-r, --recursive: if set, recurse into potential subdatasets.
-R LEVELS, --recursion-limit LEVELS: limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type 'int' or value must be NONE
--contains PATH: limit to the subdatasets containing the given path. If a root path of a subdataset is given, the last considered dataset will be the subdataset itself. This option can be given multiple times, in which case datasets that contain any of the given paths will be considered. Constraints: value must be a string or value must be NONE
--bottomup: whether to report subdatasets in bottom-up order along each branch in the dataset tree, and not top-down.
-s, --subdatasets-only: whether to exclude top level dataset. It is implied if a non-empty CONTAINS is used.
--output-streams {capture|pass-through|relpath}, --o-s {capture|pass-through|relpath}: ways to handle outputs. 'capture': capture outputs from 'cmd' and return them in the record ('stdout', 'stderr'); 'pass-through': pass outputs to the screen (they are thus absent from the returned record); 'relpath': prefix captured output with the relative path (similar to what grep does) and write it to stdout and stderr. With 'relpath', the path is relative to the top of the dataset if DATASET is specified, and otherwise relative to the current directory. Constraints: value must be one of ('capture', 'pass-through', 'relpath') [Default: 'pass-through']
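--chpwd {ds|pwd}: 'ds' will change the working directory to the top of the corresponding dataset. With 'pwd' no change of working directory will happen. Constraints: value must be one of ('ds', 'pwd') [Default: 'ds']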
--safe-to-consume {auto|all-subds-done|superds-done|always}: Important only in the case of parallel (jobs greater than 1) execution. 'all-subds-done' instructs to not consider a superdataset until the command has finished execution in all of its subdatasets (this is the value used for 'auto' if traversal is bottom-up). 'superds-done' instructs to not process subdatasets until the command has finished in the super-dataset (this is the value used for 'auto' if traversal is not bottom-up, which is the default). With 'always' there is no constraint on whether to execute in a sub- or super-dataset first. Constraints: value must be one of ('auto', 'all-subds-done', 'superds-done', 'always') [Default: 'auto']
-J NJOBS, --jobs NJOBS: how many parallel jobs (where possible) to use. "auto" corresponds to the number defined by the 'datalad.runtime.max-annex-jobs' configuration item. NOTE: This option can only parallelize input retrieval (get) and output recording (save). DataLad does NOT parallelize your scripts for you. Constraints: value must be convertible to type 'int' or value must be NONE or value must be one of ('auto',)
--version: show the module and its version which provides the command
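
The distinction between the 'exec' and 'eval' command types mirrors Python's built-in exec() and eval(). The following is a minimal sketch in plain Python, independent of DataLad; the `namespace` dictionary and its `path` entry are illustrative assumptions, not the variables DataLad actually exposes to the code.

```python
# Hypothetical variables made available to the code being run.
namespace = {"path": "/data/super/sub1"}

# eval() evaluates a single expression and returns its value,
# which is why an 'eval' command can produce a result.
eval_result = eval("path.upper()", namespace)

# exec() runs statements and always returns None; anything it computes
# is only observable through side effects such as assigned names.
exec_result = exec("n = len(path)", namespace)

print(eval_result)
print(exec_result, namespace["n"])
```

An expression is therefore the natural fit when a value should end up in the returned record, while statements with side effects fit 'exec'.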

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

Info

2026-06-08 datalad foreach-dataset 1.5.0