API documentation

Overview

The main purpose of Romanesco is to execute a broad range of tasks. These tasks, along with a set of input bindings and output bindings, are passed to the romanesco.run() function, which fetches the inputs as necessary, executes the task, and finally populates any output variables and sends them to their destination.

The task, its inputs, and its outputs are each passed into the function as Python dictionaries. This section describes the structure of each of those dictionaries.

The task specification

The first argument to romanesco.run() describes the task to execute, independently of the actual data that it will be executed upon. The most important field of the task is the mode, which describes what type of task it is. The structure for the task dictionary is described below. Uppercase names within angle brackets represent symbols defined in the specification. Optional parts of the specification are surrounded by parentheses to avoid ambiguity with the square brackets, which represent lists in Python or Arrays in JSON. The Python task also accepts a write_script parameter that, when set to 1, writes task scripts to disk before executing them. This makes the scripts easier to follow in interactive debuggers such as pdb.

<TASK> ::= <PYTHON_TASK> | <R_TASK> | <DOCKER_TASK> | <WORKFLOW_TASK>

<PYTHON_TASK> ::= {
    "mode": "python",
    "script": <python code to run as a string>
    (, "inputs": [<TASK_INPUT> (, <TASK_INPUT>, ...)])
    (, "outputs": [<TASK_OUTPUT> (, <TASK_OUTPUT>, ...)])
    (, "write_script": 1)
}

<R_TASK> ::= {
    "mode": "r",
    "script": <r code to run (as a string)>
    (, "inputs": [<TASK_INPUT> (, <TASK_INPUT>, ...)])
    (, "outputs": [<TASK_OUTPUT> (, <TASK_OUTPUT>, ...)])
}

<DOCKER_TASK> ::= {
    "mode": "docker",
    "docker_image": <docker image name to run>
    (, "container_args": [<container arguments>])
    (, "entrypoint": <custom override for container entry point>)
    (, "inputs": [<TASK_INPUT> (, <TASK_INPUT>, ...)])
    (, "outputs": [<TASK_OUTPUT> (, <TASK_OUTPUT>, ...)])
}

<WORKFLOW_TASK> ::= {
    "mode": "workflow",
    "steps": [<WORKFLOW_STEP> (, <WORKFLOW_STEP>, ...)],
    "connections": [<WORKFLOW_CONNECTION> (, <WORKFLOW_CONNECTION>, ...)]
    (, "inputs": [<TASK_INPUT> (, <TASK_INPUT>, ...)])
    (, "outputs": [<TASK_OUTPUT> (, <TASK_OUTPUT>, ...)])
}

<WORKFLOW_STEP> ::= {
    "name": <step name>,
    "task": <TASK>
}

<WORKFLOW_CONNECTION> ::= {
    ("name": <name of top-level input to bind to>)
    (, "input": <input id to bind to for a step>)
    (, "input_step": <input step name to connect>)
    (, "output_step": <output step name to connect>)
}

The workflow mode simply allows for a directed acyclic graph of tasks to be specified to romanesco.run().
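As a rough illustration, a two-step workflow might be written as the dictionary below. The step names, scripts, and the "number" type/format values are invented for the example, and the connection entries use only the keys listed in the grammar above. How a step output is tied to a top-level output is not covered by that grammar, so the sketch leaves it out. The check built on the standard-library graphlib module (Python 3.9+) is not part of Romanesco; it only shows that the step-to-step connections form a directed acyclic graph.

```python
from graphlib import TopologicalSorter


def python_step(name, script, in_id, out_id):
    """Build a workflow step wrapping a one-input, one-output Python task."""
    port = {"type": "number", "format": "number"}  # assumed type/format names
    return {
        "name": name,
        "task": {
            "mode": "python",
            "script": script,
            "inputs": [dict(port, id=in_id)],
            "outputs": [dict(port, id=out_id)],
        },
    }


workflow = {
    "mode": "workflow",
    "steps": [
        python_step("double", "y = x * 2", "x", "y"),
        python_step("increment", "z = y + 1", "y", "z"),
    ],
    "connections": [
        # Bind the top-level input "x" to input "x" of step "double".
        {"name": "x", "input": "x", "input_step": "double"},
        # Feed step "double" into input "y" of step "increment".
        {"output_step": "double", "input": "y", "input_step": "increment"},
    ],
    "inputs": [{"id": "x", "type": "number", "format": "number"}],
}


def step_order(task):
    """Return step names in dependency order (graphlib.CycleError on a cycle)."""
    sorter = TopologicalSorter({step["name"]: set() for step in task["steps"]})
    for conn in task["connections"]:
        # Only step-to-step connections contribute edges to the graph.
        if "output_step" in conn and "input_step" in conn:
            sorter.add(conn["input_step"], conn["output_step"])
    return list(sorter.static_order())


print(step_order(workflow))
```

A connection that pointed "increment" back into "double" would make step_order() raise graphlib.CycleError, which is the situation the acyclic requirement rules out.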

<TASK_INPUT> ::= {
    "id": <string, the variable name>,
    "type": <data type>,
    "format": <data format>
    (, "default": <default value if none is bound at runtime>)
    (, "target": <INPUT_TARGET_TYPE>)   ; default is "memory"
    (, "filename": <name of file if target="filepath">)
}

<INPUT_TARGET_TYPE> ::= "memory" | "filepath"

<TASK_OUTPUT> ::= {
    "id": <string, the variable name>,
    "type": <data type>,
    "format": <data format>
    (, "target": <INPUT_TARGET_TYPE>)   ; default is "memory"
}
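To make the grammar concrete, here is a minimal sketch of a Python-mode task with one file-backed input and one in-memory input. The script, the ids, and the "number" type/format names are assumptions made for the example; the helper input_targets() is hypothetical and only demonstrates the documented default of "memory" for the target field.

```python
# Script text for the task; "numbers" is expected to hold a file path
# because the matching input below uses target="filepath".
script = """
with open(numbers) as fh:
    total = sum(int(line) for line in fh) * scale
"""

task = {
    "mode": "python",
    "script": script,
    "inputs": [
        {"id": "numbers", "type": "string", "format": "text",
         "target": "filepath", "filename": "numbers.txt"},
        # No "target" key here, so the default ("memory") applies.
        {"id": "scale", "type": "number", "format": "number"},
    ],
    "outputs": [
        {"id": "total", "type": "number", "format": "number"},
    ],
    "write_script": 1,
}


def input_targets(task):
    """Map each task input id to its target, applying the "memory" default."""
    return {spec["id"]: spec.get("target", "memory")
            for spec in task.get("inputs", [])}


print(input_targets(task))
```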

The input specification

The inputs argument to romanesco.run() specifies the inputs to the task described by the task argument. Specifically, it tells what data should be placed into the task input ports.

<INPUTS> ::= {
    <id> : <INPUT_BINDING>
    (, <id> : <INPUT_BINDING>)
    (, ...)
}

The input spec is a dictionary mapping each id (corresponding to the id key of each task input) to its data binding for this execution.

<INPUT_BINDING> ::= <INPUT_BINDING_HTTP> | <INPUT_BINDING_LOCAL> |
                    <INPUT_BINDING_MONGODB> | <INPUT_BINDING_INLINE>

<INPUT_BINDING_HTTP> ::= {
    "mode": "http",
    "format": <data format>,
    "url": <url of data to download>
    (, "headers": <dict of HTTP headers to send when fetching>)
    (, "method": <http method to use, default is "GET">)
    (, "maxSize": <integer, max size of download in bytes>)
}

The http input mode specifies that the data should be fetched over HTTP. Depending on the target field of the corresponding task input specifier, the data will either be passed in memory, or streamed to a file on the local filesystem, and the variable will be set to the path of that file.

<INPUT_BINDING_LOCAL> ::= {
    "mode": "local",
    "format": <data format>,
    "path": <path on local filesystem to the file>
}

The local input mode denotes that the data exists on the local filesystem. Its contents will be read into memory and the variable will point to those contents.

<INPUT_BINDING_MONGODB> ::= {
    "mode": "mongodb",
    "format": <data format>,
    "db": <the database to use>,
    "collection": <the collection to fetch from>
    (, "host": <mongodb host, default is "localhost">)
}

The mongodb input mode specifies that the data should be fetched from a mongo collection. This simply binds the entire BSON-encoded collection to the input variable.

<INPUT_BINDING_INLINE> ::= {
    "mode": "inline",
    "format": <data format>,
    "data": <data to bind to the variable>
}

The inline input mode simply passes the data directly in the input binding dictionary as the value of the “data” key. Do not use this for any data that could be large.
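The binding modes above can be combined freely in one input specification. The following sketch shows one possible inputs dictionary; the ids, URL, path, header value, and format names are placeholders, and missing_keys() is a hypothetical helper (not a Romanesco function) that checks each binding against the non-optional keys from the grammar.

```python
inputs = {
    "numbers": {
        "mode": "http",
        "format": "text",
        "url": "https://example.com/numbers.txt",  # placeholder URL
        "headers": {"Accept": "text/plain"},
        "maxSize": 1024 * 1024,  # refuse downloads larger than 1 MiB
    },
    "lookup": {
        "mode": "local",
        "format": "json",
        "path": "/tmp/lookup.json",  # placeholder path
    },
    "scale": {
        "mode": "inline",
        "format": "number",
        "data": 3,  # small values only; inline data travels in the dict itself
    },
}

# Non-optional keys per mode, copied from the grammar (besides "mode").
REQUIRED_KEYS = {
    "http": {"format", "url"},
    "local": {"format", "path"},
    "mongodb": {"format", "db", "collection"},
    "inline": {"format", "data"},
}


def missing_keys(bindings):
    """Map each input id to the required keys its binding lacks."""
    problems = {}
    for input_id, binding in bindings.items():
        missing = REQUIRED_KEYS[binding["mode"]] - set(binding)
        if missing:
            problems[input_id] = sorted(missing)
    return problems


print(missing_keys(inputs))
```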

The output specification

The optional outputs argument to romanesco.run() specifies output variables of the task that should be handled in some way.

<OUTPUTS> ::= {
    <id> : <OUTPUT_BINDING>
    (, <id> : <OUTPUT_BINDING>)
    (, ...)
}

The output spec is a dictionary mapping each id (corresponding to the id key of each task output) to some behavior that should be performed with it. Task outputs that do not have bindings in the output spec simply get their results set in the return value of romanesco.run().

<OUTPUT_BINDING> ::= <OUTPUT_BINDING_HTTP> | <OUTPUT_BINDING_LOCAL> |
                     <OUTPUT_BINDING_MONGODB>

<OUTPUT_BINDING_HTTP> ::= {
    "mode": "http",
    "url": <url to upload data to>,
    "format": <data format>
    (, "headers": <dict of HTTP headers to send with the request>)
    (, "method": <http method to use, default is "POST">)
}

<OUTPUT_BINDING_LOCAL> ::= {
    "mode": "local",
    "format": <data format>,
    "path": <path to write data on the local filesystem>
}

The local output mode writes the data to the specified path on the local filesystem.

<OUTPUT_BINDING_MONGODB> ::= {
    "mode": "mongodb",
    "db": <mongo database to write to>,
    "format": <data format>,
    "collection": <mongo collection to write to>
    (, "host": <mongo host to connect to>)
}

The mongodb output mode attempts to BSON-decode the bound data, and then overwrites any data in the specified collection with the output data.
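A short sketch of an output specification follows. The ids, path, database and collection names, and format names are placeholders. The hypothetical helper unbound_outputs() illustrates the rule stated earlier in this section: task outputs without a binding are the ones whose results are set in the return value of romanesco.run().

```python
task = {
    "mode": "python",
    "script": "total = 42\nreport = 'done'\nrows = []",
    "outputs": [
        {"id": "total", "type": "number", "format": "number"},
        {"id": "report", "type": "string", "format": "text"},
        {"id": "rows", "type": "table", "format": "rows"},
    ],
}

outputs = {
    # Write the report to a file on the local filesystem.
    "report": {"mode": "local", "format": "text", "path": "/tmp/report.txt"},
    # Overwrite the contents of a MongoDB collection with the rows.
    "rows": {"mode": "mongodb", "format": "bson",
             "db": "analysis", "collection": "results"},
}


def unbound_outputs(task, outputs=None):
    """Ids of task outputs that have no entry in the output specification."""
    bound = outputs or {}
    return [spec["id"] for spec in task.get("outputs", [])
            if spec["id"] not in bound]


print(unbound_outputs(task, outputs))
```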

Script execution

romanesco.convert(type, input, output, **kwargs)

Convert data from one format to another.

Parameters:
  • type – The type specifier string of the input data.
  • input – A binding dict of the form {"format": format, "data": data}, where format is the format specifier string, and data is the raw data to convert. The dict may also be of the form {"format": format, "uri": uri}, where uri is the location of the data (see romanesco.uri for URI formats).
  • output – A binding of the form {"format": format}, where format is the format specifier string to convert the data to. The binding may also be in the form {"format": format, "uri": uri}, where uri specifies where to place the converted data.
Returns:

The output binding dict with an additional field "data" containing the converted data. If "uri" is present in the output binding, instead saves the data to the specified URI and returns the output binding unchanged.
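The call shape described above might be wrapped as in the sketch below. To keep the example runnable without Romanesco installed, the conversion function is passed in as an argument and exercised with fake_convert(), a stand-in written for this example that only mimics the documented argument and return shapes. The "text" and "json" format names and the behavior of the stand-in are assumptions.

```python
import json


def convert_inline(convert, type_name, data, source_format, target_format):
    """Convert in-memory data and return only the converted payload."""
    input_binding = {"format": source_format, "data": data}
    output_binding = {"format": target_format}
    result = convert(type_name, input_binding, output_binding)
    # Without a "uri" in the output binding, the result carries "data".
    return result["data"]


def fake_convert(type_name, input, output, **kwargs):
    """Stand-in for romanesco.convert: returns the output binding plus "data"."""
    if (input["format"], output["format"]) == ("text", "json"):
        return dict(output, data=json.dumps(input["data"]))
    raise ValueError("this stand-in only handles text -> json")


print(convert_inline(fake_convert, "string", "hello", "text", "json"))
```

With Romanesco available, romanesco.convert would take the place of fake_convert in the call.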

romanesco.isvalid(type, binding, **kwargs)

Determine whether a data binding is of the appropriate type and format.

Parameters:
  • type – The expected type specifier string of the binding.
  • binding – A binding dict of the form {"format": format, "data": data}, where format is the format specifier string, and data is the raw data to test. The dict may also be of the form {"format": format, "uri": uri}, where uri is the location of the data (see romanesco.uri for URI formats).
Returns:

True if the binding matches the type and format, False otherwise.

romanesco.load(task_file)

Load a task JSON file into memory, resolving any "script_uri" fields by replacing each with a "script" field containing the contents pointed to by "script_uri" (see romanesco.uri for URI formats). A script_fetch_mode field may also be set.

Parameters:
  • task_file – The path to the JSON file to load.
Returns:

The task as a dictionary.

romanesco.register_executor(name, fn)

Register a new executor in the romanesco runtime. This is used to map the “mode” field of a task to a function that will execute the task.

Parameters:
  • name (str) – The value of the mode field that maps to the given function.
  • fn (function) – The implementing function.

romanesco.run(*args, **kwargs)

Run a Romanesco task with the specified I/O bindings.

Parameters:
  • task (dict) – Specification of the task to run.
  • inputs – Specification of how input objects should be fetched into the runtime environment of this task.
  • outputs (dict) – Specification of what should be done with outputs of this task.
  • auto_convert – If True (the default), perform format conversions on inputs and outputs with convert() if they do not match the formats specified in the input and output bindings. If False, an exception is raised if input or output bindings do not match the formats specified in the analysis.
  • validate – If True (the default), perform input and output validation with isvalid() to ensure input bindings are in the appropriate format and outputs generated by the script are formatted correctly. This guards against dirty input as well as buggy scripts that do not produce the correct type of output. An invalid input or output will raise an exception. If False, perform no validation.
  • write_script – If True, task scripts will be written to file before being passed to exec. This improves interactive debugging with tools such as pdb at the cost of additional file I/O. Note that when this is passed to run(), all tasks will be written to file, including validation and conversion tasks.
Returns:

A dictionary of the form name: binding where name is the name of the output and binding is an output binding of the form {"format": format, "data": data}. If the outputs param is specified, the formats of these bindings will match those given in outputs. Additionally, "data" may be absent if an output URI was provided. Instead, those outputs will be saved to that URI and the output binding will contain the location in the "uri" field.
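The return value described above can be consumed as in the following sketch. The sample dictionary is hand-written to match the documented shape rather than produced by an actual run, the output names and the URI string are placeholders, and split_results() is a hypothetical helper.

```python
def split_results(results):
    """Separate in-memory outputs from outputs that were saved to a URI."""
    in_memory, saved = {}, {}
    for name, binding in results.items():
        if "data" in binding:
            in_memory[name] = binding["data"]
        else:
            # "data" is absent when the output was written to a URI.
            saved[name] = binding.get("uri")
    return in_memory, saved


# With Romanesco installed, this might instead come from something like:
#   results = romanesco.run(task, inputs=inputs, outputs=outputs)
sample_results = {
    "total": {"format": "number", "data": 42},
    "report": {"format": "text", "uri": "file:///tmp/report.txt"},
}

values, locations = split_results(sample_results)
print(values)
print(locations)
```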

romanesco.unregister_executor(name)

Unregister an executor from the map.

Parameters:
  • name (str) – The name of the executor to unregister.

Formats

class romanesco.format.Validator

type

The validator type, like string.

format

The validator format, like text.

romanesco.format.converter_path(source, target)

Gives the shortest path that should be taken to go from a source type/format to a target type/format.

Raises a NetworkXNoPath exception if it cannot find a path.

Parameters:
  • source – Validator tuple indicating the type/format being converted from.
  • target – Validator tuple indicating the type/format being converted to.
Returns:

An ordered list of the analyses that need to be run to convert from source to target.
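The idea of a shortest conversion path can be sketched with a plain breadth-first search over a toy graph. This stands in for the graph search that the NetworkXNoPath exception suggests Romanesco delegates to NetworkX; it is not the library's code. The formats in the toy graph are invented, the sketch returns the validators visited rather than the analyses to run, and it raises LookupError instead of NetworkXNoPath.

```python
from collections import deque, namedtuple

Validator = namedtuple("Validator", ["type", "format"])

CSV = Validator("table", "csv")
ROWS = Validator("table", "rows")
TSV = Validator("table", "tsv")
TEXT = Validator("string", "text")

# Toy conversion graph: each key converts directly to the listed targets.
CONVERTERS = {
    CSV: [ROWS],
    ROWS: [CSV, TSV],
    TSV: [ROWS],
    TEXT: [],
}


def shortest_conversion(source, target):
    """Breadth-first search for the fewest conversions from source to target."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in CONVERTERS.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    raise LookupError("no conversion path from %s to %s" % (source, target))


print(shortest_conversion(CSV, TSV))
```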

romanesco.format.get_validator(validator)

Gets a validator node from the conversion graph by its type and format.

>>> validator = get_validator(Validator('string', 'text'))

Returns a tuple containing 2 elements

>>> len(validator)
2

First is the Validator namedtuple

>>> validator[0]
Validator(type=u'string', format=u'text')

and second is the validator itself

>>> validator[1].keys()
['validator', 'type', 'format']

If the validator doesn't exist, an exception will be raised

>>> get_validator(Validator('foo', 'bar'))
Traceback (most recent call last):
    ...
Exception: No such validator foo/bar

Parameters:
  • validator – A Validator namedtuple.
Returns:

A tuple including the passed Validator namedtuple, with a second element being the analysis data.

romanesco.format.has_converter(source, target=Validator(type=None, format=None))

Determines if any converters exist from a given type, and optionally format.

Underneath, this just traverses the edges until it finds one which matches the arguments.

Parameters:
  • source – Validator tuple indicating the type/format being converted from.
  • target – Validator tuple indicating the type/format being converted to.
Returns:

True if it can convert from source to target, False otherwise.

romanesco.format.import_converters(search_paths)

Import converters and validators from the specified search paths. These functions are loaded into romanesco.format.conv_graph with nodes representing validators, and directed edges representing converters.

Any files in a search path matching validate_*.json are loaded as validators. Validators should be fast (ideally O(1)) algorithms for determining if data is of the specified format. These are algorithms that have a single input named "input" and a single output named "output". The input has the type and format to be checked. The output must have type and format "boolean". The script performs the validation and sets the output variable to either true or false.

Any *_to_*.json files are imported as converters. A converter is simply an analysis with one input named "input" and one output named "output". The input and output should have matching type but should be of different formats.
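The file-naming rules and the validator contract above can be sketched as follows. The classify() helper, the file names, and the validator script are invented for the example; the order in which Romanesco itself tests the two patterns is not stated here, so checking validate_*.json first is an assumption. The sample validator uses the "id" key from the task grammar for the names "input" and "output".

```python
from fnmatch import fnmatchcase


def classify(filenames):
    """Split file names into validators and converters by naming pattern."""
    validators, converters = [], []
    for name in filenames:
        if fnmatchcase(name, "validate_*.json"):
            validators.append(name)
        elif fnmatchcase(name, "*_to_*.json"):
            converters.append(name)
    return validators, converters


# A validator analysis: one input "input", one boolean output "output".
validator_task = {
    "mode": "python",
    "script": "output = isinstance(input, str)",
    "inputs": [{"id": "input", "type": "string", "format": "text"}],
    "outputs": [{"id": "output", "type": "boolean", "format": "boolean"}],
}

found = classify(["validate_text.json", "text_to_json.json", "README.md"])
print(found)
```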

Parameters:
  • search_paths (str or list of str) – A list of search paths relative to the current working directory. Passing a single path as a string also works.

romanesco.format.import_default_converters()

Import converters from the default search paths. This is called when the romanesco.format module is first loaded.

romanesco.format.print_conversion_graph()

Print a graph of supported conversion paths in DOT format to standard output.

romanesco.format.print_conversion_table()

Print a table of supported conversion paths in CSV format with "from" and "to" columns to standard output.

URIs