# Sifter Playbook Schema
This document provides a comprehensive description of the Sifter Playbook format, its input methods (extractors), and its transformation steps.
## Playbook Structure
A Playbook is a YAML file that defines an ETL pipeline.
| Field | Type | Description |
|---|---|---|
| `class` | string | Should be `sifter`. |
| `name` | string | Unique name of the playbook. |
| `docs` | string | Documentation string for the playbook. |
| `outdir` | string | Default output directory for emitted files. |
| `params` | map | Configuration variables with optional defaults and types (`File`, `Dir`). |
| `inputs` | map | Named extractor definitions. |
| `outputs` | map | Named output definitions. |
| `pipelines` | map | Named transformation pipelines (arrays of steps). |
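As a sketch, a minimal playbook combining these fields might look like the following (names and paths are illustrative, not part of the spec):

```yaml
class: sifter
name: example-playbook        # illustrative name
docs: Load a TSV of samples and emit them as JSON
outdir: ./output

params:
  inputFile:
    type: File
    default: ./data/samples.tsv

inputs:
  samples:                    # named extractor
    table:
      path: ./data/samples.tsv

pipelines:
  transform:                  # named pipeline: an array of steps
    - from: samples
    - emit:
        name: samples
```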
## Parameters (params)

Parameters make playbooks configurable at run time. They are defined in the `params` section of the playbook YAML file.

### Params Syntax

#### Supported Types

- `File`: Represents a file path
- `Dir`: Represents a directory path
```yaml
params:
  inputDir:
    type: Dir
    default: "./data/input"
  outputDir:
    type: Dir
    default: "./data/output"
  schemaFile:
    type: File
    default: "./config/schema.json"
```
## Input Methods (Extractors)
Extractors produce a stream of messages from various sources.
### table
Loads data from a delimited file (TSV/CSV).
- path: Path to the file.
- rowSkip: Number of header rows to skip.
- columns: Optional list of column names.
- extraColumns: Field name to store any columns beyond the declared ones.
- sep: Separator (defaults to `\t` for TSV and `,` for CSV).
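For instance, a hypothetical CSV input whose header row is skipped and whose columns are named explicitly:

```yaml
inputs:
  variants:
    table:
      path: ./data/variants.csv   # illustrative path
      sep: ","
      rowSkip: 1                  # skip the header row
      columns: [chrom, pos, ref, alt]
      extraColumns: extra         # any trailing columns land here
```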
### json
Loads data from a JSON file (standard or line-delimited).
- path: Path to the file.
- multiline: Treat the file as a single JSON document spanning multiple lines, rather than line-delimited JSON.
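A sketch reading a single multi-line JSON document (path invented for illustration):

```yaml
inputs:
  records:
    json:
      path: ./data/records.json   # illustrative path
      multiline: true
```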
### avro
Loads data from an Avro object file.
- path: Path to the file.
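A corresponding sketch:

```yaml
inputs:
  calls:
    avro:
      path: ./data/calls.avro   # illustrative path
```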
### xml
Loads and parses XML data.
- path: Path to the file.
- level: Depth level to start breaking XML into discrete messages.
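A sketch that, per the description above, breaks the document into messages at depth 2 (path invented for illustration):

```yaml
inputs:
  articles:
    xml:
      path: ./data/articles.xml   # illustrative path
      level: 2                    # emit one message per element at this depth
```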
### sqlite
Loads data from a SQLite database.
- path: Path to the database file.
- query: SQL SELECT statement.
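A sketch (the database path and table name are illustrative):

```yaml
inputs:
  genes:
    sqlite:
      path: ./data/annotations.sqlite
      query: "SELECT id, symbol, chrom FROM genes"   # one message per result row
```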
### transpose
Loads a TSV and transposes it (making rows from columns).
- input: Path to the file.
- rowSkip: Rows to skip.
- sep: Separator.
- useDB: Use a temporary disk database for large transpositions.
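A sketch transposing a large matrix, spilling to disk via useDB (file name invented):

```yaml
inputs:
  expression:
    transpose:
      input: ./data/expression_matrix.tsv   # illustrative path
      sep: "\t"
      useDB: true                            # use a temporary disk database
```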
### plugin (Extractor)
Runs an external command that produces JSON messages to stdout.
- commandLine: The command to execute.
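A sketch; the script is hypothetical and is assumed to write one JSON message per line to stdout:

```yaml
inputs:
  external:
    plugin:
      commandLine: python scripts/fetch_records.py   # hypothetical command
```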
### embedded (Extractor)

Loads data from a structure embedded directly in the playbook; no parameters are required.
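A sketch, assuming the records are listed inline as the value of the `embedded` key:

```yaml
inputs:
  constants:
    embedded:                 # records written directly in the playbook (assumed form)
      - { id: A, value: 1 }
      - { id: B, value: 2 }
```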
### glob (Extractor)

Scans for files using a `*`-based glob pattern and opens every match as input.
- path: Glob pattern matching the files to load.
- storeFilename: Name of the field in which to store each row's source filename.
- xml: Load each matched file as XML.
- table: Load each matched file as a TSV/CSV table.
- json: Load each matched file as JSON.
- avro: Load each matched file as an Avro object file.
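A sketch matching a directory of TSV batches and tagging each row with its source file; the nested table block is assumed to accept the same fields as the standalone table extractor:

```yaml
inputs:
  batches:
    glob:
      path: "./data/batches/*.tsv"   # illustrative glob pattern
      storeFilename: source_file     # each row records where it came from
      table:                         # assumed: same fields as the table extractor
        sep: "\t"
```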
## Transformation Steps
Transformation pipelines are arrays of steps. Each step can be one of the following:
### Core Processing

- `from`: Start a pipeline from a named input or another pipeline.
- `emit`: Write messages to a JSON file. Fields: `name`, `useName` (bool).
- `objectValidate`: Validate messages against a JSON schema. Fields: `title`, `schema` (directory), `uri`.
- `debug`: Print message contents to stdout. Fields: `label`, `format`.
- `plugin` (Transform): Pipe messages through an external script via stdin/stdout. Fields: `commandLine`.
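As a sketch, a pipeline wiring these core steps together (the input name `samples` is carried over from earlier examples):

```yaml
pipelines:
  transform:
    - from: samples      # read from the named input
    - debug:
        label: raw       # print each message to stdout
    - emit:
        name: samples    # write messages to a JSON output file
```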
### Mapping and Projection

- `project`: Map templates into new fields. Fields: `mapping` (key-template pairs), `rename` (simple rename).
- `map`: Apply a Python/GPython function to each record. Fields: `method` (function name), `python` (code string), `gpython` (path or code).
- `flatMap`: Similar to `map`, but flattens list responses into multiple messages.
- `fieldParse`: Parse a string field (e.g. `key1=val1;key2=val2`) into individual keys. Fields: `field`, `sep`.
- `fieldType`: Cast fields to specific types (`int`, `float`, `list`). Represented as a map of `fieldName: type`.
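A sketch combining these steps; the `{{row.field}}` template syntax and the field names are illustrative assumptions:

```yaml
pipelines:
  transform:
    - from: samples
    - project:
        mapping:
          id: "sample:{{row.sample_id}}"   # assumed template syntax
    - fieldType:
        age: int                           # cast the age field to an integer
    - emit:
        name: samples
```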
### Filtering and Cleaning

- `filter`: Drop messages based on criteria. Fields: `field`, `value`, `match`, `check` (`exists`/`hasValue`/`not`), or `python`/`gpython` code.
- `clean`: Remove fields. Fields: `fields` (list of kept fields), `removeEmpty` (bool), `storeExtra` (target field for extras).
- `dropNull`: Remove fields with `null` values from a message.
- `distinct`: Emit only the first message seen for each unique value. Field: `value` (template).
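A sketch chaining several of these steps; the field names are invented, and `dropNull` is written with an empty map since it takes no documented fields:

```yaml
pipelines:
  transform:
    - from: samples
    - filter:
        field: status
        check: hasValue        # keep only messages where status has a value
    - dropNull: {}             # strip null fields (assumed empty-map form)
    - distinct:
        value: "{{row.id}}"    # assumed template syntax
    - emit:
        name: samples
```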
### Grouping and Lookups

- `reduce`: Merge messages sharing a key. Fields: `field` (key), `method`, `python`/`gpython`, `init` (initial data).
- `accumulate`: Group all messages sharing a key into a list. Fields: `field` (key), `dest` (target list field).
- `lookup`: Join data from external files (TSV/JSON). Fields: `tsv`, `json`, `replace`, `lookup`, `copy` (mapping of fields to copy).
- `intervalIntersect`: Match genomic intervals. Fields: `match` (CHR), `start`, `end`, `field` (dest), `json` (source file).
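A sketch that groups variant calls by gene; all names are invented:

```yaml
pipelines:
  transform:
    - from: calls
    - accumulate:
        field: gene        # key to group on
        dest: calls        # list field holding the grouped messages
    - emit:
        name: calls_by_gene
```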
### Specialized

- `hash`: Generate a hash of a field. Fields: `field` (dest), `value` (template), `method` (`md5`, `sha1`, `sha256`).
- `uuid`: Generate a UUID. Fields: `field`, `value` (seed), `namespace`.
- `graphBuild`: Convert messages into graph vertices and edges using schema definitions. Fields: `schema`, `title`.
- `tableWrite`: Write specific fields to a delimited output file. Fields: `output`, `columns`, `sep`, `header`, `skipColumnHeader`.
- `split`: Split a single message into multiple messages based on a list field.
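A sketch that hashes a composite value and writes selected columns to a delimited file; all names and the template syntax are illustrative:

```yaml
pipelines:
  transform:
    - from: samples
    - hash:
        field: checksum                   # destination field
        value: "{{row.id}}:{{row.date}}"  # assumed template syntax
        method: sha256
    - tableWrite:
        output: sample_checksums          # illustrative output name
        columns: [id, checksum]
        sep: "\t"
```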