Column Generators
Column generators execute column generation in the Data Designer engine. A generator receives the upstream data needed for its task, returns row or batch data with generated values added, and reports the generation strategy the scheduler should use.
Related pages: column_configs, Build Your Own, Using Models in Plugins, and Custom Columns.
Configuration
User-facing column configs inherit from SingleColumnConfig and define a unique column_type discriminator. During compilation, the engine may group related configs into multi-column configs for generators that create sampler or seed columns together.
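The discriminator idea can be sketched with plain dataclasses. This is a minimal illustration, not the library's config machinery: SingleColumnConfigSketch, register_config, config_from_dict, and the "uppercase" column type are hypothetical names, and the real SingleColumnConfig API is not shown on this page.

```python
from dataclasses import dataclass
from typing import ClassVar

# Hypothetical registry mapping each discriminator value to its config class.
_CONFIG_REGISTRY: dict[str, type] = {}


@dataclass
class SingleColumnConfigSketch:
    """Stand-in for SingleColumnConfig (illustration only)."""

    name: str
    column_type: ClassVar[str] = ""


def register_config(cls: type) -> type:
    """Register a config class, rejecting missing or duplicate column_type values."""
    key = cls.column_type
    if not key:
        raise ValueError(f"{cls.__name__} must define column_type")
    if key in _CONFIG_REGISTRY:
        raise ValueError(f"column_type {key!r} is already registered")
    _CONFIG_REGISTRY[key] = cls
    return cls


@register_config
@dataclass
class UppercaseColumnConfig(SingleColumnConfigSketch):
    source_column: str = ""
    column_type: ClassVar[str] = "uppercase"


def config_from_dict(raw: dict) -> SingleColumnConfigSketch:
    """Pick the config class from the discriminator, then build it."""
    fields = dict(raw)  # copy so the caller's dict is not mutated
    cls = _CONFIG_REGISTRY[fields.pop("column_type")]
    return cls(**fields)
```

Because each column_type value maps to exactly one class, a serialized config can be turned back into the right config type without ambiguity.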
Generation strategy
Column generator base classes return GenerationStrategy values to tell the engine whether they run per row or over a full batch.
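A rough sketch of how an engine might dispatch on the reported strategy follows. The enum member names, the get_generation_strategy method name, and both toy generators are assumptions made for illustration; the batch is a list of dicts here only to keep the sketch dependency-free.

```python
from enum import Enum


class GenerationStrategySketch(str, Enum):
    # Member names are assumptions for illustration, not the engine's enum.
    CELL_BY_CELL = "cell_by_cell"
    FULL_COLUMN = "full_column"


class LengthPerRow:
    """Toy row-oriented generator."""

    def get_generation_strategy(self) -> GenerationStrategySketch:
        return GenerationStrategySketch.CELL_BY_CELL

    def generate(self, row: dict) -> dict:
        return {**row, "length": len(row["text"])}


class RankFullBatch:
    """Toy batch-oriented generator: ranking needs every row at once."""

    def get_generation_strategy(self) -> GenerationStrategySketch:
        return GenerationStrategySketch.FULL_COLUMN

    def generate(self, batch: list[dict]) -> list[dict]:
        order = sorted(range(len(batch)), key=lambda i: batch[i]["score"], reverse=True)
        ranks = {idx: rank for rank, idx in enumerate(order, start=1)}
        return [{**row, "rank": ranks[i]} for i, row in enumerate(batch)]


def run_generator(generator, rows: list[dict]) -> list[dict]:
    """Dispatch on the strategy the generator reports."""
    strategy = generator.get_generation_strategy()
    if strategy is GenerationStrategySketch.CELL_BY_CELL:
        # Row-oriented: one call per row mapping.
        return [generator.generate(dict(row)) for row in rows]
    # Batch-oriented: one call for the whole batch.
    return generator.generate([dict(row) for row in rows])
```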
Implementation bases
Generators that operate on a full batch can inherit from ColumnGeneratorFullColumn. Row-oriented non-model generators can inherit from ColumnGeneratorCellByCell. Generators that create initial rows use FromScratchColumnGenerator. Model-backed plugin generators should use ColumnGeneratorWithModelRegistry or ColumnGeneratorWithModel; see Using Models in Plugins for authoring guidance.
ColumnGenerator
Bases: ConfigurableTask[TaskConfigT], ABC
Methods:
| Name | Description |
|---|---|
| agenerate | Async generate: delegates to sync generate() via thread pool. |
| generate | Sync generate: overridden by most concrete generators. |
| get_scheduling_metadata | Return static scheduler metadata for this generator. |
| log_pre_generation | A shared method to log info before the generator's generate method is called. |
Attributes:
| Name | Type | Description |
|---|---|---|
| is_order_dependent | bool | Whether this generator's output depends on prior row-group calls. |
Source code in packages/data-designer-engine/src/data_designer/engine/configurable_task.py
is_order_dependent
property
Whether this generator's output depends on prior row-group calls.
Example: SeedDatasetColumnGenerator tracks its position in the seed dataset, so row group N must complete before N+1 starts.
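The ordering constraint can be illustrated with a small asyncio sketch. SeedReaderSketch, agenerate_group, and run_row_groups are hypothetical names, not engine APIs; the point is only that an order-dependent generator forces row groups to run one after another.

```python
import asyncio


class SeedReaderSketch:
    """Hypothetical order-dependent generator: keeps a cursor into seed rows."""

    is_order_dependent = True

    def __init__(self, seed_rows: list[str]) -> None:
        self._seed_rows = seed_rows
        self._position = 0

    async def agenerate_group(self, size: int) -> list[str]:
        start = self._position
        self._position += size
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        return self._seed_rows[start : start + size]


async def run_row_groups(generator, group_sizes: list[int]) -> list[list[str]]:
    if generator.is_order_dependent:
        # Row group N must finish before row group N+1 starts.
        results = []
        for size in group_sizes:
            results.append(await generator.agenerate_group(size))
        return results
    # Independent row groups may overlap freely.
    return list(await asyncio.gather(*(generator.agenerate_group(s) for s in group_sizes)))
```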
agenerate(data)
async
agenerate(data: dict) -> dict
agenerate(data: pd.DataFrame) -> pd.DataFrame
Async generate — delegates to sync generate() via thread pool.
Subclasses with native async support (e.g. ColumnGeneratorWithModelChatCompletion) should override this with a direct async implementation.
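A minimal sketch of the delegation pattern described above, assuming asyncio.to_thread as the thread-pool mechanism (the engine's actual mechanism may differ). SyncOnlyGeneratorSketch is a hypothetical class, not part of the library.

```python
import asyncio


class SyncOnlyGeneratorSketch:
    """Hypothetical generator that implements only the sync path."""

    def generate(self, data: dict) -> dict:
        return {**data, "greeting": f"hello {data['name']}"}

    async def agenerate(self, data: dict) -> dict:
        # Run the blocking generate() on a worker thread so the event loop stays free.
        return await asyncio.to_thread(self.generate, data)


async def main() -> list[dict]:
    gen = SyncOnlyGeneratorSketch()
    # Both calls are awaited together; each runs generate() on its own worker thread.
    return list(await asyncio.gather(gen.agenerate({"name": "a"}), gen.agenerate({"name": "b"})))
```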
Source code in packages/data-designer-engine/src/data_designer/engine/column_generators/generators/base.py
generate(data)
generate(data: dict) -> dict
generate(data: pd.DataFrame) -> pd.DataFrame
Sync generate — overridden by most concrete generators.
Default bridges to agenerate() for async-first subclasses that only
implement agenerate(). Raises NotImplementedError if neither
generate() nor agenerate() is overridden.
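One way such a two-way bridge could be written is sketched below. GeneratorBaseSketch and its override check are assumptions for illustration and are not the engine's code; asyncio.run is used for simplicity and only works when no event loop is already running in the calling thread.

```python
import asyncio


class GeneratorBaseSketch:
    """Hypothetical base showing a sync/async bridge."""

    def generate(self, data: dict) -> dict:
        # If a subclass overrode agenerate(), run it to completion from sync code.
        if type(self).agenerate is not GeneratorBaseSketch.agenerate:
            return asyncio.run(self.agenerate(data))
        raise NotImplementedError("override generate() or agenerate()")

    async def agenerate(self, data: dict) -> dict:
        # If a subclass overrode generate(), delegate to it on a worker thread.
        if type(self).generate is not GeneratorBaseSketch.generate:
            return await asyncio.to_thread(self.generate, data)
        raise NotImplementedError("override generate() or agenerate()")


class AsyncFirst(GeneratorBaseSketch):
    """Implements only the async path; generate() is inherited."""

    async def agenerate(self, data: dict) -> dict:
        return {**data, "doubled": data["value"] * 2}


class NeitherOverridden(GeneratorBaseSketch):
    """Overrides nothing, so both entry points raise."""
```

Checking which method was overridden keeps the two defaults from calling each other forever when a subclass implements neither.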
Source code in packages/data-designer-engine/src/data_designer/engine/column_generators/generators/base.py
get_scheduling_metadata()
Return static scheduler metadata for this generator.
Generators that do not declare model-backed behavior use the documented local default. Model-aware base classes override this with provider/model resource identity derived from registered model aliases.
Source code in packages/data-designer-engine/src/data_designer/engine/column_generators/generators/base.py
log_pre_generation()
A shared method to log info before the generator's generate method is called.
The idea is for dataset builders to call this method for all generators before calling their
generate method. This is to avoid logging the same information multiple times when running
generators in parallel.
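The calling pattern described above might look like the following sketch. LoggingGeneratorSketch and build are hypothetical names; the log is a plain list so the effect is easy to inspect.

```python
from concurrent.futures import ThreadPoolExecutor


class LoggingGeneratorSketch:
    """Hypothetical generator that records one pre-generation message."""

    def __init__(self, name: str, log: list[str]) -> None:
        self.name = name
        self._log = log

    def log_pre_generation(self) -> None:
        self._log.append(f"{self.name}: starting")

    def generate(self, data: dict) -> dict:
        return {**data, self.name: f"{self.name}-value"}


def build(generators: list[LoggingGeneratorSketch], rows: list[dict]) -> list[dict]:
    # Log once per generator, up front, before any parallel work begins.
    for gen in generators:
        gen.log_pre_generation()
    with ThreadPoolExecutor(max_workers=4) as pool:
        for gen in generators:
            # Each generator's rows run in parallel; generate() itself never logs.
            rows = list(pool.map(gen.generate, rows))
    return rows
```

With two generators and three rows, the log holds two entries rather than six.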
Source code in packages/data-designer-engine/src/data_designer/engine/column_generators/generators/base.py
ColumnGeneratorFullColumn
Bases: ColumnGenerator[TaskConfigT], ABC
Base class for column generators that transform a full batch at once.
Override generate to return the complete batch DataFrame after adding
generated values. Use this base when generation is vectorizable or when an
external API accepts batched input more efficiently than per-row calls.
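As a hedged illustration of the full-batch shape, the sketch below uses a standalone class instead of subclassing ColumnGeneratorFullColumn, since constructor and config details are not shown on this page. FullColumnSketch, the price and quantity columns, and the choice to copy the input frame are all assumptions.

```python
import pandas as pd


class FullColumnSketch:
    """Hypothetical stand-in for a ColumnGeneratorFullColumn subclass."""

    def __init__(self, column_name: str) -> None:
        self.column_name = column_name

    def generate(self, data: pd.DataFrame) -> pd.DataFrame:
        out = data.copy()  # assumption: leave the caller's frame untouched
        # Vectorized: one expression computes the new column for the whole batch.
        out[self.column_name] = out["price"] * out["quantity"]
        return out
```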
Methods:
| Name | Description |
|---|---|
| generate | Generate an entire batch of row outputs. |
Source code in packages/data-designer-engine/src/data_designer/engine/configurable_task.py
generate(data)
abstractmethod
Generate an entire batch of row outputs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | DataFrame containing the upstream columns this generator depends on. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame containing the input columns plus the new column and any side-effect columns. When [...] the input; when it is [...] |
Source code in packages/data-designer-engine/src/data_designer/engine/column_generators/generators/base.py
ColumnGeneratorCellByCell
Bases: ColumnGenerator[TaskConfigT], ABC
Base class for column generators invoked once per row.
Override generate to return the complete row mapping after adding the
generated value. The engine calls the generator once per row and may run
calls concurrently. Use this base when generation is independent per row
(e.g. an LLM call per row, a per-row transform).
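The per-row contract can be sketched without the engine. CellByCellSketch and run_per_row are hypothetical; a thread pool stands in for whatever concurrency the engine actually applies.

```python
from concurrent.futures import ThreadPoolExecutor


class CellByCellSketch:
    """Hypothetical stand-in for a ColumnGeneratorCellByCell subclass."""

    def generate(self, data: dict) -> dict:
        # Return the complete row: existing keys preserved, new value added.
        return {**data, "word_count": len(data["text"].split())}


def run_per_row(generator, rows: list[dict], max_workers: int = 4) -> list[dict]:
    # One generate() call per row; calls may overlap, so generate() must not
    # rely on shared mutable state.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generator.generate, rows))
```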
Methods:
| Name | Description |
|---|---|
| generate | Generate one row's output from a single row's upstream values. |
Source code in packages/data-designer-engine/src/data_designer/engine/configurable_task.py
generate(data)
abstractmethod
Generate one row's output from a single row's upstream values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | dict | Current row mapping containing the upstream values available to this column. | required |
Returns:
| Type | Description |
|---|---|
| dict | Complete row mapping with existing keys preserved and the new column value added. Include declared side-effect columns when the config creates them. |
Source code in packages/data-designer-engine/src/data_designer/engine/column_generators/generators/base.py
FromScratchColumnGenerator
Bases: ColumnGenerator[TaskConfigT], ABC
Methods:
| Name | Description |
|---|---|
| agenerate_from_scratch | Async wrapper: wraps sync generate_from_scratch() in a thread. |
Source code in packages/data-designer-engine/src/data_designer/engine/configurable_task.py
agenerate_from_scratch(num_records)
async
Async wrapper — wraps sync generate_from_scratch() in a thread.
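A minimal sketch of such a wrapper, assuming asyncio.to_thread as the threading mechanism. FromScratchSketch is hypothetical, and it returns a list of dicts only to stay dependency-free; the return type of the real method is not shown on this page.

```python
import asyncio


class FromScratchSketch:
    """Hypothetical generator that creates initial rows."""

    def generate_from_scratch(self, num_records: int) -> list[dict]:
        return [{"row_id": i} for i in range(num_records)]

    async def agenerate_from_scratch(self, num_records: int) -> list[dict]:
        # Run the sync implementation on a worker thread and await its result.
        return await asyncio.to_thread(self.generate_from_scratch, num_records)
```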
Source code in packages/data-designer-engine/src/data_designer/engine/column_generators/generators/base.py
ColumnGeneratorWithModelRegistry
Bases: ColumnGenerator[TaskConfigT], ABC
Source code in packages/data-designer-engine/src/data_designer/engine/configurable_task.py
ColumnGeneratorWithModel
Bases: ColumnGeneratorWithModelRegistry[TaskConfigT], ABC
Source code in packages/data-designer-engine/src/data_designer/engine/configurable_task.py