18. Autodoc Dataset Classes

class accelerator.dataset.Dataset(jobid, name=None)

Represents a dataset. Is also a string ‘jobid/name’, or just ‘jobid’ if name is ‘default’ (for better backwards compatibility).

You usually don’t have to make these yourself, because datasets.foo is already a Dataset instance.

You can pass jobid="jid/name" or jobid="jid", name="name", or skip name completely for "default".

You can also pass jobid={jid: dsname} to resolve dsname from the datasets passed to jid. This gives NoDataset if that option was unset.

These decay to a (unicode) string when pickled.
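The accepted argument forms and the resulting string can be modelled in a few lines of plain Python. This is only a sketch of the rules described above, not the library's code; split_spec, as_string and the job id "test-0" are hypothetical names used for illustration.

```python
def split_spec(jobid, name=None):
    """Return (jobid, name) for the argument forms described above."""
    if name is None:
        if "/" in jobid:
            # jobid="jid/name" carries the name inside the string
            jobid, name = jobid.split("/", 1)
        else:
            # name skipped completely means "default"
            name = "default"
    return jobid, name


def as_string(jobid, name):
    """String form: 'jobid/name', or just 'jobid' for the default name."""
    return jobid if name == "default" else f"{jobid}/{name}"


for args in (("test-0/foo",), ("test-0", "foo"), ("test-0",)):
    print(args, "->", as_string(*split_spec(*args)))
```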

property columns

{name: DatasetColumn}

iterate(sliceno, columns=None, range=None, sloppy_range=False, hashlabel=None, pre_callback=None, post_callback=None, filters=None, translators=None, status_reporting=True, rehash=False, slice=None, copy_mode=False)

Iterate just this dataset. See .iterate_list for details.

iterate_chain(sliceno, columns=None, length=-1, range=None, sloppy_range=False, reverse=False, hashlabel=None, stop_ds=None, pre_callback=None, post_callback=None, filters=None, translators=None, status_reporting=True, rehash=False, slice=None, copy_mode=False)

Iterate a list of datasets. See .chain and .iterate_list for details.

static iterate_list(sliceno, columns, datasets, range=None, sloppy_range=False, hashlabel=None, pre_callback=None, post_callback=None, filters=None, translators=None, status_reporting=True, rehash=False, slice=None, copy_mode=False)

Iterator over the specified columns from datasets (iterable of dataset-specifiers, or single dataset-specifier). pre_callback and post_callback are called before and after each dataset is iterated.

filters decide which rows to include and can be a callable (called with the candidate tuple), or a dict {name: filter}. In the latter case each individual filter is called with the column value, or if the filter is None the column value is used directly. All filters must say yes to yield a row. Examples:

filters={'some_col': some_dict.get}
filters={'some_col': some_set.__contains__}
filters={'some_col': some_str.__eq__}
filters=lambda line: line[0] == line[1]
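The filter rules can be illustrated with a small stand-alone model. This is a sketch of the described behaviour on plain tuples, not the iterator's implementation; row_passes and the sample data are made up for the example.

```python
def row_passes(row, names, filters):
    """Model of the filter rules: tuple-callable, or dict of per-column filters."""
    if callable(filters):
        # A callable filter sees the whole candidate tuple.
        return bool(filters(row))
    for name, value in zip(names, row):
        if name in filters:
            f = filters[name]
            # A None filter uses the column value itself as the verdict.
            verdict = value if f is None else f(value)
            if not verdict:
                return False  # all filters must say yes
    return True


names = ("key", "count")
rows = [("a", 1), ("b", 0), ("c", 5)]
wanted = {"a", "b"}

by_dict = [r for r in rows
           if row_passes(r, names, {"key": wanted.__contains__, "count": None})]
by_callable = [r for r in rows if row_passes(r, names, lambda line: line[1] > 0)]

print(by_dict)
print(by_callable)
```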

translators transform data values. It can be a callable (called with the candidate tuple and expected to return a tuple of the same length) or a dict {name: translation}. Each translation can be a function (called with the column value and returning the new value) or dict. Items missing in the dict yield None, which can be removed with filters={'col': None}.

Translators run before filters.
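Because translators run first, a dict translation combined with a None filter drops the rows whose value had no translation. The following is a minimal model of that interaction on plain tuples; the helper name translate and the data are assumptions for illustration only.

```python
def translate(value, translation):
    """Apply one per-column translation: a dict lookup or a function call."""
    if isinstance(translation, dict):
        return translation.get(value)  # missing items yield None
    return translation(value)


names = ("country", "amount")
rows = [("se", "10"), ("xx", "7"), ("no", "3")]
translators = {"country": {"se": "Sweden", "no": "Norway"}, "amount": int}
filters = {"country": None}  # None filter: keep rows where the value is true

result = []
for row in rows:
    # Step 1: translators
    row = tuple(
        translate(v, translators[n]) if n in translators else v
        for n, v in zip(names, row)
    )
    # Step 2: filters, applied to the already translated values
    values = dict(zip(names, row))
    if all(values[n] if f is None else f(values[n]) for n, f in filters.items()):
        result.append(row)

print(result)
```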

You can also pass a single name (a str) as columns, in which case you don’t get a tuple back (just the values). Tuple-filters/translators also get just the value in this case (column versions are unaffected).

If you pass a false value for columns you get all columns in name order.

If you pass sliceno=None you get all slices. If you pass sliceno="roundrobin" you also get all slices, but one value at a time across slices. (This can be used to iterate a csvimport in the order of the original file.)
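Why round robin restores file order can be seen in a toy model. It assumes the lines were originally dealt out to the slices in turn (line 0 to slice 0, line 1 to slice 1, and so on), and it is only a sketch of the idea, not the library's iterator.

```python
from itertools import chain, zip_longest

lines = ["l0", "l1", "l2", "l3", "l4", "l5", "l6"]
slices = 3  # hypothetical slice count

# Assumed import behaviour: line i ends up in slice i % slices.
per_slice = [lines[s::slices] for s in range(slices)]

# sliceno=None style: all of slice 0, then all of slice 1, ...
all_slices = list(chain.from_iterable(per_slice))

# "roundrobin" style: one value from each slice in turn.
_missing = object()  # padding marker for slices that run out first
roundrobin = [
    v
    for v in chain.from_iterable(zip_longest(*per_slice, fillvalue=_missing))
    if v is not _missing
]

print(all_slices)
print(roundrobin)
```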

If you specify a hashlabel and rehash=False (the default) you will get an error if a dataset does not use the specified hashlabel. If you specify rehash=True such datasets will be rehashed during iteration. You should usually build a new rehashed dataset (using the dataset_hashpart method), but this is available for when it makes sense.

range limits which rows you see. Specify {colname: (start, stop)} and only rows where start <= colvalue < stop will be returned. If you set sloppy_range=True you may get all rows from datasets that contain any rows you asked for. (This can be faster.)
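The difference between an exact and a sloppy range can be sketched with lists standing in for datasets. The model assumes that the sloppy variant decides per dataset from its min and max values; that mechanism is an assumption made for the illustration, not something stated above.

```python
# Three "datasets", each just a list of values for the range column.
datasets = [[1, 5, 9], [10, 12, 19], [20, 25]]
start, stop = 8, 13

# Exact range: only rows where start <= value < stop.
exact = [v for ds in datasets for v in ds if start <= v < stop]


def overlaps(ds):
    """True if the dataset could contain any value in [start, stop)."""
    return min(ds) < stop and max(ds) >= start


# Sloppy range (assumed model): whole datasets that overlap the range.
sloppy = [v for ds in datasets if overlaps(ds) for v in ds]

print(exact)
print(sloppy)
```

The sloppy result is a superset of the exact one, so the caller still has to tolerate (or filter) the extra rows.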

status_reporting should normally be left as True, which will give you information about this iteration in ^T, but there is one case where you need to turn it off: If you manually zip a bunch of iterators, only one should do status reporting. (Otherwise it looks like you have nested iteration in ^T, and you will get warnings about incorrect ending order of statuses.)

slice takes a slice object defining which lines you want returned. Negative offsets are allowed. Both negative and positive offsets outside the range of the specified datasets give an error. This is equivalent to islice(iterate_list(…), start, stop, step) except it’s faster, allows negative offsets and errors on too big offsets. For convenience you can also pass slice=<some int> which is equivalent to slice(<some int>, None). Note that this is not the same as slice(<some int>), but more useful here.
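The slice conventions mentioned above follow ordinary Python behaviour, which can be checked on a plain list. This sketch only demonstrates the standard library semantics being referred to; it does not call the dataset iterator.

```python
from itertools import islice

rows = list(range(10))

# For non-negative offsets a slice object and islice select the same rows.
same = list(islice(rows, 2, 8, 3)) == rows[slice(2, 8, 3)]

# slice=<some int> is described as slice(<some int>, None): skip the first n rows.
skip_seven = rows[slice(7, None)]
# slice(<some int>) alone would instead mean "the first n rows".
first_seven = rows[slice(7)]

# Negative offsets work with slice objects on a list...
last_three = rows[slice(-3, None)]
# ...but islice rejects them.
try:
    list(islice(rows, -3, None))
    islice_accepts_negative = True
except ValueError:
    islice_accepts_negative = False

print(same, skip_seven, first_seven, last_three, islice_accepts_negative)
```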

copy_mode makes no promises about the returned types. Use it together with copy_mode on a DatasetWriter for faster copying. Not compatible with columns changing types across the list. Also not compatible with filters or translators.

Use this to expose a subjob as a dataset in your job: Dataset(subjid).link_to_here() will allow access to the subjob dataset under your jid. You can rename columns using rename={oldname: newname}, and discard columns with rename={name: None} and/or by specifying column_filter as an iterable of columns to include. column_filter applies after rename. Use override_previous to rechain (or unchain) the dataset. You can change the filename too, or clear it by setting it to ''.

merge(other, name='default', previous=None, allow_unrelated=False)

Merge this and other dataset. Columns from other take priority. If datasets do not have a common ancestor you get an error unless allow_unrelated is set. The new dataset always has the previous specified here (even if None). Returns the new dataset.
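The column priority rule behaves like a dict update, which the following sketch shows with plain dicts standing in for the two datasets' columns. The column names and type labels are made up for the example, and nothing here touches real datasets.

```python
# {colname: type} for the two datasets being merged (illustrative values)
this_columns = {"a": "int64", "b": "unicode"}
other_columns = {"b": "ascii", "c": "float64"}

# Columns from other take priority over columns with the same name in this.
merged = {**this_columns, **other_columns}

print(merged)
```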

static new(columns, filenames, compressions, lines, minmax={}, filename=None, hashlabel=None, caption=None, previous=None, name='default')

columns = {"colname": "type"}, lines = [n, …] or {sliceno: n}

class accelerator.dataset.DatasetWriter(columns={}, filename=None, hashlabel=None, hashlabel_override=False, caption=None, previous=None, name='default', parent=None, meta_only=False, for_single_slice=None, copy_mode=False, allow_missing_slices=False)

Create in prepare, use in analysis. Or do the whole thing in synthesis.

You can pass these through prepare_res, or get them by trying to create a new writer in analysis (don’t specify any arguments except an optional name).

There are three writing functions with different arguments:

dw.write_dict({column: value})
dw.write_list([value, value, …])
dw.write(value, value, …)

Values are in the same order as you add()ed the columns (which is in sorted order if you passed a dict). The dw.write() function names the arguments from the columns too.
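How the three calling conventions relate to the column order can be shown with a toy stand-in. The functions below only mimic the argument handling described above for columns passed as a dict; they are not the DatasetWriter methods, and the column names are hypothetical.

```python
columns = {"ts": "datetime", "amount": "int64", "user": "unicode"}

# Columns passed as a dict are used in sorted name order.
order = sorted(columns)


def write_list(values):
    """Positional values, matched to columns in order."""
    return dict(zip(order, values))


def write(*values):
    """Same as write_list, but as separate arguments."""
    return write_list(values)


def write_dict(row):
    """Values looked up by column name."""
    return {name: row[name] for name in order}


a = write_list([10, "2024-01-01", "alice"])
b = write(10, "2024-01-01", "alice")
c = write_dict({"user": "alice", "ts": "2024-01-01", "amount": 10})
print(order, a == b == c)
```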

If you set a column type to None that column is not inherited from the parent dataset. (Only works as an init argument, not with dw.add.)

If you want support for None values in a column you can pass none_support=True to dw.add, or {colname: (coltype, True)} to the constructor. If you pass a DatasetColumn (from ds.columns[name]) you will inherit both type and None-support of that column. In dw.add the none_support argument takes precedence over (tuple/DatasetColumn in) the coltype argument.

If you set hashlabel you can use dw.hashcheck(v) to check if v belongs in this slice. You can also call enable_hash_discard (in each slice, or after each set_slice), then the writer will discard anything that does not belong in this slice.

If you are not in analysis and you wish to use the functions above you need to call dw.set_slice(sliceno) first.

If you do not call set_slice, you can instead get one of the splitting writer functions, which select the slice to use based on hashlabel, or round robin if there is no hashlabel.

dw.get_split_write_dict()({column: value})
dw.get_split_write_list()([value, value, …])
dw.get_split_write()(value, value, …)

These should of course be assigned to a local name for performance.
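The slice selection done by hashcheck and the splitting writers can be sketched as follows. CRC32 is used purely as a stand-in hash and SLICES is a made-up slice count; the library's actual hash function and internals are not shown here.

```python
import zlib
from itertools import cycle

SLICES = 3  # hypothetical slice count


def slice_for(value):
    # Stand-in hash (CRC32), chosen only because it is stable between runs.
    return zlib.crc32(str(value).encode("utf-8")) % SLICES


def hashcheck(value, sliceno):
    """True if value belongs in sliceno under the stand-in hash."""
    return slice_for(value) == sliceno


def split_by_hash(values):
    """With a hashlabel: route every value to the slice chosen by its hash."""
    out = [[] for _ in range(SLICES)]
    for v in values:
        out[slice_for(v)].append(v)
    return out


def split_round_robin(values):
    """Without a hashlabel: hand values to the slices in turn."""
    out = [[] for _ in range(SLICES)]
    for sliceno, v in zip(cycle(range(SLICES)), values):
        out[sliceno].append(v)
    return out


users = ["alice", "bob", "carol", "dave"]
print(split_by_hash(users))
print(split_round_robin(users))
```

A slice that only keeps values where hashcheck is true ends up with the same content as the hash discard behaviour described above would leave it.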

It is permitted (but probably useless) to mix the different write functions, or the different split functions, but you can only use one of the two kinds: either write functions or split functions.

You can also use dw.writers[colname] to get a typed_writer and use it as you please. The one belonging to the hashlabel will be filtering, and returns True if this is the right slice.

If you need to handle everything yourself, set meta_only=True and use dw.column_filename(colname) to find the right files to write to. In this case you also need to call dw.set_lines(sliceno, count) and dw.set_compressions(compression) (or dw.set_compressions({colname: compression}) if not all columns use the same compression) before finishing. You should also call dw.set_minmax(sliceno, {colname: (min, max)}) if you can.

If you are just copying from another dataset you can set copy_mode both here and in the iterator for that dataset for faster copying.

enable_hash_discard()

Make the write functions silently discard data that does not hash to the current slice.

finish()

Normally you don’t need to call this, but if you want to pass yourself as a dataset to a subjob you need to call this first.

class accelerator.dataset.DatasetList(iterable=(), /)

These are lists of datasets with some convenience methods.

column_count(column, types=None, none_support=None)

How many datasets in this chain contain column. Optionally only considers column to exist if it is of a desired type, and/or has/lacks none support.

column_counts()

Counter {colname: occurrences}
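The two counting methods can be modelled with collections.Counter over a list of per-dataset column descriptions. The data layout below ({colname: (type, none_support)}) and the function bodies are assumptions made for the sketch, and types is treated as a set of type names here.

```python
from collections import Counter

# Each "dataset" is {colname: (type, none_support)}; values are illustrative.
chain = [
    {"a": ("int64", False), "b": ("unicode", True)},
    {"a": ("int64", False)},
    {"a": ("float64", True), "c": ("unicode", False)},
]


def column_counts(chain):
    """Counter of how many datasets contain each column name."""
    return Counter(name for ds in chain for name in ds)


def column_count(chain, column, types=None, none_support=None):
    """Datasets containing column, optionally restricted by type / None support."""
    n = 0
    for ds in chain:
        if column not in ds:
            continue
        coltype, has_none = ds[column]
        if types is not None and coltype not in types:
            continue
        if none_support is not None and has_none != none_support:
            continue
        n += 1
    return n


print(column_counts(chain))
print(column_count(chain, "a", types={"int64"}))
```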

filter(predicate)

Same list but only with datasets for which predicate(ds) is true.

iterate(sliceno, columns=None, range=None, sloppy_range=False, hashlabel=None, pre_callback=None, post_callback=None, filters=None, translators=None, status_reporting=True, rehash=False, slice=None, copy_mode=False)

Iterate the datasets in this chain. See Dataset.iterate_list for usage.

lines(sliceno=None)

Number of rows in this chain, optionally for a specific slice.

max(column)

Max value for column over the whole chain. Will be None if no dataset in the chain contains column, if all datasets are empty, or if column has a type without min/max tracking.

min(column)

Min value for column over the whole chain. Will be None if no dataset in the chain contains column, if all datasets are empty, or if column has a type without min/max tracking.

none_support(column)

Whether any dataset in the chain has None support for this column.

range(colname, start=None, stop=None)

Keep only the datasets where colname has values in range(start, stop).

with_column(column, types=None, none_support=None)

Chain without any datasets that don’t contain column. Optionally only considers column to exist if it is of a desired type, and/or has/lacks none support.

class accelerator.dataset.DatasetChain(iterable=(), /)