9. Advanced Script Features
9.1. Depend on extra files
A job script may import and execute code located in other files. Such
files can be included in the build check hash calculation as well.
This will ensure that a change to an imported file will indeed force a
re-execution of the job script when a build is requested. Specify
additional files using the depend_extra list early in the job
script, as in this example
from . import my_python_module
depend_extra = (my_python_module, 'my_other_file',)
As seen above, it is possible to specify either imported Python module objects or filenames relative to the job script’s location.
If Exax suspects that a depend_extra-statement is missing, it will
suggest adding it by printing a message in the output log like this
====================================================================
WARNING: dev.a_test should probably depend_extra on my_python_module
====================================================================
Note
depend_extra will add the external source code to the
current job directory and include it in the hash computation.
To keep operation fast and limit disk occupied by file copies, do not depend_extra on more than is actually needed.
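The effect of depend_extra on the build check can be pictured with a small stand-alone sketch. It is not Exax's actual hashing scheme: the helper combined_hash, the choice of SHA-1, and the file contents are assumptions made for illustration only. The point is simply that once extra files are folded into the hash, editing any of them changes the result.

```python
import hashlib
import os
import tempfile


def combined_hash(script_path, extra_paths=()):
    """Hash a script together with extra files.

    Hypothetical helper, not how Exax computes its hash.
    """
    h = hashlib.sha1()
    # Sort the extra files so their order in the list does not matter.
    for path in [script_path] + sorted(extra_paths):
        with open(path, 'rb') as fh:
            h.update(fh.read())
    return h.hexdigest()


with tempfile.TemporaryDirectory() as d:
    script = os.path.join(d, 'a_test.py')
    extra = os.path.join(d, 'my_other_file')
    with open(script, 'w') as fh:
        fh.write("def synthesis():\n    pass\n")
    with open(extra, 'w') as fh:
        fh.write("version 1\n")

    before = combined_hash(script, [extra])
    script_only_before = combined_hash(script)

    # Change only the extra file; the job script itself is untouched.
    with open(extra, 'w') as fh:
        fh.write("version 2\n")

    after = combined_hash(script, [extra])
    script_only_after = combined_hash(script)

print(before != after)                          # the extra file is tracked
print(script_only_before == script_only_after)  # without it, no change is seen
```

Both lines print True: the hash that includes the extra file reacts to the edit, while the script-only hash does not.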
9.2. Equivalent Hashes
A change to a job script’s source code will cause a new job to be built upon running .build(), but sometimes it is desirable to modify the source code without causing a re-build. This happens, for example, when new comments are added to an existing job script and re-computing all jobs would take too long. If the functionality is known to remain the same after a change, existing jobs do not strictly need to be re-built. For this special situation, there is an equivalent_hashes dictionary that can be used to manually specify which versions of the source code are equivalent. Exax helps create this dictionary, if needed. Here is how it works.
1. Find the hash <old_hash> of the existing job in that job’s setup.json.
2. Add the following line to the updated job script’s source code
    equivalent_hashes = {'whatever': (<old_hash>,)}
3. Run the build script. The server will print something like
    ===========================================================
    WARNING: test_methods.a_test_rechain has equivalent_hashes, but missing verifier <current_hash>
    ===========================================================
4. Copy <current_hash> into the equivalent_hashes:
    equivalent_hashes = {<current_hash>: (<old_hash>,)}
This line now states that current_hash is equivalent to old_hash, so if a job with the old hash exists, the job script will not be built again. Note that the right-hand side of the dictionary entry is a sequence (a tuple in this example), so any number of equivalent versions of the source code can be listed.
From time to time, this has been used during development of Exax’s standard_methods, but for everyday work it should probably be avoided.
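The lookup that equivalent_hashes enables can be sketched in plain Python. This is a toy model, not Exax's implementation: the function needs_rebuild and the hash strings are hypothetical, and the dictionary key is assumed to be the current hash, as in the final step above.

```python
def needs_rebuild(current_hash, existing_job_hashes, equivalent_hashes=None):
    """Return True if no existing job matches the current source code.

    Hypothetical helper: a job counts as reusable if it was built from
    the current hash or from any hash declared equivalent to it.
    """
    acceptable = {current_hash}
    if equivalent_hashes:
        # The dictionary maps the current hash to older, equivalent ones.
        acceptable.update(equivalent_hashes.get(current_hash, ()))
    return not (acceptable & set(existing_job_hashes))


existing = ['aaa111']  # hash stored in the old job's setup.json (made up)

# Without equivalent_hashes, the edited script would be built again.
print(needs_rebuild('bbb222', existing))  # True

# Declaring the old hash as equivalent lets the existing job be reused.
print(needs_rebuild('bbb222', existing, {'bbb222': ('aaa111',)}))  # False
```

In this model, an entry whose key does not match the current hash has no effect, which loosely mirrors why the placeholder key 'whatever' has to be replaced by <current_hash>.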
9.3. Accessing a Job’s parameters
All job parameters are available in the Job.params dict. It
contains a lot of information, including start and end timestamps,
Exax version, Python version, location of paths defined in the
configuration file, and more.
job = urd.build('example')
print(job.params)
print(job.params.starttime)
print(job.params.options)
print(job.params.jobs)
9.4. Accessing a Job’s post data
The post data is written when the job finishes. It contains profiling
information (execution time) for prepare, synthesis, and all
analysis slices, as well as any subjobs built or files created.
job = urd.build('example')
print(job.post)
9.5. Creating Custom Status Messages
Most built-in job scripts and functions, such as csvimport
and json_load(), provide their own status messages to be displayed
when pressing CTRL+T (see Progress/status reporting).
However, it is also possible to create custom status messages for your
scripts using the status context manager. Here’s an example:
from accelerator import status
...
def synthesis():
    with status('reading huge file') as s:
        jobs.source.load('bigfile')
And here’s another example that updates the status message to reflect the progress of the program
from accelerator import status
...
def analysis(sliceno):
    msg = "reached line %d already!"
    with status(msg % (0,)) as update:
        for ix, data in enumerate(datasets.source.iterate(sliceno, 'data')):
            if ix % 1000000 == 0:
                update(msg % (ix,))
The most recently created status message will be printed to stdout when
CTRL+T is pressed.
9.6. Limiting Concurrency
By default, a job containing the analysis() function will be forked
into slices parallel processes, where slices is specified in
the Accelerator’s configuration file. For a well-written parallel
program, this can maximise the usage of the computer’s CPU resources.
On the other hand, there exist job scripts that use a large amount of
memory, perhaps scaling with the number of slices, so unless the
machine has plenty of RAM, these scripts may run out of memory. One
example of such a script is the standard job script dataset_sort,
which sorts datasets in parallel in all slices for maximum
performance. Reducing the number of slices globally in the
configuration file to handle the worst case is not the optimal
solution.
Instead, Exax implements a concurrency parameter that can operate
on a single build call or job script. It can be set either in the
build call, using the concurrency= parameter, or on the command
line as an option to the ax run command. In both cases, the limit
can apply to all job scripts or to a specific job
script only. While the default behaviour is to fork all
analysis()-processes in parallel, the concurrency parameter will
limit the number of forks and dispatch a new analysis()-process as
soon as a previous one has finished, until all slices are exhausted.
For example, if concurrency is set to a number, like this
ax run --concurrency=3 mybuild
or
urd.build('myscript', concurrency=3)
the number of parallel processes is limited, in this case to three, for all job scripts. Alternatively, concurrency can be specified for a single job script like this
ax run --concurrency="dataset_sort=3" mybuild
and all scripts except dataset_sort will run all slices in parallel,
while dataset_sort will run at most three slices at a time.
Note
A job is not aware of, and does not store, the concurrency setting. There is no way to tell afterwards if the job was created using full parallelisation or not.
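The dispatch behaviour described above, where at most N analysis() processes run at once and a new one starts as soon as a previous one finishes, resembles a bounded worker pool. The sketch below uses Python's concurrent.futures with threads purely as an illustration; Exax forks processes, and its internals are not shown here. The names run_slices, fake_analysis, and SLICES are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

SLICES = 8  # stands in for the configured number of slices

_lock = threading.Lock()
_running = 0
peak = 0


def fake_analysis(sliceno):
    """Pretend to work on one slice while recording peak parallelism."""
    global _running, peak
    with _lock:
        _running += 1
        peak = max(peak, _running)
    time.sleep(0.01)  # simulated work
    with _lock:
        _running -= 1
    return sliceno * sliceno


def run_slices(concurrency=None):
    """Run every slice, at most `concurrency` at a time (None = all)."""
    workers = concurrency or SLICES
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps results in slice order; a queued slice starts
        # as soon as a worker becomes free.
        return list(pool.map(fake_analysis, range(SLICES)))


results = run_slices(concurrency=3)
print(results)    # [0, 1, 4, 9, 16, 25, 36, 49]
print(peak <= 3)  # True
```

All eight slices are still processed and the results are complete; only the number of slices in flight at any moment is capped, which is what keeps peak memory use down.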
9.7. Automatic Slice-Data Merging: merge_auto
When all analysis() processes have finished, their results are
available to synthesis() using the analysis_res variable.
This variable is actually an iterator, presenting one analysis()
result at a time.
Merging these results has to be done with caution, since it is easy to get
it wrong. In order to save development as well as debugging time, the
analysis_res object has a member function merge_auto(), which
merges the data from all slices into one object. This section
describes in more depth how it works.
To start with, here is the basic setup
def analysis(sliceno):
    data = function(sliceno, ...)
    return data

def synthesis(analysis_res):
    total = analysis_res.merge_auto()
where total will contain the merged data from all slices.
The default operation of merge_auto() is as follows
lists are concatenated
    ([1, 2, 3], [4, 5, 6]).merge_auto() == [1, 2, 3, 4, 5, 6]
integers are added (this could for example be parallel counting of lines in a log file)
    (1, 2, 3).merge_auto() == 6
types with an update() member will be updated
    ({'lemon', 'apple'}, {'apple', 'pear'}).merge_auto() == {'lemon', 'apple', 'pear'}
    (Counter({'lemon': 3}), Counter({'lemon': 4})).merge_auto() == Counter({'lemon': 7})
each item in a tuple is merged independently
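The default rules can be mimicked with ordinary Python operations, which may help when reasoning about what a merge will return. The sketch below does not call merge_auto(); it only shows a plain-Python equivalent of each rule, using made-up per-slice results.

```python
from collections import Counter

# Lists are concatenated.
lists = [[1, 2, 3], [4, 5, 6]]
merged_list = []
for part in lists:
    merged_list.extend(part)
print(merged_list)  # [1, 2, 3, 4, 5, 6]

# Integers are added.
total_lines = sum([1, 2, 3])
print(total_lines)  # 6

# Types with an update() method are updated: sets form a union ...
merged_set = set()
for part in [{'lemon', 'apple'}, {'apple', 'pear'}]:
    merged_set.update(part)
print(sorted(merged_set))  # ['apple', 'lemon', 'pear']

# ... and Counters add up the counts per key.
merged_counter = Counter()
for part in [Counter({'lemon': 3}), Counter({'lemon': 4})]:
    merged_counter.update(part)
print(merged_counter)  # Counter({'lemon': 7})

# Each position of a tuple is merged independently.
tuples = [(1, ['a']), (2, ['b'])]
count = sum(t[0] for t in tuples)
names = [name for t in tuples for name in t[1]]
print((count, names))  # (3, ['a', 'b'])
```

Note that a Counter's update() adds counts, whereas a plain dict's update() replaces values; this difference is why duplicate dictionary keys need the special handling discussed next.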
If two sets contain the same element, this is considered okay by default: the item is a member of at least one of the input sets, as in the lemon, apple, pear example above.
But for dictionaries, the same key may have different values in different inputs, and the question is then what to do, as in this example
({'skywalker': 'luke'}, {'skywalker': 'anakin'}).merge_auto() == ?
To resolve the situation, merge_auto() takes an input argument
allow_overwrite that can be either of True, False, or
None.
True means that the last merged value will be output.
False will raise an error if the same key exists in multiple slices. This will also cause an exception when merging two sets or Counters that share at least one key.
None is in between. It will raise an exception if there are duplicate keys for dictionaries, but not for sets or Counters. This is the default.
Note that these rules apply to the bottom of the hierarchy only. For multi-level dictionaries, keys can always overlap on the higher levels. For example, consider per-slice results in which a top-level sold_items dictionary maps categories such as books to sets of items.
This is okay, because the bottom level is a set and the default is
None, meaning that duplicate set members are okay. The two levels
above, sold_items and books, are not part of the duplicate
check. The merge will produce sets of sold items while maintaining
the category hierarchy.
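The three allow_overwrite settings and the bottom-level rule can be illustrated with a small stand-alone function. This is a sketch written from the description in this section, not the merge_auto() source; the function merge_nested and the sample data are hypothetical, and Counters are left out for brevity.

```python
def merge_nested(parts, allow_overwrite=None):
    """Merge dicts of dicts whose leaves are sets or plain values."""
    result = {}
    for part in parts:
        for key, value in part.items():
            if key not in result:
                # Copy containers so the inputs are not modified.
                if isinstance(value, dict):
                    result[key] = merge_nested([value], allow_overwrite)
                elif isinstance(value, set):
                    result[key] = set(value)
                else:
                    result[key] = value
            elif isinstance(value, dict):
                # Higher levels may always overlap; recurse downwards.
                result[key] = merge_nested([result[key], value],
                                           allow_overwrite)
            elif isinstance(value, set):
                # Bottom-level sets: shared members fail only for False.
                if allow_overwrite is False and result[key] & value:
                    raise ValueError('duplicate set member under %r' % (key,))
                result[key] |= value
            else:
                # Bottom-level plain values: duplicates require True.
                if allow_overwrite is not True:
                    raise ValueError('duplicate key %r' % (key,))
                result[key] = value  # the last merged value wins
    return result


slice0 = {'sold_items': {'books': {'Dune', 'Emma'}}}
slice1 = {'sold_items': {'books': {'Emma', 'Ulysses'}}}

merged = merge_nested([slice0, slice1])  # default: allow_overwrite=None
print(sorted(merged['sold_items']['books']))  # ['Dune', 'Emma', 'Ulysses']

# Conflicting plain values need allow_overwrite=True.
names = [{'skywalker': 'luke'}, {'skywalker': 'anakin'}]
print(merge_nested(names, allow_overwrite=True))  # {'skywalker': 'anakin'}
try:
    merge_nested(names)
except ValueError as e:
    print('error:', e)  # error: duplicate key 'skywalker'
```

With allow_overwrite=False, the first merge would raise as well in this sketch, because both slices contain 'Emma' in the bottom-level set.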
9.8. Forced Builds
In very rare circumstances a forced build may be wanted. A re-build can be forced like this
def main(urd):
    urd.build('jobscript', force_build=True)