16. Troubleshooting, Tricks and Tips

Some common problems and their solutions.

16.1. The exax server fails to start

There are two common reasons for this.

  • The first is that the configuration file is wrong, or that workdirs do not exist. Make sure that the workdir(s) specified in the configuration file also exist on the file system.

  • The second is that there is an error in one or more job scripts. When exax starts, it tries to import all job scripts. If a script cannot be imported because it contains an error, the server fails to start. Read the output carefully and fix the script.
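Both causes can be checked before starting the server. The following is a minimal sketch and not part of exax: the helper names find_missing_dirs and find_syntax_errors are made up for this example, and the path in the usage part is a placeholder. It only detects missing directories and syntax errors, so a script may still fail to import for other reasons, for example a missing module.

```python
import os


def find_missing_dirs(paths):
    """Return the paths that are not existing directories."""
    return [p for p in paths if not os.path.isdir(p)]


def find_syntax_errors(script_paths):
    """Return a {path: message} dict for scripts that fail to compile.

    Only syntax errors are detected; the scripts are never executed.
    """
    errors = {}
    for path in script_paths:
        with open(path, "rb") as fh:
            source = fh.read()
        try:
            compile(source, path, "exec")
        except SyntaxError as exc:
            errors[path] = "line %s: %s" % (exc.lineno, exc.msg)
    return errors


if __name__ == "__main__":
    # Placeholder path: replace with the workdirs listed in the
    # configuration file, and pass the project's job scripts below.
    print(find_missing_dirs(["/path/to/workdirs/dev"]))
    print(find_syntax_errors([]))
```

An empty list and an empty dict mean that neither check found a problem.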

16.2. Urd conflict Error

When doing urd.finish(), the program exits with an urd conflict error. The reason is that the program is trying to write a new urd item to an existing key/timestamp with contents that differ from what is already in the database. This is considered to be correct behaviour.

Solution proposals:

  • If the contents were exactly the same, there would be no error, so check whether the difference is intended.

  • If the contents should be updated, add update=True to the urd.begin() call.

  • Remove the entry by issuing an urd.truncate(timestamp) with a timestamp preceding the one to be written. This will erase all entries with timestamps greater than or equal to the one specified.

  • Check first if the item exists with urd.peek(), and avoid all processing for the item if it already exists.
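The rules above can be illustrated with a small in-memory model. This is a toy sketch only: ToyUrdList and UrdConflict are hypothetical names, nothing here talks to a real urd server, and the real database does more than this. It mirrors just the behaviour described in this section: identical contents are accepted, different contents raise a conflict unless update=True is given, and truncate erases every item from the given timestamp onwards.

```python
class UrdConflict(Exception):
    """Raised when an existing timestamp would get different contents."""


class ToyUrdList:
    """In-memory stand-in for one urd key. NOT the real urd database."""

    def __init__(self):
        self.items = {}  # timestamp -> contents

    def write(self, timestamp, contents, update=False):
        old = self.items.get(timestamp)
        if old is not None and old != contents and not update:
            raise UrdConflict(timestamp)
        self.items[timestamp] = contents

    def peek(self, timestamp):
        """Return the stored contents, or None if there is no such item."""
        return self.items.get(timestamp)

    def truncate(self, timestamp):
        """Erase all items with timestamps >= the given one."""
        # ISO date strings compare correctly as plain strings.
        for ts in [ts for ts in self.items if ts >= timestamp]:
            del self.items[ts]


db = ToyUrdList()
db.write("2023-01-01", ["job-0"])
db.write("2023-01-01", ["job-0"])        # same contents: no error
try:
    db.write("2023-01-01", ["job-1"])    # different contents: conflict
except UrdConflict as exc:
    print("conflict at", exc)
db.write("2023-01-01", ["job-1"], update=True)  # explicit update is allowed
if db.peek("2023-01-02") is None:        # skip processing if it exists
    db.write("2023-01-02", ["job-2"])
db.truncate("2023-01-01")                # erases both items
```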

16.3. The build creates a new job although it already exists

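To find out why the build happened, check the following:

  • Has the source code (and thereby the hash) of the job script changed?

  • Have any input parameters (options, datasets, jobs) changed?

  • Have workdirs been added or removed?

  • Does the job depend on something that was rebuilt?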

16.4. Remove items from the Urd database

The files in urd.db/ are human readable, so it is possible to edit them (or remove them), but this should not be necessary! Instead, use one of the following:

  • Add update=True to the urd.begin() call.

  • Use urd.truncate().

16.5. Connecting to a remote board server

If there are several board servers running on a machine, each occupies a port…

which port goes where

use sockets like this

16.6. What’s the thing with -LATEST?

The jobid ending with -LATEST is a pointer to the last built job in that work directory, not the last re-used job.

16.8. How to abort a running job

A running job can be aborted using the ax abort command. There is no need to restart the server.

16.9. Connect to a remote

Connecting to board or urd servers on a remote server is typically done using ssh and port forwarding. The server can listen on either a port or a socket. The benefit of using sockets is that they are unique for each running server, so that several users can work independently on the same server.

  • This is how to connect using a port:

    To set up a server with board listening on port 1234, enter this in accelerator.conf

    board listen: localhost:1234
    

    and connect using

    ssh -L 9999:localhost:1234 <remote_server>
    

    This forwards port 1234 on the server to your local machine’s port 9999, so pointing a browser to http://localhost:9999 should display the remote board.

  • And here’s how to set up using a socket:

    Enter this in accelerator.conf

    board listen: .socket.dir/board
    

    This will create a socket in the specified path below the project directory. Next, find the absolute path to the socket file (for example by issuing realpath .socket.dir/board), and create the ssh forwarding command like this

    ssh -L 9999:/path/to/.socket.dir/board <remote_server>
    

    Connect by pointing a browser to http://localhost:9999.

    Note

    In the first case, using ports, this assumes that the port is not already allocated. Similarly, if there are several users on the machine, each user needs a unique port number. This can get messy and difficult to maintain. It is better to use sockets, since each instance of the exax server can have its own socket file. It is still possible to have multiple users connecting to the _same_ socket using ssh.
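Whether a port is already allocated can be tested before it is used, both for the port entered in accelerator.conf on the server and for the local port given to ssh -L. The sketch below is not part of exax; is_port_free and forward_command are hypothetical helper names. It tries to bind the port on the machine where it runs and then prints the matching ssh command. Another process may still take the port between the check and the actual use, so treat the result as a hint.

```python
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def forward_command(local_port, target, remote_server):
    """Build the ssh command line that forwards local_port to target.

    target is either "host:port" or the absolute path of a socket file
    on the remote server.
    """
    return "ssh -L %d:%s %s" % (local_port, target, remote_server)


if __name__ == "__main__":
    local_port = 9999
    if is_port_free(local_port):
        print(forward_command(local_port, "localhost:1234", "<remote_server>"))
    else:
        print("local port %d is already in use, pick another one" % local_port)
```

For the socket case, pass the absolute socket path as target, for example forward_command(9999, "/path/to/.socket.dir/board", "<remote_server>").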

16.10. Setting up a remote urd or board server

By default, starting the exax server (ax server) will also start a board and an urd server. In some cases, it is better to have these run separately, for example

  • the board server can keep running while the exax server is taken down for maintenance

  • the urd server is shared between several users

  • To run a separate board server, remove the board entry from accelerator.conf:

    # board listen :1234
    

    And start the board server from the same project directory using

    ax board-server localhost:9999
    
  • To run a separate urd server, tell the exax server which urd server it should connect to in accelerator.conf

    urd: localhost:5555
    

    In this case, it assumes there is an urd server on the same machine on port 5555.

    To start the urd server, run from the project directory

    ax urd-server --listen localhost:5555
    

    Tip

    Use passwords to authenticate different urd users.

    Tip

    To have the urd server listen for _external_ connections, i.e. from exax servers running on _other_ machines, replace localhost by the IP address of the network interface that is used for the access, for example

    ax urd-server --listen 10.1.2.3:5555
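When the servers run separately, it helps to confirm that something is actually listening on the configured address before starting the exax server. The following is a minimal sketch using only the Python standard library; can_connect is a hypothetical helper name, and the address is the one from the urd example above. It only tests that a TCP connection can be opened, not that the listener really is an urd or board server.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Address taken from the urd example above; adjust to your own setup.
    print(can_connect("localhost", 5555))
```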