Monitor & Control Overview

 

The Monitor and Control (M&C) systems are in place to control the primary functions of the MWA remotely. These tasks are divided into broad areas:

  • Telescope configuration: The physical connectivity of the entire system, and a user interface for managing the contents as hardware is changed in the field. For each of the 256 physical tiles on the ground, we must store data about that tile, including its coordinates, beamformer ID and gain, cable type and length to the receiver, and which input on which receiver it is connected to, etc. We must also record how the data from each receiver are mapped to the correlator. The configuration is stored in tables in a PostgreSQL database. We store the configuration of the telescope at all times in the past, as well as the current configuration.

  • Telescope state: The state of the telescope is defined as the set of values for all software-selectable parameters that can be changed by the user for scientific or engineering test observations. The desired state of the telescope at all times (past and future) is the telescope schedule, stored in a set of database tables in the same PostgreSQL database. The status monitoring system records the actual (as opposed to desired) state of the telescope, at all past times. If there was an error setting the state of some specific element, the difference between the actual and desired states can be dealt with when analysing the recorded data.

  • Telescope telemetry: Sensor data, used for immediate response to faults (e.g., overheating) and archived for later use when performing in-depth data analysis (e.g., the effect of beamformer temperature on system gain). This includes data from thousands of sensors (temperature, voltage, current, humidity, etc.) distributed through the equipment. It also includes the tools to visualise this data, because captured data is useless unless it can be used.

In the above descriptions, and throughout the rest of this page, the term ‘configuration’ refers to physical layout and connectivity – what is plugged into where. The configuration database tables are updated manually using a web interface, to reflect changes on site (which happen very rarely). The term ‘state’ refers to the settings that can vary from one observation to the next (frequency, attenuation, pointing, data capture modes). Future state descriptions (‘observations’) are added to the schedule tables using a command line tool. A few seconds before the start of each observation, the new desired state is sent to all the hardware components by the ‘schedule controllers’ (programs which run continuously, monitoring the schedule tables).
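As an illustration of how a configuration that is known "at all times in the past" can be queried, here is a minimal sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL. The table and column names (tile_connection, valid_from, valid_to) and the integer timestamps are assumptions made for the example, not the actual MWA schema.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database with one tile that was re-cabled once."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE tile_connection (
               tile_id INTEGER,
               receiver_id INTEGER,
               receiver_input INTEGER,
               valid_from INTEGER,   -- hypothetical: seconds since some epoch
               valid_to INTEGER      -- NULL means 'still the current wiring'
           )"""
    )
    rows = [
        (11, 1, 3, 1000, 2000),  # tile 11 on receiver 1, input 3, until t=2000
        (11, 2, 5, 2000, None),  # then moved to receiver 2, input 5
    ]
    db.executemany("INSERT INTO tile_connection VALUES (?, ?, ?, ?, ?)", rows)
    return db


def connection_at(db, tile_id, when):
    """Return (receiver_id, receiver_input) for a tile at time 'when', or None."""
    cur = db.execute(
        """SELECT receiver_id, receiver_input
             FROM tile_connection
            WHERE tile_id = ?
              AND valid_from <= ?
              AND (valid_to IS NULL OR valid_to > ?)""",
        (tile_id, when, when),
    )
    return cur.fetchone()
```

Because each row carries its own validity interval, the same query answers both "what is connected now" and "what was connected during an observation years ago".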

 

 

Video: Andrew Williams, instrumentation engineer who developed the monitor and control system for the MWA, speaks at PyCon 2017 about the telescope's software.



M&C Design Philosophy

From the very beginning it was assumed that as a ‘large N small D’ instrument, the MWA would need to be able to function normally with some of its parts (dipoles, tiles, receivers) not working, and that ‘not working’ could mean anything from physically disconnected through to ‘not satisfying some observation-specific criteria’. It was also assumed that the MWA would operate as a fully automated telescope, 24 hours a day, without any human ‘operator’ to act on error messages and make real-time observing decisions. These underlying assumptions drove the M&C design, and led to these design goals:

  • That it should be possible to determine the configuration, desired state, and actual state of the telescope both at the current instant, and at any time in the past, from the M&C database.

  • That the M&C scheduling and control system should be able to reach every hardware and software setting in the array, so that the same scheduling system can be used for both science observations and engineering tests.

  • That the M&C system should be able to change state rapidly – ending the current observation and starting a new one with no more than 8 seconds of latency.

  • That the schedule controller would do its best to carry out observations even if some elements (receivers, tiles) were not communicating, or had been flagged as non-functional, rather than failing with an error. The default action should always be to record data if possible, but record any errors (e.g., differences between desired and actual state) so that the usefulness of the data can be determined later.

  • That the specific choice of which tiles (and dipoles) to include in a given scheduled observation should be instantiated as the observation is actually started, not days or weeks in advance when it’s added to the schedule database. For example, one observation might specify that it should only use tiles that are in some ‘perfectly healthy’ set, and another observation might request all of the tiles. Allocating which tiles were in which ‘tile set’ would be done by the operations team, or by some automated analysis.

  • That the M&C system should be able to mitigate the effects of hardware failures. For example, the schedule can specify which individual dipoles in each tile should be switched out of the final sum for a tile, and which tiles should be physically powered down during an observation to prevent emitted RFI in some hardware failure modes.
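Two of the goals above lend themselves to a short sketch: resolving a named tile set only when the observation starts, and recording differences between desired and actual state instead of failing. The function names, the dictionary layout and the setting names are hypothetical, chosen only to illustrate the idea.

```python
def resolve_tile_set(set_name, tile_sets):
    """Look up tile-set membership at the moment the observation starts.

    'tile_sets' maps a set name to the tiles currently in it, so any change
    made by the operations team after scheduling is picked up automatically.
    """
    return sorted(tile_sets[set_name])


def state_differences(desired, actual):
    """Return {setting: (desired, actual)} for every setting that did not take.

    Nothing is raised: the observation goes ahead and the differences are
    kept so the usefulness of the data can be judged later.
    """
    return {
        key: (want, actual.get(key))
        for key, want in desired.items()
        if actual.get(key) != want
    }
```

For example, an observation scheduled weeks ago against a set called "healthy" would use whatever tiles are in that set at start time, and a receiver that reported the wrong attenuation would show up as a single entry in the differences rather than as a failed observation.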

Another main consideration was that resources were limited, and the M&C software was being written by a handful of scientists, none able to work on it full-time. This led to the decision to aim towards the use of existing open-source tools where possible, and to make the M&C system a collection of small, inter-operating tools rather than a monolithic system.



M&C Implementation

Database

All of the configuration and state data for the MWA is stored in a PostgreSQL database hosted on-site, replicated in real time to an Amazon EC2 instance in the Amazon Web Services cloud. Operations that need low latency (from on-site, to control hardware during observing), or that write to the database, use connections to the on-site server. Most other access, such as web services for humans or code (ASVO, etc.) to retrieve metadata or search for observations, uses connections to the read-only replica in AWS, to avoid putting unnecessary load on the on-site server.
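The routing rule described above can be written down in a few lines. This is only a sketch: the two server labels are placeholders, not real host names.

```python
ONSITE = "onsite-primary"  # hypothetical label for the on-site read/write server
REPLICA = "aws-replica"    # hypothetical label for the read-only copy in AWS


def choose_database(needs_write=False, low_latency=False):
    """Pick which database server a client should connect to.

    Writes and latency-sensitive hardware control go to the on-site server;
    everything else is sent to the replica to keep load off the site.
    """
    if needs_write or low_latency:
        return ONSITE
    return REPLICA
```

A metadata search from a remote web service would call choose_database() with the defaults and be directed to the replica, while a schedule controller would pass low_latency=True.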



Schedule Controllers

There are three daemons (programs running continuously) that do all the work required to send new observation state information to all the hardware and software components of the MWA. Each of these schedule controllers operates the same way – a few seconds before the start of each new observation, the data describing the new telescope state is sent to each of the clients attached to that controller. Each client then waits until the instant the observation is due to start, then changes its hardware state to match the desired new state.

  • ObsController: Written in Java, this controller communicates with the original RRI receiver enclosures on site, each of which digitises the signals from 8 tiles. ObsController only checks that the client has received the new state data; it doesn’t wait to see if the hardware changes succeeded. Instead, status information from each of the receiver enclosures (hardware errors, etc) is sent over the network to a separate ‘Status Monitor’ daemon, which records the actual state in the M&C database.

  • PyController: Written in Python, this accepts ‘registrations’ over the network from software clients that wish to be informed about state changes for one or more tiles. It was written to communicate with the MWA Phase II long-baseline tiles connected using fibre instead of coaxial cable, but it was designed to be flexible enough to accommodate a wide range of client types, such as external instruments like the Engineering Development Array. Status information (hardware failures, etc) is returned directly from each client, and recorded in the M&C database.

  • RedisController: Also in Python, this uses Redis (an open-source in-memory caching package) as a message bus to distribute commands to the MWAX correlator hosts to manage the real time correlator, voltage capture, and the new real-time beamformer backend.
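The pattern shared by all three controllers (send the new state a few seconds early, let each client apply it at the start instant) can be sketched as follows. The class and function names, the lead time and the state dictionary are assumptions for illustration; none of them are taken from the real controllers.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    obs_id: int
    start: float   # start time, in seconds on whatever clock the system uses
    state: dict    # desired settings, e.g. {"channel": 120}


def due_for_dispatch(schedule, now, lead_seconds, already_sent):
    """Controller side: observations whose state should be sent out now.

    An observation is due once 'now' is within 'lead_seconds' of its start,
    provided it has not started yet and has not been sent already.
    """
    return [
        obs
        for obs in schedule
        if obs.obs_id not in already_sent
        and obs.start - lead_seconds <= now < obs.start
    ]


class Client:
    """Client side: hold the new state, apply it only at the start instant."""

    def __init__(self):
        self.pending = None
        self.state = {}

    def receive(self, obs):
        """Store the upcoming observation and acknowledge receipt."""
        self.pending = obs
        return True

    def tick(self, now):
        """Called repeatedly; switches state once the start time is reached."""
        if self.pending is not None and now >= self.pending.start:
            self.state = dict(self.pending.state)
            self.pending = None
```

Splitting the work this way means network delays only need to be shorter than the lead time; the actual state change is timed locally by each client.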



Receiver code on the MWA receiver control computers

The original MWA receiver hardware from the Raman Research Institute (RRI) is described in detail elsewhere (Tingay et al. 2013; Prabu et al. 2014). The receiver code runs on a single-board computer (SBC) that is now old, unsupported by the manufacturer, and not quite x86 compatible. This means that it’s limited to the existing operating system (Debian 6.0/squeeze, Linux 2.6.32 kernel, Python 2.6.6) and has limited processing resources. The hardware controlled by the SBC is attached using a mix of USB, I2C and pin-level digital IO connections. It is managed by eight small daemons, each carrying out a single task. Most are written in Python, but three (usb_control, gpio_control, and receiverClient) are written in C. For historical reasons, they communicate with each other through tables in a PostgreSQL server instance running on the SBC itself.

  • ‘magicstart’, handling remote control of powerup, initialisation (programming firmware, etc), and shutdown.

  • ‘usb_control’, handling all communication with the ‘digital crate’ that does all data processing. Once the receiver has been powered up and initialised, this is limited to changing the frequency selection (which 24 coarse channels are sent out over the fibre), and reading a full 256 channel integrated power spectrum from one of the 8 tiles every 8 seconds. These power spectra are written directly to the telemetry database.

  • ‘gpio_control’, which handles all digital IO (turning pins on and off) using the dedicated GPIO card. This includes sending new delay settings to all the tiles, turning power supplies on and off and sensing their state, and air conditioner control (compressor and fan state).

  • ‘i2c_control’, which handles all serial communications over the several I2C buses inside the receiver. This includes reading values from a few tens of temperature, current, and humidity sensors every few seconds, as well as allowing control of the programmable attenuators on each of the sixteen input channels (8 tiles in X and Y).

  • ‘thermal_loop’ handles temperature control inside the receiver. It uses internal temperatures (read by i2c_control) to manage the air conditioner compressor and fan state as required (using gpio_control). It also shuts the digital crate power down entirely if any of the temperature sensors exceed predefined maximum limits.

  • ‘watchdog’ is a process that periodically resets the hardware watchdog timer on the SBC board if all functions in the receiver are working normally (temperatures being read, thermal_loop managing the air conditioner, etc). If the watchdog process stops, or fails to reset the hardware timer for more than a few tens of seconds, the onboard computer will automatically reboot.

  • ‘receiverClient’ communicates with ObsController and gets the new state data for each observation. It uses i2c_control, gpio_control and usb_control to send new attenuation, pointing and frequency data to the hardware as required, at the exact time each new observation starts.

  • ‘sendStatus’ is the process that sends the internal receiver state – frequency, attenuation, pointing, and temperature – back to the StatusMonitor process on the main M&C server, where it is logged in the M&C database.
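The decision logic of the thermal_loop and watchdog daemons described above might look something like the sketch below. All thresholds, the hysteresis band and the function names are invented for the example, and fan control is left out for brevity; the real daemons read and write their values through the local database tables.

```python
def thermal_decision(temps_c, compressor_on, setpoint_c=25.0, band_c=2.0, max_c=60.0):
    """Decide crate power and compressor state from current temperatures.

    All numeric defaults are hypothetical. A band around the setpoint stops
    the compressor from switching on and off rapidly.
    """
    hottest = max(temps_c)
    if hottest > max_c:
        # Over the hard limit: cut power to the digital crate, keep cooling.
        return {"crate_power": False, "compressor": True}
    if hottest > setpoint_c + band_c:
        compressor = True
    elif hottest < setpoint_c - band_c:
        compressor = False
    else:
        compressor = compressor_on  # inside the band: leave it as it is
    return {"crate_power": True, "compressor": compressor}


def should_reset_watchdog(last_ok, now, max_age_s=30.0):
    """Return True only if every monitored function has reported recently.

    'last_ok' maps a function name to the time it last reported success.
    If this returns False the hardware timer is left to expire, and the
    single-board computer reboots.
    """
    return all(now - stamp <= max_age_s for stamp in last_ok.values())
```

Keeping the watchdog reset conditional on the other daemons means a hung thermal loop leads to a reboot, not to an unattended receiver with no temperature control.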

 

Newer receivers (NI, SHAO) have the above functionality implemented in firmware, and each communicates with a dedicated client process running on an M&C server. This client process accepts observation commands from RedisController and passes the desired hardware settings to the connected NI or SHAO receiver via HTTP REST calls.

 

Telemetry

The M&C system monitors over 62,000 quantities like temperature/voltage/current/power measurements, clock offsets, network travel times, fan speeds, etc. Some of this data is relevant to science and made available to be used during data reduction (beamformer temperatures at the start of each observation, full-spectrum power from each tile every few seconds), but most of it is only useful for fault detection and diagnostics (server fan speeds, free disk space, etc).

Most of the telemetry data is collected and stored using open-source packages (‘Icinga2’, ‘Graphite’), in a ‘round robin’ style format, where measurements are saved at frequent intervals for some days or weeks, but automatically averaged to longer intervals for a longer period, and automatically discarded after some time. For example, server telemetry (motherboard temperatures, fan speeds) is saved at 5-minute intervals for 90 days, then averaged to 1-hour intervals and saved for 10 years, then discarded. The data visualisation is done using Grafana, reading data directly from the Graphite database as well as PostgreSQL and other data sources. We have dozens of ‘dashboards’ of frequently used graph layouts, which are used for monitoring the health of the telescope.
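To make the server telemetry example concrete, the sketch below works out how many data points each archive holds and shows the averaging step. The retention string "5m:90d,1h:10y" follows the step:duration style used in Graphite retention settings, but the parser here is a simplified stand-in written for this example (it assumes a 365-day year), not Graphite's own code.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "y": 365 * 86400}


def parse_duration(text):
    """Convert a string such as '5m' or '90d' to seconds."""
    return int(text[:-1]) * UNIT_SECONDS[text[-1]]


def archive_points(retentions):
    """Return [(step_seconds, number_of_points), ...] for each archive."""
    result = []
    for part in retentions.split(","):
        step_text, keep_text = part.split(":")
        step = parse_duration(step_text)
        result.append((step, parse_duration(keep_text) // step))
    return result


def downsample(values, factor):
    """Average consecutive groups of 'factor' samples into one value each."""
    averaged = []
    for i in range(0, len(values), factor):
        group = values[i:i + factor]
        averaged.append(sum(group) / len(group))
    return averaged
```

With these definitions, archive_points("5m:90d,1h:10y") gives 25,920 five-minute points and 87,600 hourly points per quantity, and downsample(samples, 12) turns twelve 5-minute samples into one hourly average. The storage per quantity is fixed in advance, which is what makes it practical to keep tens of thousands of quantities.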

Telemetry data that is relevant to the actual data analysis (or may at some future time be relevant) is stored indefinitely in PostgreSQL tables on a separate server (not in the main M&C database).



Real-time telescope health monitoring

We use the open-source system Icinga2 (similar to Nagios) to monitor the telescope health in real time, including both traditional server farm diagnostics (disk space, rack temperatures, which processes are running, query response times, etc, on dozens of servers) as well as custom diagnostics (autocorrelation power for each tile, software correlator status, etc). Icinga2 has flexible web dashboard panels summarising the health (OK, WARNING, CRITICAL, or UNREACHABLE) of all the telescope components in a hierarchical structure. It generates emailed alerts to the operations team when there are problems, and takes action itself for time-critical faults (for example, powering down entire server racks if the temperature is too high).
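Custom diagnostics for Icinga2 are usually written as small check plugins that follow the Nagios plugin convention: the plugin reports exit code 0, 1, 2 or 3 for OK, WARNING, CRITICAL or UNKNOWN, plus a one-line message, optionally followed by performance data after a "|" character. Reachability is worked out by Icinga2 itself, not by the plugin. The sketch below shows the threshold logic for a rack temperature check; the function name and the threshold values are hypothetical.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # conventional plugin exit codes


def check_rack_temperature(temp_c, warn_c, crit_c):
    """Return (exit_code, status_line) for one temperature reading.

    'temp_c' is None when the sensor could not be read.
    """
    if temp_c is None:
        return UNKNOWN, "UNKNOWN - no temperature reading"
    # Performance data in label=value;warn;crit form, for graphing.
    perfdata = f"temp={temp_c};{warn_c};{crit_c}"
    if temp_c >= crit_c:
        return CRITICAL, f"CRITICAL - rack at {temp_c} C | {perfdata}"
    if temp_c >= warn_c:
        return WARNING, f"WARNING - rack at {temp_c} C | {perfdata}"
    return OK, f"OK - rack at {temp_c} C | {perfdata}"
```

A real plugin would print the status line and exit with the returned code. What happens on a CRITICAL result (an email, or powering down a rack) is configured in Icinga2 and kept out of the check itself.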



Web pages and web services

There are a number of human-readable web pages used for telescope diagnostics, as well as machine-readable web services used by remote clients to collect metadata about observations. Most of these are Django services (available from the M&C server on site, and also from another Django server in Amazon’s AWS cloud, using the local database mirror), but there are a few traditional CGI scripts as well as static HTML and image files updated automatically at regular intervals.