...
The 24 coarse channels are processed by 24 GPU-accelerated compute nodes referred to as “MWAX Servers”, with two ‘hot spares’. Each MWAX Server implements the functions shown in the figure below. The real-time data flows on the MWAX Servers are managed through the use of input and output ring buffers that decouple the computational workflow from the input source and output destinations. These ring buffers are established and accessed using the open-source ring buffer library “PSRDADA”. The execution speed of MWAX depends on various parameter/configuration choices, but most significantly on the number of tiles to correlate and the GPU hardware employed. For example, sustained real-time 256T correlation has been benchmarked successfully on an NVIDIA RTX2080Ti-based GPU card.
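The decoupling that the ring buffers provide can be illustrated with a minimal sketch. This is a conceptual illustration only, not the PSRDADA API; the class and method names are invented for the example. The key idea is a fixed set of pre-allocated blocks that the producer (e.g. UDP capture) fills and the consumer (the correlator pipeline) drains at its own pace:

```python
class RingBuffer:
    """Conceptual sketch of a PSRDADA-style ring buffer: a fixed number of
    pre-allocated blocks shared between one writer and one reader, so that
    the producer and consumer are decoupled. Not the PSRDADA API."""

    def __init__(self, n_blocks, block_size):
        self.blocks = [bytearray(block_size) for _ in range(n_blocks)]
        self.n = n_blocks
        self.write_idx = 0   # next block the producer will fill
        self.read_idx = 0    # next block the consumer will drain
        self.filled = 0      # blocks currently holding unread data

    def write(self, data):
        if self.filled == self.n:
            # In a real-time system this means the consumer fell behind.
            raise BufferError("ring full")
        self.blocks[self.write_idx][:len(data)] = data
        self.write_idx = (self.write_idx + 1) % self.n
        self.filled += 1

    def read(self):
        if self.filled == 0:
            return None          # nothing to process yet
        block = bytes(self.blocks[self.read_idx])
        self.read_idx = (self.read_idx + 1) % self.n
        self.filled -= 1
        return block


rb = RingBuffer(n_blocks=4, block_size=8)
rb.write(b"voltages")
print(rb.read())  # b'voltages'
```

Because the buffer is a fixed-size ring, a stalled consumer surfaces as a "ring full" condition rather than unbounded memory growth, which is the behaviour wanted in a real-time pipeline.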
MWAX has been designed such that the exact same code used for real-time operation on the dedicated MWAX Servers at the MRO can also be installed on compute nodes with lower-specification GPUs (e.g. at Pawsey) to provide an offline mode, operating below real-time speed. Note: the modes available to the offline correlator depend heavily on the server and GPU hardware on which it is executed. See: MWAX Offline Correlator
The output visibility modes of MWAX for 128 tiles that run in real time on the as-built MWAX hardware configuration are listed at this page: MWAX Correlator Modes (128T)
...
As per standard IP multicast, any device on the voltage network can “join” the multicast for one or more coarse channels and a copy of the relevant stream will be passed to them.
For MWAX, an array of 24 (+2 spare) identically configured MWAX Servers is connected to the core Cisco 9504 switch via 40Gb Ethernet. Each MWAX Server is responsible for one coarse channel (1.28 MHz) of bandwidth. For a 256T array, the data volume is approximately 11 Gbps per coarse channel.
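The ~11 Gbps figure can be reproduced from the sample format. A minimal sketch, assuming 8-bit real + 8-bit imaginary complex voltage samples, dual polarisation, and a critically sampled 1.28 MHz coarse channel (packet and protocol overheads ignored):

```python
tiles = 256
pols = 2
sample_rate_hz = 1.28e6           # one coarse channel, complex samples
bits_per_complex_sample = 16      # 8-bit real + 8-bit imaginary (assumed)

bits_per_second = tiles * pols * sample_rate_hz * bits_per_complex_sample
gbps = bits_per_second / 1e9
print(f"{gbps:.1f} Gbps per coarse channel")  # ~10.5 Gbps payload, ≈11 Gbps with overheads
```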
...
A 256T sub-observation buffer for one coarse channel, covering the full 8 seconds, is approximately 11 GB in size. The proposed hardware can buffer approximately 2 minutes in its available memory.
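The memory implied by a 2 minute buffer follows directly from the figures above (illustrative arithmetic only; the actual server memory budget may differ):

```python
subobs_seconds = 8
subobs_gb = 11           # 256T, one coarse channel, per the text above
buffer_seconds = 120     # ~2 minutes of buffering

n_subobs = buffer_seconds // subobs_seconds
ram_needed_gb = n_subobs * subobs_gb
print(n_subobs, ram_needed_gb)  # 15 sub-observations, ~165 GB of buffer memory
```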
MWAX Correlator FX Engine
...
The FFT’d data is then transposed to place it in the order that xGPU requires (slowest-to-fastest changing): [time][channel][tile][polarization]. As the data is re-ordered, it is written directly into xGPU’s input holding buffer in GPU memory. The data from five 50 ms blocks is aggregated in this buffer, corresponding to an xGPU “gulp size” of 250 ms. The minimum integration time is one gulp, i.e. 250 ms. The integration time can be any multiple of 250 ms for which there is an integer number of gulps over the full 8 second sub-observation.
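The set of valid integration times can be enumerated from that rule: multiples of the 250 ms gulp that divide the 8 second sub-observation exactly.

```python
GULP_MS = 250        # xGPU gulp size
SUBOBS_MS = 8000     # sub-observation length

# Integration times: multiples of one gulp that divide 8 s evenly.
valid = [t for t in range(GULP_MS, SUBOBS_MS + 1, GULP_MS) if SUBOBS_MS % t == 0]
print(valid)  # [250, 500, 1000, 2000, 4000, 8000]
```

Note that, for example, 750 ms is a multiple of the gulp size but is not valid, because 8000 ms is not an integer number of 750 ms integrations.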
Visibility Channelisation
xGPU places the computed visibilities for each baseline, with 200 Hz resolution (6,400 channels), in GPU memory. A GPU function then performs channel averaging according to the “fscrunch” factor specified in the PSRDADA header, reducing the number of output channels to (6400/fscrunch), each of width (200*fscrunch) Hz. During this averaging, each visibility can have a multiplicative weight applied; a potential future enhancement is to base this weight on a data occupancy metric that accounts for any input data blocks missing due to lost UDP packets or RFI excision. The centre (DC) ultrafine channel is excluded when averaging and the centre output channel values are re-scaled accordingly. Note that only 200 Hz of bandwidth is lost in this process, rather than a complete output channel (10 kHz in the legacy correlator). The averaged output channel data is then transferred back to host memory.
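The averaging scheme can be sketched as follows. This is an illustration of the arithmetic, not the MWAX GPU kernel; it assumes the ultrafine channels are ordered so that the DC channel sits at the centre index, and it omits the per-visibility weighting:

```python
def fscrunch(vis, f):
    """Average groups of `f` 200 Hz ultrafine channels into output channels
    of width 200*f Hz. The DC ultrafine channel (assumed at index len(vis)//2)
    is excluded, and the DC-containing output channel is implicitly re-scaled
    by averaging over f-1 samples instead of f."""
    dc = len(vis) // 2
    out = []
    for start in range(0, len(vis), f):
        vals = [vis[i] for i in range(start, start + f) if i != dc]
        out.append(sum(vals) / len(vals))
    return out

chans = [1.0] * 6400
chans[3200] = 99.0            # junk in the DC bin: it must not leak through
out = fscrunch(chans, 4)
print(len(out), out[800])     # 1600 output channels; DC channel still 1.0
```

Only the single 200 Hz DC bin is discarded; the surrounding f-1 ultrafine channels still contribute to the centre output channel, which is why only 200 Hz of bandwidth is lost.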
The centre output fine channel is centred symmetrically on the centre of the coarse channel (as was the case with the legacy correlator/fine PFB). See: MWA Fine Channel Centre Frequencies
Visibility Re-ordering
xGPU utilizes data ordering that is optimized for execution speed. The visibility set is re-ordered into a more intuitive triangular order by the CPU: [time][baseline][channel][polarization]. The re-ordered visibility sets (one per integration time) are then written to the output ring buffer.
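A triangular baseline ordering like the one described can be indexed with a closed-form expression. The convention below (tiles i ≤ j, autocorrelations included, row-major over i) is one common choice and is shown for illustration; the exact MWAX ordering may differ:

```python
def baseline_index(i, j, n_tiles):
    """Index of baseline (i, j), with i <= j, in an upper-triangular
    ordering that includes autocorrelations. Illustrative convention only."""
    assert 0 <= i <= j < n_tiles
    return i * n_tiles - i * (i - 1) // 2 + (j - i)

n = 4
order = [(i, j) for i in range(n) for j in range(i, n)]
# The formula reproduces the enumeration order exactly.
assert all(baseline_index(i, j, n) == k for k, (i, j) in enumerate(order))
print(len(order))  # n*(n+1)//2 = 10 baselines for 4 tiles
```

For 256 tiles this gives 256*257/2 = 32,896 baselines per integration.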
Visibility Data Capture
The data capture process, running on each MWAX Server, reads visibility data off the output PSRDADA ring buffer and writes the data into FITS format. The data capture process breaks up large visibility sets into files of up to approximately 10 GB each, in order to optimize data transfer speeds while keeping the individual visibility file sizes manageable. The FITS files are written onto a separate partition on the MWAX Server disk storage.
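The ~10 GB file-splitting can be pictured with some rough sizing arithmetic. All of the per-visibility details below are assumptions for illustration (complex float32 visibilities, 4 polarisation products, 128 × 10 kHz output channels); the real FITS layout adds headers and may use different parameters:

```python
n_tiles = 256
n_baselines = n_tiles * (n_tiles + 1) // 2   # includes autocorrelations
n_chans = 128                                # e.g. 10 kHz channels in 1.28 MHz (assumed)
n_pols = 4                                   # XX, XY, YX, YY (assumed)
bytes_per_vis = 8                            # complex float32 (assumed)

set_bytes = n_baselines * n_chans * n_pols * bytes_per_vis
sets_per_file = (10 * 10**9) // set_bytes    # integrations per ~10 GB file
print(set_bytes / 1e6, sets_per_file)        # ≈134.7 MB per visibility set, ~74 sets/file
```

Under these assumptions a single visibility set is of order 100 MB, so a ~10 GB file cap groups dozens of integrations per file.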
...
Transfer and Storage at Pawsey Long Term Archive
The mwax_mover instance running on each MWAcache Server at Curtin University is configured to automatically forward all visibility and voltage files to the Pawsey Front-End (FE) Servers, where the data are stored in Pawsey’s hierarchical storage management system (HSM). mwax_mover then updates the MWA metadata database to tag the files as safely archived at Pawsey.
...
Researchers can access visibilities and voltages using the existing MWA All-Sky Virtual Observatory data portal (MWA ASVO). Users either download and process the raw data themselves (especially in the case of voltage data), or opt to have the MWA ASVO convert the data to CASA measurement set or uvfits format (with or without applying calibration solutions), with optional averaging.
...