Table of Contents
Top-Level Architecture
...
The MWAX correlator employs the “FX” architecture where the input time samples for each signal path (antenna and polarization) are fine-channelized prior to cross-correlation, reducing the correlation process to complex multiplications in the frequency domain.
Whereas the legacy correlator utilizes a polyphase filterbank for the F-engine, the MWAX correlator employs the FFT, following the rationale presented in Appendix C. For the X-engine, development time/cost has been minimized by utilizing the existing open-source GPU correlator library “xGPU” (the same library at the heart of the legacy correlator). The standard xGPU library is used essentially unchanged for MWAX, with the exception of a small but crucial change to improve speed and reduce bus/memory traffic, found to be essential to making the design scalable to 256T and beyond. This modification is described in Appendix C.
The 24 coarse channels are processed by 24 GPU-accelerated compute nodes referred to as “MWAX Servers”, with two ‘hot spares’. Each MWAX Server implements the functions shown in the figure below. The real-time data flows on the MWAX Servers are managed through the use of input and output ring buffers that decouple its computational workflow from the input source and output destinations. These ring buffers are established and accessed using the open-source standard library “PSRDADA”.
The execution speed of MWAX depends on various parameter/configuration choices, most significantly the number of tiles to correlate and the GPU hardware employed. For example, sustained real-time 256T correlation has been benchmarked successfully on an NVIDIA RTX2080Ti-based GPU card.
...
Introduction
The MWAX system replaces the previous fine PFB, Voltage Capture System (VCS) (and media converter), correlator, and on-site archive of the Murchison Widefield Array (MWA). All of the fielded instrument hardware (tiles, beamformers, receivers) remains the same, as described in The Murchison Widefield Array: The SKA Low Frequency Precursor by Tingay et al. (2013), and the Phase II description paper: The Phase II Murchison Widefield Array: Design Overview by Wayth et al. (2018).
Info |
---|
The new MWAX Correlator system is described in detail in Morrison et al. (2023): https://arxiv.org/abs/2303.11557 |
The diagram below shows a high level overview of the complete signal chain, including the main MWAX components: Media conversion and Correlator & Online Archive.
...
Top-Level Architecture
MWAX is located on site at the MRO. It comprises 24 new servers (+ 2 spares) and repurposes 10 existing on-site servers. Together the equipment occupies three racks. Output visibilities are transferred to Curtin’s B206 data centre, and ultimately the Pawsey Centre, via existing fibre-optic links.
This configuration is capable of simultaneous fine channelisation, cross-correlation, frequency/time averaging, and buffer storage for up to 256 tiles and up to 24 coarse channels of 1.28 MHz each. At the time of writing, the MWA has 16 original receivers plus 2 next-generation National Instruments (NI) receivers, which process 144 tiles in total. This limits the inputs to MWAX such that it currently correlates only 144 tiles out of the deployed 256 tiles in the field.
The MWAX correlator design employs a multicast architecture, allowing each receiver to send its stream of high time resolution data to any number of multicast consumers with no additional load on the sender. Initially there will be 24 multicast receivers (one for each coarse channel, for correlation and voltage capture); however, the multicast architecture allows additional multicast receivers to use the same high time resolution data for other purposes. For example, RFI monitoring, transient detection and external instruments such as Breakthrough Listen could commensally consume some or all of the high time resolution data without impacting the operation of the telescope.
MWAX employs the “FX” correlation architecture where the input time samples for each signal path (antenna and polarisation) are fine-channelised prior to cross-correlation, reducing the correlation process to complex multiplications in the frequency domain. Whereas the legacy correlator utilised a polyphase filterbank for the F-engine, the MWAX correlator employs the FFT (using the cuFFT implementation). For the X-engine, development time/cost was minimised by utilising the existing open-source GPU correlator library “xGPU” (the same library at the heart of the legacy correlator, as well as others). The standard xGPU library is used essentially unchanged for MWAX, with the exception of a small but crucial change to improve speed and reduce bus/memory traffic, found to be essential to making the design scalable to 256T and beyond (see MWAX xGPU on github).
The 24 coarse channels are processed by 24 GPU-accelerated compute nodes referred to as “MWAX Servers”, with two hot spares. The real-time data flows on each MWAX Server are managed through input and output ring buffers that decouple its computational workflow from the input source and output destinations. These ring buffers are established and accessed using the open-source ring buffer library “PSRDADA”.
MWAX has been designed such that the exact same code used for real-time operation on the dedicated MWAX Servers at the MRO can also be installed on compute nodes with lower specification GPUs (e.g. Pawsey) to provide an offline mode, operating below real-time speed (although it is a requirement that the GPU support CUDA). Note: The modes available to the offline correlator depend heavily on the server and GPU hardware it is executed on.
The table below shows (in green) which output visibility modes will be able to operate in real-time with the proposed hardware configuration, assuming 256 tiles and 30.72 MHz of instantaneous bandwidth. The figures overlaid on each mode entry are the data rate of visibility data that will be generated in this mode in gigabits per second (Gbps). Modes in red may still be possible for short periods, subject to final hardware specifications and limitations. Note that the number of modes available to astronomers is significantly increased over the legacy correlator.
See: MWAX Correlator Modes (128T)
See: MWAX Offline Correlator
The output visibility modes of MWAX for 128, 136 and 144 tiles that run real-time on the as-built MWAX hardware configuration are listed at this page: MWAX Modes
Signal Path/Data Flow
In this section we describe the flow of signals from the tiles and receivers to the media converter (Medconv) servers, MWAX Correlator and then into Long Term Storage at the MWA Archive at the Pawsey Supercomputing Centre, illustrated in the diagram below.
...
MWAX Media Conversion
The fibre links from each of the MWA's 16 existing receivers in the field enter the control building. This proposal calls for them to be directly connected to existing “EDT” cards located in 8 servers (+ 2 hot spares) selected and re-tasked from the existing pool of 16 VCS servers that are currently used in the legacy correlator. These will become media converter “Medconv Servers”, responsible for taking the bespoke protocol from the receivers and converting it to the same data format we directly receive from NI FlexRIO and SNAP based receivers (both of which are potential future MWA receiver candidates).
The re-use of existing servers for this media conversion task is not a strict requirement. However, we have used this make and model of server successfully for years, both as an integral part of the legacy correlator and for testing and development for MWAX, and we understand this hardware well. The need for only 8+2 units frees up the remainder of the identically configured servers for use as ‘whole-of-life’ spares. These Medconv Servers would be decommissioned as the legacy receivers they support age and are eventually retired or replaced with new generation receivers.
Within the Medconv Servers, 2048 consecutive packets of bespoke format 5-bit receiver voltage data are received on an EDT FPGA card, copied to server memory, merged, padded into 8-bit format and then sorted by input signal chain and destination coarse channel number. They are then passed as 128 IP multicast packets out of a conventional 10 Gb Ethernet port into the local-area multicast voltage network. Each coarse channel forms a separate multicast stream. All voltage data packets are sent via aggregation switches and passed up to the existing Cisco 9504 switch, which acts as the core switch for the correlator. That is, all voltage data for all coarse channels is available on the fabric of this core switch.
The voltage data output from the Medconv Servers is compatible with the direct output from NI FlexRIO and SNAP based receivers in timing, packet format and coarse channel bandwidth. No further resampling or conversion is needed to cross-correlate a heterogeneous array of these three receiver types.
...
send 8 tiles' worth of 24 coarse channels over 48 fibre optic cables using the Xilinx RocketIO protocol. The fibre optic cables terminate in the MRO Control Building, where two bundles of three fibres connect to a media conversion (medconv) server via custom Xilinx FPGA cards (colloquially known as EDT cards, after Engineering Design Team, the company that manufactures them). Six independent edt2udp processes on each medconv server convert the RocketIO data into Ethernet UDP packets, which are sent out to the Cisco Nexus 9504 switch as multicast data where each coarse channel is assigned a multicast address. This provides the "corner-turn", in which each of the 6 processes on each medconv server sends one third of the coarse channels for one sixteenth of the tiles in the array.
Info |
---|
NOTE: The next-generation receivers (such as the NI receivers) do not need media conversion, and therefore they send data directly to the Cisco Nexus 9504 switch. |
MWAX Correlator
This section describes the data flow within the MWAX correlator servers. The function and data flow between components is shown in the below diagram:
...
MWAX UDP Capture + Voltage Capture To Disk
As is the case with IP multicast, any device on the voltage network can “join” the multicast for one or more coarse channels and a copy of the relevant stream will be passed to them at almost no cost to the Cisco Nexus 9504 switch.
For MWAX, an array of 24 (+2 spare) identically configured MWAX Servers is connected to the core Cisco Nexus 9504 switch (on site at the MRO) via 40 Gb Ethernet. Each MWAX Server is responsible for one coarse channel of bandwidth 1.28 MHz. For a 256T array, the input data volume is approximately 11 Gbps per coarse channel.
The mwax_u2s process receives packets from the multicast stream which are then assembled in shared memory (RAM) into 8 second blocks of high time resolution voltage data based on their time and source (known as a “sub-observation”). At the completion of each 8 second block, the RAM file is closed and made available to the mwax_subfile_distributor process.
Depending on the current observing mode, the block may be handled by the mwax_subfile_distributor process in the following way:
Retained in RAM for a period to satisfy triggered ‘buffer dump’ commands
Written immediately to disk for voltage capture mode
Passed to the FX Engine for cross-correlation via a PSRDADA ring buffer
A 256T sub-observation buffer for one coarse channel, covering 8 seconds, is approximately 11 GB in size. The proposed hardware can buffer approximately 2 minutes in its available memory.
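The data rate and buffer figures quoted above can be checked with a short back-of-envelope calculation. The sketch below assumes that each complex voltage sample occupies 2 bytes (8-bit real plus 8-bit imaginary), which is not stated explicitly in this section, and it ignores packet headers and the metadata block. The function names are illustrative only.

```python
# Back-of-envelope input data volume for one coarse channel.
# Assumption (not stated in the text): each complex sample is 2 bytes
# (8-bit real + 8-bit imaginary). Packet and metadata overheads are ignored.
BYTES_PER_SAMPLE = 2
SAMPLE_RATE_HZ = 1_280_000   # 64,000 samples per 50 ms block
SUBOBS_SECONDS = 8


def input_rate_gbps(n_tiles: int) -> float:
    """Payload data rate for one coarse channel, in gigabits per second."""
    signal_paths = n_tiles * 2  # two polarisations per tile
    bytes_per_second = signal_paths * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE
    return bytes_per_second * 8 / 1e9


def subobs_size_gb(n_tiles: int) -> float:
    """Size of one 8 second sub-observation, in gigabytes (10^9 bytes)."""
    return input_rate_gbps(n_tiles) * SUBOBS_SECONDS / 8


def subobs_in_buffer(buffer_seconds: int = 120) -> int:
    """Number of whole sub-observations held in a buffer of the given length."""
    return buffer_seconds // SUBOBS_SECONDS


if __name__ == "__main__":
    for tiles in (144, 256):
        print(tiles, round(input_rate_gbps(tiles), 2), "Gbps",
              round(subobs_size_gb(tiles), 2), "GB per sub-observation")
    print(subobs_in_buffer(), "sub-observations in a 2 minute buffer")
```

Under these assumptions a 256T coarse channel comes to roughly 10.5 Gbps and 10.5 GB per sub-observation, consistent with the rounded figures of approximately 11 Gbps and 11 GB, and a 2 minute buffer holds 15 sub-observations.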
MWAX Metadata and Fringe Tracking ‘Delay Engine’
...
MWAX Correlator FX Engine
The MWAX correlator FX Engine is implemented as a PSRDADA client; a single process that reads/writes from/to the input/output ring buffers, while working in a closely coupled manner with a single GPU device - which can be any standard NVIDIA/CUDA-based GPU card.
The FX Engine treats individual 8 second sub-observations as independent work units. Most of its mode settings are able to change on-the-fly from one sub-observation to the next. It operates on 50 ms units of input data, i.e. a total of 160 blocks over each 8 second sub-observation. An additional block of metadata (of the same size as a 50 ms data block) is prepended to the data blocks, making a total of 161 blocks per sub-observation. At the start of each new sub-observation, the metadata block is parsed to configure the operating parameters for the following 160 data blocks.
Each 50 ms data block consists of 64,000 time samples for each signal path presented at the input (number of tiles x 2 polarizations). The 256T correlator configuration supports up to 512 signal paths. Input data is transferred to GPU memory and promoted from 8-bit integers to 32-bit floats. The 64,000 time samples of each path are partitioned into 10 blocks of 6,400 samples (5 ms), each of which is FFT’d on the GPU using “cuFFT”, resulting in 10 time samples on each of 6,400 ultrafine channels of resolution 200 Hz.
Fractional delays are then applied to each signal path to point to a specified correlation pointing centre. The required delay values are generated within the Delay Engine and passed to the FX Engine via the prepended metadata block written to the input ring buffer. Delays for the start and end of every 50 ms data block are provided, from which the FX Engine linearly interpolates the delay to be applied to each 5 ms sub-block. Delays are applied by multiplying the frequency-domain samples of each sub-block by a phase gradient, whose complex gain values are taken from a pre-computed look-up table to increase speed.
The FFT’d data is then transposed to place it in the order that xGPU requires (slowest-to-fastest changing): [time][channel][tile][polarization]. As the data is re-ordered, it is written directly into xGPU’s input holding buffer in GPU memory. The data from five 50 ms blocks is aggregated in this buffer, corresponding to an xGPU “gulp size” of 250 ms. The minimum integration time is one gulp, i.e. 250 ms. The integration time can be any multiple of 250 ms for which there are an integer number of gulps over the full 8 second sub-observation.
xGPU places the computed visibilities for each baseline, with 200 Hz resolution (6,400 channels), in GPU memory. A GPU function then performs channel averaging according to the “fscrunch” factor specified in the metadata block, reducing the number of output channels to (6400/fscrunch), each of width (200*fscrunch) Hz. During this averaging process, each visibility can have a multiplicative weight applied, based on a data occupancy metric that takes account of any input data blocks that were missing due to lost UDP packets or RFI excision (a potential future enhancement). The centre (DC) ultrafine channel is excluded when averaging and the centre output channel values are re-scaled accordingly. Note that only 200 Hz of bandwidth is lost in this process, rather than a complete output channel. The averaged output channel data is then transferred back to host memory.
xGPU utilizes data ordering that is optimized for execution speed. The visibility set is re-ordered into a more intuitive triangular order by the CPU: [time][baseline][channel][pol]. The re-ordered visibility sets (one per integration time) are then written to the output ring buffer.
Visibility Data Capture
The data capture process, running on each MWAX Server, reads visibility data off the output PSRDADA ring buffer and writes the data into FITS format. The data capture process breaks up large visibility sets into files of up to approximately 5 GB each, in order to optimize data transfer speeds while keeping the individual visibility file sizes manageable. The FITS files are written onto a separate partition on the MWAX Server disk storage.
Transfer to Curtin Data Centre Temporary Storage
An archiving process (mwax_mover), running on each MWAX Server, watches the directory to which the data capture software writes the output visibility FITS files, as well as the directory to which the voltage data is written, and adds these files to a queue.
mwax_mover iterates through each file in the queue, first creating a record in the MWA metadata database for the file and then attempting to transfer it to an MWAcache Server at the Curtin 206 data centre via the MRO-to-Perth 100 Gbps link. The transfer is performed using xrootd, a high performance, reliable and scalable file transfer protocol. In the event that the 100 Gbps link to the Curtin 206 data centre is down, or there are issues with the MWAcache Servers, the mwax_mover process will requeue the files and attempt to transfer them when the systems or link are back online.
There are ten MWAcache Servers (eight in use, with two spares) and each MWAcache Server receives three coarse channels (one each from three different MWAX Servers). The MWAcache Servers each run an instance of xrootd to facilitate reliable, high-speed transfer from the MWAX servers.
Transfer and Storage at Pawsey Long Term Archive
The mwax_mover instance running on each MWAcache Server at Curtin University will be configured to automatically forward all visibility and voltage files to the Pawsey Front-End (FE) Servers, where the data will be stored in Pawsey’s hierarchical storage management system (HSM). Once data is successfully transferred to Pawsey, it will be moved to a cache which will also be managed by mwax_mover. mwax_mover will cache up to a configurable high-water mark (e.g. 85% of volume capacity) and will remove the oldest data to remain under that threshold.
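The high-water mark behaviour described above can be illustrated with a minimal sketch. This is not the mwax_mover implementation: the function name, the tuple layout and the use of a modification time to decide which data is oldest are all assumptions made for illustration.

```python
def files_to_evict(files, capacity_bytes, high_water=0.85):
    """Pick the oldest files to delete until usage is at or below the mark.

    files: list of (name, size_bytes, mtime) tuples (hypothetical layout).
    Returns the names to remove, oldest first.
    """
    used = sum(size for _, size, _ in files)
    limit = high_water * capacity_bytes
    evict = []
    # Walk the cache from oldest to newest, stopping once under the threshold.
    for name, size, _ in sorted(files, key=lambda f: f[2]):
        if used <= limit:
            break
        evict.append(name)
        used -= size
    return evict


if __name__ == "__main__":
    cache = [("obs_a.fits", 40, 1), ("obs_b.fits", 40, 2), ("obs_c.sub", 15, 3)]
    print(files_to_evict(cache, capacity_bytes=100))
```

In the example the cache is 95% full against an 85% mark, so only the oldest file needs to be removed.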
The NGAS servers running at Pawsey will execute the commands needed to write the visibilities and voltages to Pawsey’s tape subsystem. The NGAS software then updates the MWA metadata database to tag the files as being safely archived at Pawsey.
In the event that there is an issue with the systems at Pawsey (e.g. monthly maintenance or any other system error), or the 100 Gbps link between the Curtin 206 data centre and Pawsey is down, mwax_mover will requeue the transfer and try again at a later time. Each MWAcache Server has approximately 250 TB of usable storage space to store visibilities and voltage data temporarily until it can be successfully archived.
Researchers will access correlated data using the existing MWA All-Sky Virtual Observatory data portal (MWA ASVO), where users will either process the raw visibilities themselves or they may opt to have the MWA ASVO convert the data to CASA measurement set or uvfits format (with or without applying calibration solutions). Science teams wishing to utilize the high time resolution voltage data will use the existing “voltdownload” utility to obtain the raw voltage files. In a future release, the MWA ASVO will be upgraded to make use of any data cached on the MWAcache servers in B206, bypassing the need to stage files from the Pawsey tape subsystem and allowing the data to be delivered to users with less delay.
MWAX Correlator FX Engine
The MWAX correlator FX Engine (mwax_db2correlate2db) is implemented as a PSRDADA client; a single process that reads/writes from/to the input/output ring buffers, while working in a closely coupled manner with a single GPU device - which can be any standard NVIDIA/CUDA-based GPU card.
The figure below shows the processing stages and data flows within the MWAX correlator FX Engine process.
...
The FX Engine treats individual 8 second sub-observations as independent work units. Most of its mode settings are able to change on-the-fly from one sub-observation to the next. Each 8 second sub-observation file contains 160 blocks of 50 ms of input data each. An additional block of metadata (of the same size as a 50 ms data block) is prepended to the data blocks, making a total of 161 blocks per sub-observation file. At the start of processing each new sub-observation file, the metadata block is parsed to configure the operating parameters for the following 160 data blocks.
Each 50 ms data block consists of 64,000 time samples for each signal path presented at the input (number of tiles x 2 polarisations). The 256T correlator configuration supports up to 512 signal paths. Input data is transferred to GPU memory and promoted from 8-bit integers to 32-bit floats. The 64,000 time samples of each path are partitioned into 10 blocks of 6,400 samples (5 ms), each of which is FFT’d on the GPU using “cuFFT”, resulting in 10 time samples on each of 6,400 ultrafine channels of resolution 200 Hz.
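The block structure and channelisation figures in the two paragraphs above follow from a few integer relationships. The sketch below derives them from the numbers quoted in the text; the function and key names are illustrative only.

```python
def fx_engine_geometry(subobs_ms=8000, block_ms=50,
                       samples_per_block=64_000, sub_blocks_per_block=10):
    """Derive the FX Engine block and channel layout from the quoted figures."""
    data_blocks = subobs_ms // block_ms                      # 50 ms data blocks
    fft_length = samples_per_block // sub_blocks_per_block   # samples per FFT
    sample_rate_hz = samples_per_block * 1000 // block_ms
    return {
        "data_blocks": data_blocks,
        "total_blocks": data_blocks + 1,  # plus the prepended metadata block
        "fft_length": fft_length,         # also the number of ultrafine channels
        "sub_block_ms": block_ms / sub_blocks_per_block,
        "sample_rate_hz": sample_rate_hz,
        "ultrafine_width_hz": sample_rate_hz / fft_length,
    }


if __name__ == "__main__":
    for key, value in fx_engine_geometry().items():
        print(key, value)
```

The derived sample rate of 1.28 MHz matches the coarse channel bandwidth, and dividing it by the 6,400-point FFT length gives the 200 Hz ultrafine channel width.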
The design supports the ability to apply delays to each signal path to point the telescope to a specified correlation pointing centre. This involves the integer sample component of the required delays being applied using whole-sample shifts as the sub-observation file is assembled (mwax_u2s), with the residual fractional delays being applied within the FX Engine. The required fractional delay values for each signal path are passed to the FX Engine via the prepended metadata block written to the input ring buffer. Delays are applied by multiplying the frequency-domain samples of each sub-block by a phase gradient, whose complex gain values are taken from a pre-computed look-up table to increase speed. Delay corrections can be static over the entire sub-observation (e.g. to eliminate fixed cable delays) or dynamic over the sub-observation (with a 5 ms resolution) to implement fringe stopping, i.e. to keep the correlation pointing centre on a fixed RA/Dec.
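As an illustration of the phase-gradient step, the sketch below multiplies a spectrum by a linear phase ramp for a given fractional delay. It is a simplified model, not the MWAX code: it computes the gains directly rather than from a look-up table, it assumes channels are ordered from lowest to highest frequency with the coarse-channel centre at index `n_chan // 2`, and the sign convention is an assumption. Any additional phase term that depends on the sky frequency is omitted.

```python
import cmath
import math


def phase_ramp(delay_s, n_chan=6400, chan_width_hz=200.0):
    """Per-channel complex gains for a fractional delay of delay_s seconds.

    Assumptions (not from the source): channel n_chan // 2 sits at the
    coarse-channel centre, and a positive delay uses a negative phase slope.
    """
    centre = n_chan // 2
    return [cmath.exp(-2j * math.pi * (k - centre) * chan_width_hz * delay_s)
            for k in range(n_chan)]


def apply_fractional_delay(spectrum, delay_s, chan_width_hz=200.0):
    """Multiply one sub-block's frequency-domain samples by the phase ramp."""
    ramp = phase_ramp(delay_s, len(spectrum), chan_width_hz)
    return [s * g for s, g in zip(spectrum, ramp)]


if __name__ == "__main__":
    # Tiny 8-channel example so the result is easy to inspect.
    corrected = apply_fractional_delay([1 + 0j] * 8, delay_s=1.25e-3)
    for value in corrected:
        print(f"{value.real:+.3f} {value.imag:+.3f}j")
```

Because each gain has unit magnitude, the correction changes only the phase of each channel, and the centre channel is left untouched.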
The FFT’d data is then transposed to place it in the order that xGPU requires (slowest-to-fastest changing): [time][channel][tile][polarization]. As the data is re-ordered, it is written directly into xGPU’s input holding buffer in GPU memory. The data from five 50 ms blocks is aggregated in this buffer, corresponding to an xGPU “gulp size” of 250 ms. The minimum integration time is one gulp, i.e. 250 ms. The integration time can be any multiple of 250 ms for which there are an integer number of gulps over the full 8 second sub-observation.
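The integration-time rule above can be enumerated directly. The sketch below reads the rule as requiring that a whole number of integrations fits into the 8 second sub-observation; that reading, and the function name, are assumptions.

```python
def valid_integration_times_ms(subobs_ms=8000, gulp_ms=250):
    """List integration times that are whole gulps and divide the sub-observation."""
    gulps_per_subobs = subobs_ms // gulp_ms
    return [n * gulp_ms
            for n in range(1, gulps_per_subobs + 1)
            if gulps_per_subobs % n == 0]


if __name__ == "__main__":
    print(valid_integration_times_ms())
```

With 32 gulps per sub-observation, the allowed values under this reading are 250 ms, 500 ms, 1 s, 2 s, 4 s and 8 s.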
Visibility Channelisation
xGPU places the computed visibilities for each baseline, with 200 Hz resolution (6,400 channels), in GPU memory. A GPU function then performs channel averaging according to the “fscrunch” factor specified in the PSRDADA header, reducing the number of output channels to (6400/fscrunch), each of width (200*fscrunch) Hz. For example, with fscrunch = 50, there will be 128 output visibility channels of 10 kHz each.
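The relationship between fscrunch and the output channelisation is simple enough to express as a helper. The sketch below assumes that fscrunch must divide the 6,400 ultrafine channels evenly, which the text implies but does not state; the function name is illustrative.

```python
def output_channelisation(fscrunch, n_ultrafine=6400, ultrafine_hz=200):
    """Return (number of output channels, output channel width in Hz)."""
    if fscrunch < 1 or n_ultrafine % fscrunch != 0:
        # Assumption: only factors that divide the ultrafine channel count are valid.
        raise ValueError("fscrunch must divide the number of ultrafine channels")
    return n_ultrafine // fscrunch, ultrafine_hz * fscrunch


if __name__ == "__main__":
    for factor in (1, 50, 200, 6400):
        print(factor, output_channelisation(factor))
```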
The output visibility channels are "centre symmetric" in the way their boundaries are aligned within the coarse channel bandwidth. The centre output fine channel is centred symmetrically on the centre of the coarse channel (as was the case with the legacy correlator/fine PFB). Remaining output channels extend above and below the centre channel symmetrically. In cases where there is an odd number of output channels across the coarse channel, there are full-width channels at the lowest and highest ends of the coarse channel. In cases where there is an even number of output channels across the coarse channel, there are half-width channels at the lowest and highest ends of the coarse channel. See: MWA Fine Channel Centre Frequencies
The channel averaging process involves the summation of the complex visibility values for all the ultrafine channels comprising each output fine channel. That is, each output visibility channel is formed by the sum of fscrunch complex values. Note that for the centre output fine channel, the centre (DC) ultrafine channel is excluded from the summation to remove any DC component present in the coarse channel, and the output value is re-scaled accordingly to maintain a consistent output magnitude with other channels. Only 200 Hz of bandwidth is lost in this process, rather than a complete output channel (as was the case with the legacy correlator where an entire 10 kHz was lost).
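The DC exclusion for the centre output channel can be sketched as follows. The re-scaling factor of fscrunch / (fscrunch - 1) is an assumption about what "re-scaled accordingly" means; the function name and argument layout are illustrative and do not reflect the GPU implementation.

```python
def sum_centre_channel(ultrafine, dc_index):
    """Sum the ultrafine values of the centre output channel, skipping DC.

    ultrafine: complex visibilities of the ultrafine channels in this output
    channel. dc_index: position of the DC ultrafine channel within that list.
    """
    fscrunch = len(ultrafine)
    if fscrunch < 2:
        raise ValueError("need at least two ultrafine channels to exclude DC")
    total = sum(v for i, v in enumerate(ultrafine) if i != dc_index)
    # Assumed re-scaling: make up for the one missing term in the sum.
    return total * fscrunch / (fscrunch - 1)


if __name__ == "__main__":
    # Flat spectrum with a large DC spike in the middle ultrafine channel.
    values = [1 + 0j, 1 + 0j, 1000 + 0j, 1 + 0j, 1 + 0j]
    print(sum_centre_channel(values, dc_index=2))
```

For a flat spectrum the result equals what an ordinary sum over all fscrunch values would give without the DC spike, which is the sense in which the output magnitude stays consistent with the other channels.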
During the channel averaging process, each visibility has a multiplicative weight applied to normalise the magnitude of output visibilities to a desired absolute scale. This normalisation factor is automatically adjusted according to the selected integration time and fscrunch factor such that output visibility values remain at a consistent magnitude. In future, this normalisation factor may optionally be combined with additional baseline-specific weighting factors based on the input data occupancy, i.e. taking account of any input data blocks that were missing due to lost UDP packets or pre-correlation RFI excision (a potential future enhancement). The weighting factors applied to each baseline (which are common to all fine channels of that baseline) are placed in the output ring buffer along with the visibility values themselves. This allows downstream processes to track what weightings were applied. At present no baseline-specific weightings are applied and the output weight values are all set to 1.0.
The averaged output channel data is transferred back to host memory where it is re-ordered before writing to the output ring buffer.
Visibility Re-ordering
xGPU utilizes a particular data ordering that is optimized for execution speed. The visibility set is re-ordered by the CPU into a more intuitive triangular order: [time][baseline][channel][polarization]. See: MWAX Visibility File Format
The re-ordered visibility sets (one per integration time) are then written to the output ring buffer.
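To give a feel for a triangular baseline order, the sketch below counts baselines and maps a tile pair to a position in a row-major upper triangle that includes autocorrelations. The specific ordering (0-0, 0-1, ..., 0-(N-1), 1-1, ...) is an assumption for illustration; the MWAX Visibility File Format page is the authority on the actual layout.

```python
def baseline_count(n_tiles):
    """Number of baselines including autocorrelations."""
    return n_tiles * (n_tiles + 1) // 2


def baseline_index(tile_a, tile_b, n_tiles):
    """Position of baseline (tile_a, tile_b) in an assumed row-major upper triangle."""
    if not 0 <= tile_a <= tile_b < n_tiles:
        raise ValueError("expected 0 <= tile_a <= tile_b < n_tiles")
    # Rows before tile_a hold n_tiles, n_tiles - 1, ... entries.
    row_start = tile_a * n_tiles - tile_a * (tile_a - 1) // 2
    return row_start + (tile_b - tile_a)


if __name__ == "__main__":
    print(baseline_count(128), baseline_count(144), baseline_count(256))
    print([baseline_index(a, b, 3) for a in range(3) for b in range(a, 3)])
```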
Visibility Data Capture
The data capture process (mwax_db2fits), running on each MWAX Server, reads visibility data off the output PSRDADA ring buffer and writes the data into FITS format. The data capture process breaks up large visibility sets into files of up to approximately 10 GB each, in order to optimize data transfer speeds while keeping the individual visibility file sizes manageable. The FITS files are written onto a separate partition on the MWAX server disk storage.
Transfer to Curtin Data Centre Cache Storage
Each MWAX server has enough disk storage for around 30 TB of visibilities plus 30 TB of voltage data, effectively replacing the need for a separate "Online Archive" cluster of servers as the legacy MWA had. In normal operating modes and schedule, this means the MWA can continue to observe for a week or two even if the link to Perth is offline: data will continue to be stored on disk until the link is online again, and will then begin transmission to Perth.
An archiving process (mwax_subfile_distributor), running on each MWAX Server, watches the directories to which the data capture software (mwax_db2fits) writes the output visibility FITS files, as well as the directory to which the voltage data is written, and adds these files to a queue.
mwax_subfile_distributor iterates through each file in the queue, computes a checksum, then creates a record in the MWA metadata database for each file. It then attempts to transfer the files to an mwacache server at the Curtin 206 data centre via the MRO-to-Perth 100 Gbps link. The transfer is performed using xrootd, a high performance, reliable and scalable file transfer protocol. In the event that the 100 Gbps link to the Curtin 206 data centre is down, or there are issues with the mwacache servers, the mwax_subfile_distributor process will requeue the files and attempt to transfer them when the systems or link are back online.
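The queue, checksum and requeue behaviour described above can be sketched in a few lines. This is a simplified model rather than the mwax_subfile_distributor code: the checksum algorithm is not named in the text, so MD5 is used purely for illustration, and the `transfer` callable stands in for the database insert and the xrootd copy.

```python
import hashlib
from collections import deque


def checksum_of(chunks):
    """Checksum over an iterable of byte chunks (MD5 chosen for illustration)."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def process_queue(queue, transfer):
    """Try each queued file once; files whose transfer fails are requeued.

    queue: deque of file names. transfer: hypothetical callable returning
    True on success and False on failure (e.g. link or cache server down).
    """
    sent = []
    for _ in range(len(queue)):
        name = queue.popleft()
        if transfer(name):
            sent.append(name)
        else:
            queue.append(name)  # retry on a later pass
    return sent


if __name__ == "__main__":
    pending = deque(["vis_000.fits", "volt_000.sub", "vis_001.fits"])
    done = process_queue(pending, lambda name: name != "volt_000.sub")
    print("sent:", done, "still queued:", list(pending))
```

Failed transfers stay in the queue, so a later pass picks them up once the link or the cache servers are back online.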
There are 10 mwacache servers (8 in use, with 2 spares) and each mwacache server receives 3 coarse channels (one each from 3 different MWAX servers). The mwacache servers each run an instance of xrootd server to facilitate reliable, high-speed transfer from the MWAX servers.
Transfer and Storage at Pawsey Long Term Archive
The mwacache_archiver instance running on each mwacache server at Curtin University is configured to automatically forward all visibility and voltage files to the Pawsey LTS (long term storage) systems: Acacia or Banksia. mwacache_archiver updates the MWA metadata database to tag the files as being safely archived at Pawsey.
In the event that there is an issue with the storage systems at Pawsey (e.g. monthly maintenance or any other system error), or the 100 Gbps link between the Curtin 206 data centre and Pawsey is down, mwacache_archiver will requeue the transfer and try again at a later time. Each mwacache server has approximately 250 TB of usable storage space to store visibilities and voltage data temporarily until it can be successfully archived.
Researchers will be able to access visibilities and voltages using the existing MWA All-Sky Virtual Observatory data portal (MWA ASVO). Users can either download and process the raw data themselves (especially in the case of voltage data) or opt to have the MWA ASVO convert the data to CASA measurement set or uvfits format, with optional calibration solutions, averaging, etc. For more information see: Data Access.