Getting started with Agilex 3 and floating-point DSP operations

I have been working with Xilinx (now AMD) FPGAs for years, so when the Agilex 3 FPGA and SoC C-Series Development Kit landed on my desk I knew the transition would involve some hard changes. I used Quartus in my university days, with a Terasic kit based on a Cyclone II and later a Cyclone IV, but the current version, 26.1, has almost nothing in common with the Quartus I remember. Quartus is not Vivado: the terminology differs, and even the project-file philosophy is different.

Regarding the hardware, one feature caught my attention immediately: the Agilex 3 fabric includes Variable Precision DSP blocks that support native IEEE 754 single-precision floating-point multiply-accumulate in hardware, no soft IP required. This is very interesting because Agilex 3 is the cost-effective tier of the Agilex family, yet it includes a kind of block that, on the AMD side, we find only in the Versal ACAP devices.

In this article I will walk through the whole journey: setting up Quartus Prime Pro on Linux, building a trivial LED-blink design to validate the board, and then jumping into a much more interesting project — a three-band Linkwitz-Riley FIR audio crossover that runs entirely in FP32, using the Pmod I2S2 for audio I/O. By the end you will have a clear picture of what the Agilex 3 native floating-point DSP can do for signal-processing workloads.

The board
Quartus Prime Pro — a Vivado user’s dictionary
Setting up Linux for USB Blaster III
Project 1 — LED blink
Native floating-point DSP on Agilex 3
Project 2 — FP32 audio crossover
Conclusions

The board

The development kit is the Agilex 3 FPGA and SoC C-Series Development Kit, built around the A3CW135BM16AE6S device. It features an Agilex 3 FPGA fabric with a dual-core Arm Cortex-A55 HPS (Hard Processor System) and LPDDR4 memory accessible through HPS EMIF. For this article I only use the FPGA fabric side, but having the HPS available opens the door for Linux-based control in a future post.

The board comes with a 100 MHz reference clock (hvio_pllrefclk on pin AJ27), two user LEDs, a pushbutton reset, and the feature that makes the I2S project possible: a Raspberry Pi HAT GPIO header, which shares some pins with a PMOD-compatible connector. That header gives us direct access to 3.3-V LVCMOS I/O pins, which is exactly what the Digilent Pmod I2S2 expects.

Quartus Prime Pro — a Vivado user’s dictionary

If you come from the Xilinx/AMD ecosystem, the first thing you will notice is that Quartus uses a different vocabulary for concepts that are essentially the same. Here is a quick translation table to save you the initial confusion:

Vivado concept	Quartus equivalent	Notes
`.xpr` project file	`.qpf` project file	Both store the project metadata
`.xdc` constraints	`.sdc` + pin TCL / `.qsf`	SDC handles timing; pin assignments go in the `.qsf` or a separate TCL script
Block Design (`.bd`)	Platform Designer (`.qsys`)	Graphical IP integration tool
IP Catalog	IP Catalog	Same idea, different IP library
Bitstream (`.bit`)	SRAM Object File (`.sof`)	Programming file for volatile configuration

The most important difference for day-to-day work is that Quartus separates timing constraints (SDC) from pin assignments. In Vivado, your .xdc file contains both create_clock and set_property PACKAGE_PIN commands. In Quartus, timing lives in the .sdc file while pin assignments end up in the .qsf settings file or in a dedicated TCL script that is sourced at project-creation time. This separation is actually cleaner once you get used to it.

Setting up Linux for USB Blaster III

The Agilex 3 dev kit enumerates as a USB Blaster III over USB. On a fresh Linux installation, the JTAG daemon (jtagd) will fail to access the device unless you add the proper udev rules. Running lsusb should show something like:

Bus 001 Device 014: ID 09fb:6022 Altera AG3C_SoC_DK

Create the file /etc/udev/rules.d/51-usbblaster.rules with rules for all USB Blaster variants:

# USB Blaster
SUBSYSTEM=="usb", ATTR{idVendor}=="09fb", ATTR{idProduct}=="6001", MODE="0666"
SUBSYSTEM=="usb", ATTR{idVendor}=="09fb", ATTR{idProduct}=="6002", MODE="0666"
SUBSYSTEM=="usb", ATTR{idVendor}=="09fb", ATTR{idProduct}=="6003", MODE="0666"

# USB Blaster II
SUBSYSTEM=="usb", ATTR{idVendor}=="09fb", ATTR{idProduct}=="6010", MODE="0666"
SUBSYSTEM=="usb", ATTR{idVendor}=="09fb", ATTR{idProduct}=="6810", MODE="0666"

# USB Blaster III (Agilex 3 dev kit)
SUBSYSTEM=="usb", ATTR{idVendor}=="09fb", ATTR{idProduct}=="6020", MODE="0666"
SUBSYSTEM=="usb", ATTR{idVendor}=="09fb", ATTR{idProduct}=="6022", MODE="0666"
SUBSYSTEM=="usb", ATTR{idVendor}=="09fb", ATTR{idProduct}=="6024", MODE="0666"
SUBSYSTEM=="usb", ATTR{idVendor}=="09fb", ATTR{idProduct}=="6025", MODE="0666"
SUBSYSTEM=="usb", ATTR{idVendor}=="09fb", ATTR{idProduct}=="6026", MODE="0666"

Then reload and apply the rules:

sudo udevadm control --reload-rules
sudo udevadm trigger
sudo killall jtagd

After this, quartus_pgm and the Quartus GUI programmer should detect the device without root privileges.

Project 1 — LED blink

Before touching floating-point DSP, it makes sense to validate the most basic workflow: create a project, compile, program. The classic LED blinker does the job.

The project uses a TCL script that sets up the device, source files, and constraints in a single reproducible step. Here is the core of scripts/create_project.tcl:

package require ::quartus::project

set PROJECT_NAME  agilex3_base
set REVISION_NAME agilex3_base

if { [project_exists $PROJECT_NAME] } {
  project_open -revision $REVISION_NAME $PROJECT_NAME
} else {
  project_new -revision $REVISION_NAME $PROJECT_NAME
}

set_global_assignment -name FAMILY "Agilex 3"
set_global_assignment -name DEVICE A3CW135BM16AE6S
set_global_assignment -name TOP_LEVEL_ENTITY top
set_global_assignment -name PROJECT_OUTPUT_DIRECTORY output_files

set_global_assignment -name SYSTEMVERILOG_FILE rtl/top.sv
set_global_assignment -name VERILOG_FILE       rtl/led_counter.v

set_global_assignment -name SDC_FILE constraints/timing.sdc

To run this script from the terminal, we first need to add the Quartus executables to the PATH. Just as we source settings64.sh to put Vivado on the PATH, the Quartus installation folder provides an equivalent script.

source ~/Applications/altera_pro/26.1/qinit.sh

After sourcing this script, we have access to tools such as quartus_sh (the command-line shell), quartus (the GUI), and quartus_pgm (the programmer).

Running quartus_sh -t scripts/create_project.tcl creates the .qpf and .qsf files. From there, open the project in Quartus Prime Pro 26.1 and click on Start Compilation.

The top-level module instantiates a led_counter module that divides the 100 MHz clock down to a 100 ms tick and drives a 2-bit counter:

module top (
  input hvio_pllrefclk,
  input io96_3a_pb1_fpga_rst_n,
  output io96_3a_led0,
  output io96_3a_led1
);

  wire [1:0] led_count;

  led_counter u_led_counter (
    .aclk(hvio_pllrefclk),
    .resetn(io96_3a_pb1_fpga_rst_n),
    .led(led_count)
  );

  assign io96_3a_led0 = led_count[0];
  assign io96_3a_led1 = led_count[1];

endmodule
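The led_counter source is not listed here, but the arithmetic behind a 100 ms tick is easy to check. The following Python sketch only works out the numbers implied by the text (100 MHz clock, 100 ms tick, 2-bit LED counter); the variable names are illustrative and the counter width is what a straightforward divider would need, not necessarily what the actual module uses.

```python
# Back-of-the-envelope sizing for a 100 ms tick from a 100 MHz clock.
CLK_HZ = 100_000_000   # hvio_pllrefclk frequency quoted in the article
TICK_MS = 100          # tick period quoted in the article

# Clock cycles that elapse between two consecutive ticks.
cycles_per_tick = CLK_HZ * TICK_MS // 1000

# Width of a counter that counts 0 .. cycles_per_tick - 1.
counter_bits = (cycles_per_tick - 1).bit_length()

# A 2-bit LED counter goes through four states before repeating.
led_pattern_period_ms = 4 * TICK_MS

print(f"cycles per tick      : {cycles_per_tick}")
print(f"divider counter width: {counter_bits} bits")
print(f"LED pattern repeats  : every {led_pattern_period_ms} ms")
```

In other words, a plain divider needs a 24-bit counter, and the full LED pattern repeats every 400 ms.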

The pin assignments for this project are straightforward:

Signal	Pin	IO Standard	Description
`hvio_pllrefclk`	AJ27	3.3-V LVCMOS	100 MHz reference clock
`io96_3a_pb1_fpga_rst_n`	M1	1.1 V	Active-low reset (pushbutton PB1)
`io96_3a_led0`	K1	1.1 V	LED0 — counter bit 0
`io96_3a_led1`	L2	1.1 V	LED1 — counter bit 1

After compilation, the bitstream lands in output_files/. To program from the command line:

quartus_pgm -m jtag -o "p;output_files/agilex3_base.sof"

If the LEDs start cycling through the four binary states, the toolchain is working end to end and it is time to move on to something more interesting.

Native floating-point DSP on Agilex 3

Most FPGAs handle DSP workloads in fixed-point: you carefully choose your Q format, manage bit-widths at every stage, and pray the rounding errors stay acceptable. The Agilex 3 family offers an alternative. The Variable Precision DSP blocks in the fabric include a native IEEE 754 single-precision (FP32) mode that performs multiply-accumulate operations in hardware.

The specific IP used in this project is the Native Floating Point DSP Agilex FPGA IP, which maps to the tennm_fp_mac primitive. It is important to select the correct IP from the catalog — it lives under DSP → Primitive DSP, not under the soft Floating Point Functions:

Quartus IP Catalog showing the Native Floating Point DSP Agilex FPGA IP selected under DSP → Primitive DSP

The IP Parameter Editor shows the configuration used in this project: fp32_mult_add operation mode, with chain-in and chain-out disabled:

Native Floating Point DSP IP Parameter Editor configured in fp32_mult_add mode

This is not the soft Floating Point Functions FPGA IP that synthesises FP operations from LUTs; it is a hard silicon path inside the DSP block. The difference matters: the native IP runs at full clock speed with deterministic latency and consumes no fabric logic for the arithmetic itself.

The internal architecture of the DSP block in multiply-add mode is shown in the following diagram from the Altera documentation. You can see the input register bank, the FP32 multiplier with its pipeline stages, the adder, and the output register — these are the five pipeline stages that give the 5-cycle latency:

Multiply-add mode block diagram of the Variable Precision DSP block

In fp32_mult_add mode, the IP computes:

\[\text{result} = a \times b + c\]

Where \(a\), \(b\), and \(c\) are all IEEE 754 FP32 values. With all pipeline registers enabled (ena = 3'b111), the latency is 5 clock cycles: input register, multiplier pipeline stage 1, multiplier pipeline stage 2, adder input register, and output register. This predictable latency makes it straightforward to build time-shared architectures around it.
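To get a feel for what the block computes, it helps to have a small software reference model. The Python sketch below rounds every intermediate value to FP32 using the struct module. It is only a model under an assumption: that the product is rounded to FP32 before the addition, which may not match the exact internal rounding of the hard block. The helper names to_fp32, fp32_bits and fp32_mult_add are mine, not part of any Altera tool.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest IEEE 754 FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fp32_bits(x: float) -> int:
    """Return the 32-bit pattern of x encoded as IEEE 754 FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def fp32_mult_add(a: float, b: float, c: float) -> float:
    """Software model of result = a * b + c in FP32.

    Assumption: the product is rounded to FP32 before the add.
    """
    product = to_fp32(to_fp32(a) * to_fp32(b))
    return to_fp32(product + to_fp32(c))


result = fp32_mult_add(1.5, 2.0, 0.25)
print(result, f"0x{fp32_bits(result):08X}")
```

A model like this is handy in a testbench: feed the same bit patterns to the simulation and to the model, then compare the results.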

Project 2 — FP32 audio crossover

With the toolchain validated, the second project puts the native FP DSP blocks to real use. The goal is to build a three-band Linkwitz-Riley order-3 (LR3) FIR audio crossover that splits a stereo audio stream into bass, mid, and high frequency bands, all processed in FP32. Audio I/O is handled by a Digilent Pmod I2S2 module connected to the PMOD compatible connector on the dev board.

Linkwitz-Riley crossover basics

A Linkwitz-Riley crossover is a standard technique in professional audio to split a signal into frequency bands without introducing amplitude dips or peaks at the crossover frequency. An LR crossover of order \(N\) is formed by cascading two Butterworth filters of order \(N/2\) (for even orders), or by using an equivalent design that produces the characteristic flat magnitude sum.

This design approximates LR3 behaviour using linear-phase FIR filters with 63 taps and a Kaiser window (\(\beta = 8.6\)). Three separate filter bands cover the audio spectrum:

Band	Pass region	Taps	FP32 ROM entries
BASS	0 – 250 Hz	63	`BASS[0..62]`
MID	250 Hz – 3 kHz	63	`MID[0..62]`
HIGH	3 kHz – 24 kHz	63	`HIGH[0..62]`

The advantage of FIR filters here is that they preserve linear phase, which means all three bands stay time-aligned at the output. With the LR magnitude response, the three bands sum flat across the spectrum — exactly what you want in a crossover.

System architecture

The following diagram shows the complete signal path, highlighting the boundary between the 24-bit fixed-point I/O stages and the single-precision floating-point processing core:

Block diagram of the audio crossover project showing fixed-point and floating-point domains

Audio arrives from the CS5343 ADC on the Pmod I2S2 as a 24-bit I2S stream. The i2s_rx module deserialises it into left and right 24-bit samples. These are converted to FP32, fed through independent left/right crossover instances, mixed with per-band gain control, converted back to 24-bit fixed-point, and serialised out to the CS4344 DAC.

The module hierarchy reflects this signal flow:

top.sv
├── mclk_gen.v            — generates MCLK, BCLK, LRCK from 100 MHz
├── i2s_rx.v              — receives 24-bit L+R samples from CS5343
├── fixed24_to_fp32.v     — converts signed 24-bit to IEEE 754 FP32
├── lr3_crossover.v       — 3-band FIR crossover (×2: left + right)
│   ├── fir_fp32.v        — 63-tap time-shared FP32 FIR (×3 per channel)
│   │   └── native_fp     — Agilex Native FP DSP IP (×3 per channel)
│   └── coeff_rom.v       — band-selected coefficient ROM (×3 per channel)
├── band_mixer.v          — per-band FP32 gain + sum (×2: left + right)
├── fp32_to_fixed24.v     — converts FP32 back to signed 24-bit
└── i2s_tx.v              — serialises 24-bit L+R to CS4344
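The two converter modules sit at the boundary between the fixed-point and floating-point domains. The Python sketch below models that boundary under one assumption that the text does not confirm: that full-scale 24-bit audio is normalised to the range [-1.0, 1.0) by dividing by 2^23. The function names mirror the module names only for readability. A useful property to keep in mind is that FP32 has a 24-bit significand, so every signed 24-bit sample is represented exactly.

```python
import struct

SCALE = 1 << 23  # assumption: full scale maps to [-1.0, 1.0)


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fixed24_to_fp32(sample: int) -> float:
    """Convert a signed 24-bit sample to a normalised FP32 value."""
    if not -SCALE <= sample < SCALE:
        raise ValueError("sample does not fit in signed 24 bits")
    return to_fp32(sample / SCALE)


def fp32_to_fixed24(value: float) -> int:
    """Convert a normalised FP32 value back to signed 24-bit.

    Rounds to nearest and saturates; NaN and infinity are not handled.
    """
    scaled = round(to_fp32(value) * SCALE)
    return max(-SCALE, min(SCALE - 1, scaled))


# Every 24-bit code survives the round trip unchanged.
for code in (-SCALE, -1, 0, 1, 0x400000, SCALE - 1):
    assert fp32_to_fixed24(fixed24_to_fp32(code)) == code

print(fixed24_to_fp32(0x400000))   # half of full scale
print(fp32_to_fixed24(1.5))        # saturates at the positive limit
```

Saturation on the way back matters in a crossover with per-band gain, because boosting a band can push the mix beyond full scale.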

FP32 FIR filter implementation

Each FIR filter instance (fir_fp32) is a direct-form structure with a 63-element delay line. Instead of instantiating 63 multiplier-accumulators, the design time-shares a single native_fp instance across all taps. The accumulator feedback loop works as follows: the FP32 result of the previous tap is stored in a 32-bit register (acc) and fed back as the fp32_adder_a input of the next multiply-accumulate. There is no DSP chain-in/chain-out between blocks; a single IP instance processes one tap every 6 clock cycles.

The core MAC instantiation in fir_fp32.v:

native_fp u_mac (
  .fp32_mult_a  (delay_line[tap_idx]),
  .fp32_mult_b  (coeff_data),
  .fp32_adder_a (acc),
  .clk          (aclk),
  .ena          (3'b111),
  .fp32_result  (mac_result)
);

The state machine walks through taps 0 to 62, waiting 5 cycles for the pipeline to produce each result, then capturing it into acc and advancing to the next tap. The total processing time per sample is 63 taps × 6 cycles = 378 clock cycles. At 100 MHz with a sample rate of ~48.8 kHz, there are approximately 2,048 clock cycles available per sample, so the filter uses about 18% of the budget. Three parallel filters (bass, mid, high) run simultaneously, each on its own native_fp instance, so the per-channel budget stays at 378 cycles.

if (phase == LATENCY[2:0]) begin
  acc <= mac_result;
  phase <= 3'd0;
  if (tap_idx == TAPS - 1) begin
    computing <= 1'b0;
    sample_out <= mac_result;
    out_valid <= 1'b1;
  end else begin
    tap_idx <= tap_idx + 6'd1;
  end
end else begin
  phase <= phase + 3'd1;
end
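The cycle budget quoted above is easy to double-check with a few lines of Python. This is just a back-of-the-envelope sketch: the constant names are mine, and the inputs are the figures from the text (63 taps, a 5-cycle MAC latency plus one capture cycle, and roughly 2,048 clock cycles per audio sample). It ignores any extra cycles the state machine may spend on start-up or output.

```python
CLK_HZ = 100_000_000        # system clock
TAPS = 63                   # FIR length
MAC_LATENCY = 5             # native_fp latency with all registers enabled
CYCLES_PER_TAP = MAC_LATENCY + 1   # the phase counter runs 0..5
CYCLES_PER_SAMPLE = 2048    # clock cycles between two audio samples

# Cycles one time-shared FIR needs to produce an output sample.
fir_cycles = TAPS * CYCLES_PER_TAP

# Sample rate implied by the cycle count.
sample_rate_hz = CLK_HZ / CYCLES_PER_SAMPLE

# Fraction of the sample period the FIR is busy.
utilisation = fir_cycles / CYCLES_PER_SAMPLE

# Longest filter a single time-shared MAC could handle at this rate.
max_taps = CYCLES_PER_SAMPLE // CYCLES_PER_TAP

print(f"FIR cycles per sample: {fir_cycles}")
print(f"sample rate          : {sample_rate_hz:.3f} Hz")
print(f"budget used          : {utilisation:.1%}")
print(f"max taps per MAC     : {max_taps}")
```

The headroom is considerable: with the same architecture, a single MAC could serve a filter several times longer before running out of cycles.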

The FIR coefficients are generated by a Python script that uses SciPy’s firwin function with a Kaiser window. The script outputs Verilog localparam hex literals for the BASS, MID, and HIGH bands, which are pasted directly into coeff_rom.v, so regenerating the coefficients only takes a single run of the script.
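As a rough illustration of what such a generator does, here is a standard-library-only sketch. It is not the script used for the project: instead of calling firwin, it builds Kaiser-windowed sinc low-pass filters by hand and derives the MID and HIGH bands by subtraction. The names lowpass, fp32_hex and FS_HZ, and the choice of 100 MHz / 2048 as the sample rate, are assumptions made for this example.

```python
import math
import struct

FS_HZ = 100e6 / 2048   # assumed sample rate (about 48.8 kHz)
TAPS = 63
BETA = 8.6
CENTER = (TAPS - 1) // 2


def bessel_i0(x: float) -> float:
    """Modified Bessel function of the first kind, order 0 (power series)."""
    term, total, k = 1.0, 1.0, 1
    while term > 1e-12 * total:
        term *= (x / (2 * k)) ** 2
        total += term
        k += 1
    return total


def kaiser(n_taps: int, beta: float) -> list:
    """Kaiser window of length n_taps."""
    m = n_taps - 1
    denom = bessel_i0(beta)
    return [
        bessel_i0(beta * math.sqrt(max(0.0, 1 - (2 * n / m - 1) ** 2))) / denom
        for n in range(n_taps)
    ]


def lowpass(cutoff_hz: float) -> list:
    """Kaiser-windowed sinc low-pass, normalised to unity gain at DC."""
    fc = cutoff_hz / FS_HZ
    win = kaiser(TAPS, BETA)
    taps = []
    for n in range(TAPS):
        t = n - CENTER
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        taps.append(ideal * win[n])
    gain = sum(taps)
    return [c / gain for c in taps]


def fp32_hex(x: float) -> str:
    """Format x as a Verilog 32-bit hex literal holding its FP32 encoding."""
    return "32'h{:08X}".format(struct.unpack("<I", struct.pack("<f", x))[0])


lp_250 = lowpass(250.0)
lp_3k = lowpass(3000.0)
delta = [1.0 if n == CENTER else 0.0 for n in range(TAPS)]

BASS = lp_250
MID = [a - b for a, b in zip(lp_3k, lp_250)]     # band between the cutoffs
HIGH = [d - a for d, a in zip(delta, lp_3k)]     # complement of the low-pass

for name, band in (("BASS", BASS), ("MID", MID), ("HIGH", HIGH)):
    print(f"// {name}[0..2]:", ", ".join(fp32_hex(c) for c in band[:3]))
```

Because the MID and HIGH bands are built by subtraction, the three coefficient sets add up to a pure 31-sample delay, which is one simple way to guarantee a flat summed response before the coefficients are rounded to FP32.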

Band mixer and gain control

After the crossover, a band_mixer module applies an individual FP32 gain to each band and sums them:

\[\text{mix} = \text{bass} \times G_{\text{bass}} + \text{mid} \times G_{\text{mid}} + \text{high} \times G_{\text{high}}\]

I2S audio interface and clocking

The Pmod I2S2 contains a Cirrus Logic CS5343 ADC and CS4344 DAC, both of which expect I2S timing signals from the FPGA. The mclk_gen module divides the 100 MHz system clock to produce the required audio clocks:

Clock	Division (from 100 MHz)	Frequency
MCLK	÷8	12.5 MHz
BCLK	÷32	3.125 MHz
LRCK	÷2048	~48.8 kHz
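The relationships between the three audio clocks matter more than their absolute values. The short Python sketch below derives them purely from the frequencies listed in the table and from the 100 MHz system clock; the variable names are illustrative. Whether a given ratio is acceptable for the CS5343 and CS4344 should be checked against their datasheets.

```python
CLK_HZ = 100_000_000
MCLK_HZ = 12_500_000
BCLK_HZ = 3_125_000
LRCK_PERIOD_CYCLES = 2048   # system clock cycles per audio sample

# Period of each clock, expressed in 100 MHz cycles.
mclk_period_cycles = CLK_HZ // MCLK_HZ
bclk_period_cycles = CLK_HZ // BCLK_HZ

# Ratios the audio converters care about.
mclk_per_lrck = LRCK_PERIOD_CYCLES // mclk_period_cycles
bclk_per_lrck = LRCK_PERIOD_CYCLES // bclk_period_cycles

# Each LRCK period carries one left and one right slot.
bits_per_channel_slot = bclk_per_lrck // 2

lrck_hz = CLK_HZ / LRCK_PERIOD_CYCLES

print(f"MCLK period: {mclk_period_cycles} cycles, BCLK period: {bclk_period_cycles} cycles")
print(f"LRCK = {lrck_hz:.3f} Hz, MCLK/LRCK = {mclk_per_lrck}, BCLK/LRCK = {bclk_per_lrck}")
print(f"bits per channel slot: {bits_per_channel_slot}")
```

With 64 bit clocks per frame, each channel gets a 32-bit slot, which leaves room for the 24-bit sample plus padding.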

These clocks are output through the PMOD compatible connector. The Pmod I2S2 module is wired to the header as follows:

HAT GPIO	Signal	Pmod I2S2 pin
GPIO20	MCLK	J1-1
GPIO21	LRCK	J1-2
GPIO22	DAC_SDATA	J1-3
GPIO24	BCLK	J1-4
GPIO25	ADC_SDATA	J1-7
GND	GND	J1-5 / J1-11
3.3V	VCC	J1-6 / J1-12

Timing analysis — clock launch delay

When I first reviewed the timing report, I noticed a clock launch delay of more than 4.5 ns on some paths. That is a large number, and at first I suspected the global clock buffers were not being used — a large clock network delay often means the signal is routed through local interconnect instead of the dedicated clock tree.

Timing analysis waveform showing a clock launch delay of 4.548 ns

I added constraints in the SDC file to force the use of global clock paths, but the delay did not decrease. Looking at the data path detail, we can see how those 4.5 ns are distributed — and notably, the input buffer for hvio_pllrefclk has an excessive fanout of more than 4,000 nodes:

Data path detail showing the clock launch delay breakdown and fanout of 4259 on the input clock buffer

After further investigation I realised the large launch delay was dominated by the input pin delay, because the clock enters through an I/O pad rather than a PLL input. If you look at the internal register-to-register paths, the clock delay is around 0.05 ns, which confirms the global clock tree is working correctly. The take-away: a few nanoseconds of clock insertion delay at the I/O boundary is normal for Agilex 3 when using a pad-sourced clock, and it does not indicate a routing issue.

Verifying the use of DSP blocks, not LUTs

The whole point of using the native_fp IP is to push the floating-point arithmetic into dedicated DSP silicon rather than consuming fabric LUTs. After compilation, the Quartus resource usage report confirms that the FP32 MAC operations are mapped to Variable Precision DSP blocks and not to ALMs. The design instantiates 12 native_fp instances (3 FIR filters × 2 stereo channels + 3 gains × 2 stereo channels), each occupying one DSP block. The coefficient ROMs, delay lines, and control logic use a modest amount of ALMs and M20K memory, but the heavy computation runs entirely in the DSP hard blocks.

This is a significant advantage over soft floating-point IP. A soft FP32 multiplier typically consumes hundreds of ALMs and runs at a lower clock frequency. The native DSP approach uses zero ALMs for the arithmetic, runs at the full fabric clock, and has deterministic latency.

Conclusions

I always say that, if you know how to use one FPGA, you know how to use every FPGA. Vivado, Quartus, Libero SoC, Diamond — all of them are different, but in the end, all of them have VHDL/Verilog source code, a constraints file for clock constraints, pin constraints, a way to draw the circuit, a set of IPs that can be added to the drawing or instantiated into the source code. Each one has its particularities, yes, but it is not too hard to switch between one and another.

As I mentioned at the beginning of the article, this is the first time I have used an Agilex FPGA, and I have to say that I like them a lot. They are pretty new FPGAs — the Agilex family was born in 2023 and became available in 2024 — and this youth means they include some interesting features from the start.

The floating-point DSP is one of them, but executing this kind of operation is not the only thing these blocks can do. They are optimised to work in combination, so they can execute matrix multiplications in a few cycles, an operation that is very important in the AI field.

The board I have features an SoC that includes a Hard Processor System (HPS), a dual-core Arm Cortex-A55, so in the next articles we will take a look at it.