Contents

1. FPGA Fixed-Point Arithmetic

2. FPGA Implementation of Convolutional Neural Networks

2.1 Convolution Layer: Principle and FPGA Implementation

2.2 Activation Layer: Principle and FPGA Implementation

2.3 Piecewise-Linear Sigmoid Approximation and FPGA Implementation

2.4 Max-Pooling Layer: Principle and FPGA Implementation

2.5 Fully Connected Layer: Principle and FPGA Implementation

2.6 Batch Normalization Layer: Principle and FPGA Implementation


With the rapid advance of artificial intelligence, deep learning and reinforcement learning have achieved breakthrough results in image recognition, natural language processing, autonomous driving, robot control, and other fields. However, when deploying these models, conventional GPU and CPU platforms often suffer from high power consumption, high latency, and a large physical footprint, making them ill-suited to edge computing and real-time inference. Thanks to its massively parallel architecture, reconfigurability, low power consumption, and low latency, the FPGA is an attractive hardware platform for deploying AI models. This article uses the convolutional neural network (CNN) as a representative deep learning model to show how a CNN can be implemented on an FPGA.

1. FPGA Fixed-Point Arithmetic

Floating-point arithmetic consumes a large amount of FPGA resources, so fixed-point representation is usually used instead. For example, a 16-bit fixed-point number can be split into 1 sign bit, 7 integer bits, and 8 fractional bits (Q7.8 format), so a real value x is stored as round(x · 2^8).
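To make the format concrete, the mapping between real values and Q7.8 bit patterns can be sketched in Python (an illustrative helper, not part of the original design files):

```python
# Illustrative sketch of Q7.8 encoding/decoding (assumed helpers, not from the article).
def to_q78(x: float) -> int:
    """Encode a real number into a 16-bit two's-complement Q7.8 bit pattern."""
    v = int(round(x * 256))           # scale by 2^8
    v = max(-32768, min(32767, v))    # clamp to the representable range [-128, 127.996]
    return v & 0xFFFF                 # two's-complement bit pattern

def from_q78(v: int) -> float:
    """Decode a 16-bit Q7.8 bit pattern back to a real number."""
    if v & 0x8000:                    # sign bit set -> negative value
        v -= 0x10000
    return v / 256.0
```

For example, 1.5 encodes to 0x0180 and -0.5 to 0xFF80; values outside roughly ±128 clamp to the extremes.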

In Verilog, a fixed-point number is simply represented by a signed register:

// Fixed-point definition: 16 bits, Q7.8 format
// 1 sign bit + 7 integer bits + 8 fractional bits
reg signed [15:0] fixed_point_value;

// Fixed-point multiply: the product of two Q7.8 numbers must be shifted right by 8
module fixed_multiply (
    input  signed [15:0] a,      // Q7.8
    input  signed [15:0] b,      // Q7.8
    output signed [15:0] result  // Q7.8
);
    wire signed [31:0] mult_full;
    assign mult_full = a * b;           // full-precision 32-bit product
    assign result = mult_full[23:8];    // shift right by 8 to return to Q7.8 (no saturation)
endmodule
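The truncation performed by fixed_multiply can be mirrored by a bit-accurate Python model, which is handy for generating testbench vectors (an illustrative sketch, not from the original):

```python
# Bit-accurate model of the fixed_multiply module (illustrative, assumed helper names).
def q78_mult(a: int, b: int) -> int:
    """Full 32-bit signed product of two Q7.8 patterns, then bits [23:8]."""
    def s16(v):  # interpret a 16-bit pattern as a signed value
        return v - 0x10000 if v & 0x8000 else v
    full = s16(a & 0xFFFF) * s16(b & 0xFFFF)  # full-precision product (16 fractional bits)
    return (full >> 8) & 0xFFFF               # arithmetic shift + mask == mult_full[23:8]
```

For example, 1.5 × 2.0 (patterns 384 and 512) yields 768, which is 3.0 in Q7.8.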

Fixed-point addition is simpler: two numbers in the same format can be added directly. The Verilog is as follows:

module fixed_add (
    input  signed [15:0] a,
    input  signed [15:0] b,
    output signed [15:0] result
);
    wire signed [16:0] sum_full;
    assign sum_full = a + b;
    // Saturate to the 16-bit signed range to prevent overflow
    // (note: -16'sd32768 is not a representable signed literal, so hex constants are used)
    assign result = (sum_full > 17'sd32767)  ? 16'sh7FFF :
                    (sum_full < -17'sd32768) ? 16'sh8000 :
                    sum_full[15:0];
endmodule
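Likewise, the saturating adder can be modeled in software for quick checks (an illustrative sketch with an assumed helper name):

```python
# Model of the fixed_add module: wide sum, saturated to the 16-bit signed range.
def q78_add_sat(a: int, b: int) -> int:
    def s16(v):  # interpret a 16-bit pattern as a signed value
        return v - 0x10000 if v & 0x8000 else v
    s = s16(a & 0xFFFF) + s16(b & 0xFFFF)  # exact 17-bit sum (Python ints don't wrap)
    s = max(-32768, min(32767, s))         # saturate instead of wrapping around
    return s & 0xFFFF
```

Saturation matters here: without it, 127.996 + 0.004 would wrap around to -128.0.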

2. FPGA Implementation of Convolutional Neural Networks

The convolutional neural network is one of the most widely used architectures in deep learning, applied to tasks such as image classification and object detection. A typical CNN is built from the following layers: convolution layers, activation layers, pooling layers, and fully connected layers.

The dataflow of the complete system is shown in the figure below:

2.1 Convolution Layer: Principle and FPGA Implementation

The convolution layer is the core computation of a CNN. For a 2-D convolution, let the input feature map be I, the convolution kernel be K, the output feature map be O, and the bias be b. The convolution is defined as

O(i, j) = Σ_{m=0}^{Kh−1} Σ_{n=0}^{Kw−1} I(i+m, j+n) · K(m, n) + b

where Kh and Kw are the kernel height and width. In the multi-channel case, with Cin input channels and Cout output channels,

O_k(i, j) = Σ_{c=0}^{Cin−1} Σ_{m=0}^{Kh−1} Σ_{n=0}^{Kw−1} I_c(i+m, j+n) · K_{k,c}(m, n) + b_k,   k = 0, …, Cout−1

Take the common 3×3 kernel and a single channel as an example. Convolution is at heart a multiply-accumulate operation:

module conv3x3 (
    input  wire        clk,
    input  wire        rst_n,
    input  wire        valid_in,
    // 3x3 input window data (Q7.8 fixed-point)
    input  signed [15:0] pixel_00, pixel_01, pixel_02,
    input  signed [15:0] pixel_10, pixel_11, pixel_12,
    input  signed [15:0] pixel_20, pixel_21, pixel_22,
    // 3x3 convolution kernel weights (Q7.8 fixed-point)
    input  signed [15:0] weight_00, weight_01, weight_02,
    input  signed [15:0] weight_10, weight_11, weight_12,
    input  signed [15:0] weight_20, weight_21, weight_22,
    // Bias
    input  signed [15:0] bias,
    // Outputs
    output reg signed [15:0] conv_out,
    output reg         valid_out
);

    // Products (full 32-bit precision)
    wire signed [31:0] mult [0:8];
    
    // 9 parallel multipliers
    assign mult[0] = pixel_00 * weight_00;
    assign mult[1] = pixel_01 * weight_01;
    assign mult[2] = pixel_02 * weight_02;
    assign mult[3] = pixel_10 * weight_10;
    assign mult[4] = pixel_11 * weight_11;
    assign mult[5] = pixel_12 * weight_12;
    assign mult[6] = pixel_20 * weight_20;
    assign mult[7] = pixel_21 * weight_21;
    assign mult[8] = pixel_22 * weight_22;
    
    // Adder tree for accumulation (pipeline stage 1)
    wire signed [31:0] sum_level1 [0:3];
    assign sum_level1[0] = mult[0] + mult[1];
    assign sum_level1[1] = mult[2] + mult[3];
    assign sum_level1[2] = mult[4] + mult[5];
    assign sum_level1[3] = mult[6] + mult[7];
    
    // Adder tree, level 2
    wire signed [31:0] sum_level2 [0:1];
    assign sum_level2[0] = sum_level1[0] + sum_level1[1];
    assign sum_level2[1] = sum_level1[2] + sum_level1[3];
    
    // Adder tree, level 3
    wire signed [31:0] sum_level3;
    assign sum_level3 = sum_level2[0] + sum_level2[1];
    
    // Final accumulation: add the 9th product and the bias
    wire signed [31:0] sum_final;
    assign sum_final = sum_level3 + mult[8] + (bias <<< 8); // align the Q7.8 bias to the 16 fractional bits of the products
    
    // Truncate back to Q7.8 (note: no saturation is applied here)
    wire signed [15:0] result_truncated;
    assign result_truncated = sum_final[23:8];
    
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            conv_out  <= 16'd0;
            valid_out <= 1'b0;
        end else begin
            conv_out  <= result_truncated;
            valid_out <= valid_in;
        end
    end

endmodule
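A software golden model is useful when verifying the conv3x3 module in simulation. The following Python sketch (an assumed helper, not from the original) reproduces the wide accumulate, bias alignment, and truncation bit-for-bit:

```python
# Golden model of conv3x3: 9 Q7.8 products, exact accumulate, shifted bias, truncate.
def conv3x3_ref(pixels, weights, bias):
    """pixels, weights: 9 Q7.8 bit patterns in row-major order; bias: Q7.8 pattern."""
    def s16(v):  # interpret a 16-bit pattern as a signed value
        return v - 0x10000 if v & 0x8000 else v
    acc = sum(s16(p) * s16(w) for p, w in zip(pixels, weights))
    acc += s16(bias) << 8            # align the Q7.8 bias to the 16-fraction-bit products
    return (acc >> 8) & 0xFFFF       # truncate back to a Q7.8 bit pattern
```

With an identity kernel (center weight 1.0, the rest 0) the output equals the center pixel, which makes a convenient first testbench case.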

2.2 Activation Layer: Principle and FPGA Implementation

ReLU is the most widely used activation function, defined as

ReLU(x) = max(0, x)

It is extremely cheap to implement on an FPGA: only the sign bit has to be examined. The Verilog is as follows:

module relu (
    input  wire signed [15:0] data_in,
    output wire signed [15:0] data_out
);
    // Output 0 when the sign bit is 1 (negative input); otherwise pass the input through
    assign data_out = data_in[15] ? 16'd0 : data_in;
endmodule

2.3 Piecewise-Linear Sigmoid Approximation and FPGA Implementation

The Sigmoid function is defined as

σ(x) = 1 / (1 + e^(−x))

Because Sigmoid involves an exponential and a division, a direct FPGA implementation is very expensive. A common alternative is a piecewise-linear approximation:

σ(x) ≈ 0,                     x ≤ −5
       0.03125·x + 0.15625,   −5 < x ≤ −2.375
       0.125·x + 0.5,         −2.375 < x ≤ 2.375
       0.03125·x + 0.84375,   2.375 < x < 5
       1,                     x ≥ 5

The Verilog is as follows:

module sigmoid_pwl (
    input  wire signed [15:0] data_in,   // Q7.8
    output reg  signed [15:0] data_out   // Q7.8, 输出范围[0,1]
);
    // Constants (Q7.8 format)
    localparam signed [15:0] NEG5     = -16'sd1280;  // -5.0 * 256
    localparam signed [15:0] NEG2_375 = -16'sd608;   // -2.375 * 256
    localparam signed [15:0] POS2_375 =  16'sd608;
    localparam signed [15:0] POS5     =  16'sd1280;
    localparam signed [15:0] ONE      =  16'sd256;   // 1.0 * 256
    localparam signed [15:0] HALF     =  16'sd128;   // 0.5 * 256
    
    always @(*) begin
        if (data_in <= NEG5) begin
            data_out = 16'd0;
        end else if (data_in <= NEG2_375) begin
            // 0.03125 * x + 0.15625
            // 0.03125 = 8 in Q7.8, 0.15625 = 40 in Q7.8
            data_out = ((data_in * 16'sd8) >>> 8) + 16'sd40;
        end else if (data_in <= POS2_375) begin
            // 0.125 * x + 0.5
            // 0.125 = 32 in Q7.8, 0.5 = 128 in Q7.8
            data_out = ((data_in * 16'sd32) >>> 8) + HALF;
        end else if (data_in < POS5) begin
            // 0.03125 * x + 0.84375
            // 0.84375 = 216 in Q7.8
            data_out = ((data_in * 16'sd8) >>> 8) + 16'sd216;
        end else begin
            data_out = ONE;
        end
    end
endmodule
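Before committing the segment constants to hardware, the approximation can be sanity-checked in software. Below is a float model of the same segments (an illustrative sketch, not from the original), which makes it easy to measure the worst-case deviation from the exact sigmoid:

```python
import math

# Float model of the piecewise-linear segments used in sigmoid_pwl (illustrative).
def sigmoid_pwl_ref(x: float) -> float:
    if x <= -5.0:
        return 0.0
    if x <= -2.375:
        return 0.03125 * x + 0.15625
    if x <= 2.375:
        return 0.125 * x + 0.5
    if x < 5.0:
        return 0.03125 * x + 0.84375
    return 1.0

# Worst-case deviation from the exact sigmoid on [-8, 8], sampled in steps of 1/16
worst = max(abs(sigmoid_pwl_ref(i / 16) - 1 / (1 + math.exp(-i / 16)))
            for i in range(-128, 129))
```

On this grid the deviation stays below 0.2; whether that accuracy is acceptable depends on the network, and finer segmentation reduces the error at the cost of more comparators.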

2.4 Max-Pooling Layer: Principle and FPGA Implementation

Max pooling takes the maximum value within each pooling window. With a 2×2 window and a stride of 2:

P(i, j) = max( I(2i, 2j), I(2i, 2j+1), I(2i+1, 2j), I(2i+1, 2j+1) )

Max pooling preserves the most salient feature in each region while halving the spatial size of the feature map.

The Verilog implementation of max pooling is as follows:

module max_pool_2x2 (
    input  wire signed [15:0] pixel_00,
    input  wire signed [15:0] pixel_01,
    input  wire signed [15:0] pixel_10,
    input  wire signed [15:0] pixel_11,
    output wire signed [15:0] pool_out
);
    wire signed [15:0] max_row0, max_row1;
    
    // Maximum of the first row
    assign max_row0 = (pixel_00 > pixel_01) ? pixel_00 : pixel_01;
    // Maximum of the second row
    assign max_row1 = (pixel_10 > pixel_11) ? pixel_10 : pixel_11;
    // Maximum of the two row maxima
    assign pool_out = (max_row0 > max_row1) ? max_row0 : max_row1;

endmodule

2.5 Fully Connected Layer: Principle and FPGA Implementation

A fully connected layer is essentially a matrix-vector multiplication plus a bias:

y = W·x + b

where x is the input vector (of length N), W is the weight matrix, b is the bias vector, and y is the output vector.

As an example, consider a single fully connected neuron with an input dimension of 4 and an output dimension of 1 (note that the unpacked array ports below require SystemVerilog):

module fully_connected #(
    parameter INPUT_DIM = 4
)(
    input  wire        clk,
    input  wire        rst_n,
    input  wire        start,
    input  wire signed [15:0] x [0:INPUT_DIM-1],  // input vector
    input  wire signed [15:0] w [0:INPUT_DIM-1],  // weight vector
    input  wire signed [15:0] bias,                 // bias
    output reg  signed [15:0] y,                    // output
    output reg         done
);

    reg [2:0] state;
    reg [15:0] idx;
    reg signed [31:0] acc;  // accumulator
    
    localparam IDLE    = 3'd0;
    localparam COMPUTE = 3'd1;
    localparam OUTPUT  = 3'd2;
    
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            state <= IDLE;
            acc   <= 0;
            y     <= 0;
            done  <= 0;
            idx   <= 0;
        end else begin
            case (state)
                IDLE: begin
                    done <= 0;
                    if (start) begin
                        acc   <= 0;
                        idx   <= 0;
                        state <= COMPUTE;
                    end
                end
                
                COMPUTE: begin
                    // Multiply-accumulate: acc += x[idx] * w[idx]
                    acc <= acc + (x[idx] * w[idx]);
                    if (idx == INPUT_DIM - 1) begin
                        state <= OUTPUT;
                    end else begin
                        idx <= idx + 1;
                    end
                end
                
                OUTPUT: begin
                    // Add the bias and truncate back to Q7.8
                    y    <= (acc + (bias <<< 8)) >>> 8;
                    done <= 1;
                    state <= IDLE;
                end
            endcase
        end
    end

endmodule
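The FSM's arithmetic can again be checked against a bit-accurate software reference (an illustrative sketch with an assumed helper name):

```python
# Reference for the fully_connected FSM: MAC over Q7.8 patterns, shifted bias, >>8.
def fc_ref(x, w, bias):
    """x, w: equal-length lists of Q7.8 bit patterns; bias: a Q7.8 pattern."""
    def s16(v):  # interpret a 16-bit pattern as a signed value
        return v - 0x10000 if v & 0x8000 else v
    acc = sum(s16(xi) * s16(wi) for xi, wi in zip(x, w))
    acc += s16(bias) << 8            # align the bias to the 16-fraction-bit accumulator
    return (acc >> 8) & 0xFFFF       # truncate back to a Q7.8 bit pattern
```

Note that the hardware version takes INPUT_DIM cycles in the COMPUTE state plus one OUTPUT cycle, since it reuses a single multiplier rather than instantiating one per input.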

2.6 Batch Normalization Layer: Principle and FPGA Implementation

At inference time, Batch Normalization reduces to a linear transformation:

y = γ̂·x + β̂,   where γ̂ = γ / √(σ² + ε) and β̂ = β − γ̂·μ

Since γ̂ and β̂ are constants fixed once training is finished, FPGA inference only needs a single multiply-add.

The Verilog implementation is as follows:

module batch_norm (
    input  wire signed [15:0] data_in,
    input  wire signed [15:0] gamma_hat,  // precomputed scale factor
    input  wire signed [15:0] beta_hat,   // precomputed offset
    output wire signed [15:0] data_out
);
    wire signed [31:0] scaled;
    assign scaled = data_in * gamma_hat;
    assign data_out = scaled[23:8] + beta_hat;
endmodule
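The constants gamma_hat and beta_hat are produced offline by folding the trained BN parameters. A Python sketch of that folding step (illustrative, with an assumed helper name) might look like:

```python
import math

# Fold trained BN statistics into the two constants consumed by the batch_norm module.
def fold_bn(gamma, beta, mean, var, eps=1e-5):
    gamma_hat = gamma / math.sqrt(var + eps)  # scale absorbed from 1/sqrt(var + eps)
    beta_hat = beta - gamma_hat * mean        # offset absorbed from the mean shift
    return gamma_hat, beta_hat
```

Both constants are then quantized to Q7.8 before being loaded into the FPGA, so the per-pixel cost is exactly one multiply and one add.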
