FPGA Implementation of the Convolution Function (Part 2): Implementing the Convolution Multiply-Accumulate Unit

冷不防 2022-04-13 05:11

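**Background:** Moving the convolution weights and data from DRAM to BRAM has already been implemented and verified in software simulation. The next step is to implement the processing element.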

**Goal:** Write the processing element of the convolution IP core.

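**Contents**

1\. Loop Nesting and the Order of Sub-functions

2\. processAll\_channelOut

2.1 Function Purpose

2.2 Nesting in the Main Program

3\. processInputChannel

3.1 Function Implementation

3.2 Where It Is Nested in the Program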

--------------------

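# 1. Loop Nesting and the Order of Sub-functions #

The nesting order is shown below. The original code was one set of nested for loops; we moved the inner levels into two functions, processAll\_channelOut and processInputChannel. The outline shows which loop each function wraps.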

    initialize layer info, read in all weights at once, read the first two rows of pixels
    for row in height{
      load one row of the image
      for col in width{
          for channel_in{
             processInputChannel:{
                 load pixel_buffer
                 processAll_channelOut:{
                       for channel_out{
                             load Weight_buffer
                             MACC
                             accumulate into OBRAM
                       }
                 }//end processAll_channelOut function
             }//end processInputChannel function
          }//end channel_in loop
          for channel_out{
             add bias, apply ReLU
             write OBRAM back to DRAM
          }
      }//end col loop
    }//end row loop
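
The last `for channel_out` loop in the outline (add bias, apply ReLU, write OBRAM back to DRAM) is not shown as code anywhere in this post. A minimal plain C++ sketch of that step could look like the following. It assumes the accumulated sums for one output pixel sit in a flat array; `postProcessPixel`, `obram`, `bias`, `dram_out`, and `dram_offset` are hypothetical names chosen for illustration, not taken from the actual IP core.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical software model of the post-processing loop for one output pixel.
// obram    : accumulated MACC sums, one per output channel (stand-in for OBRAM)
// bias     : one bias value per output channel
// dram_out : stand-in for the output region in DRAM
void postProcessPixel(const std::vector<float>& obram,
                      const std::vector<float>& bias,
                      std::vector<float>& dram_out,
                      std::size_t dram_offset) {
    assert(obram.size() == bias.size());
    assert(dram_offset + obram.size() <= dram_out.size());
    for (std::size_t co = 0; co < obram.size(); ++co) {
        float v = obram[co] + bias[co];  // add bias
        if (v < 0.0f) {
            v = 0.0f;                    // ReLU: clamp negatives to zero
        }
        dram_out[dram_offset + co] = v;  // write the result back
    }
}
```

Because this loop runs only after the `channel_in` loop has finished, each OBRAM entry already holds the full sum over all input channels when the bias and ReLU are applied.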

# 2. processAll\_channelOut #

## 2.1 Function Purpose ##

This function wraps the for channel\_out loop, which performs three steps:

*   Load Weight\_buffer
*   MACC
*   Accumulate into OBRAM

    void ProcessingElement::processAll_channelOut(const int out_Channel_Num, const int cur_ci, const float pixel_buffer[9]){
    L_CH_OUT:
      for(int cur_co=0;cur_co<out_Channel_Num;cur_co++){
    // loop-level pragmas go inside the body of the loop they apply to
    #pragma HLS unroll factor = N_PE
    #pragma HLS PIPELINE II = 1
        float result,weights_local[9];
    #pragma HLS ARRAY_PARTITION variable = weights_local complete dim = 0
        // fetch the 9 weights for this (input channel, output channel) pair
        WeightsCache::get_9_weights_to_buffer(cur_ci,cur_co,weights_local);
        // MACC: 3x3 multiply-accumulate
        ProcessingElement::macc2d(pixel_buffer,weights_local,result);
        // accumulate the 3x3 MACC result in OBRAM:
        // the first input channel overwrites, later ones add
        if (cur_ci == 0) {
            OutputCache::setOutChannel(cur_co, result);
        } else {
            OutputCache::accumulateChannel(cur_co, result);
        }
      }
    }
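
The post does not show `macc2d` or the `OutputCache` helpers. As a rough software model of one iteration of the `cur_co` loop, the sketch below assumes that `macc2d` is a plain 9-term multiply-accumulate and that the output cache holds one float per output channel. `OutputCacheModel`, its fixed size `kMaxChannels`, and `accumulateResult` are hypothetical names for illustration only.

```cpp
#include <cassert>

// Assumed behavior of macc2d: sum of the 9 element-wise products
// of a 3x3 pixel window and a 3x3 filter.
void macc2d(const float pixels[9], const float weights[9], float& result) {
    float acc = 0.0f;
    for (int i = 0; i < 9; ++i) {
        acc += pixels[i] * weights[i];
    }
    result = acc;
}

// Hypothetical stand-in for the OBRAM: one accumulator per output channel.
struct OutputCacheModel {
    static const int kMaxChannels = 16;  // illustrative size, not from the source
    float obram[kMaxChannels] = {};

    void setOutChannel(int co, float value) {
        assert(co >= 0 && co < kMaxChannels);
        obram[co] = value;               // overwrite any stale value
    }
    void accumulateChannel(int co, float value) {
        assert(co >= 0 && co < kMaxChannels);
        obram[co] += value;              // add to the running sum
    }
};

// One iteration of the cur_co loop: the first input channel overwrites
// the accumulator, every later input channel adds to it.
void accumulateResult(OutputCacheModel& cache, int cur_ci, int cur_co,
                      const float pixels[9], const float weights[9]) {
    float result;
    macc2d(pixels, weights, result);
    if (cur_ci == 0) {
        cache.setOutChannel(cur_co, result);
    } else {
        cache.accumulateChannel(cur_co, result);
    }
}
```

Overwriting on `cur_ci == 0` means the accumulators never need a separate clearing pass between output pixels: whatever the previous pixel left behind is replaced by the first input channel's result.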

## 2.2 Nesting in the Main Program ##

    //... the for channel_in loop, inside the outer for loops
    for(cur_channel_in=0;cur_channel_in<ImageCache::in_ChannelNum;cur_channel_in++){//in_channel
        float pixel_buffer[9];
        ProcessingElement::loadPixel_buffer(cur_row_out*stride, stride*cur_col_out,cur_channel_in, pixel_buffer);
        ProcessingElement::processAll_channelOut(MemoryController::out_channelNum, cur_channel_in,pixel_buffer);
    }//channel_in loop

# 3. processInputChannel #

The for in\_channel loop calls processInputChannel, which does two things:

*   Load the pixel\_buffer for the current channel
*   Call processAll\_channelOut

## 3.1 Function Implementation ##

    void ProcessingElement::processInputChannel(const int cur_row_times_stride,const int cur_col_times_stride,const int cur_ci, const int out_channelNum){
    #pragma HLS inline off
    #pragma HLS FUNCTION_INSTANTIATE variable = cur_ci
    #pragma HLS dataflow
        int cur_channel_in=cur_ci;
        float pixel_buffer[9];
        ProcessingElement::loadPixel_buffer(cur_row_times_stride, cur_col_times_stride,cur_channel_in, pixel_buffer);
        ProcessingElement::processAll_channelOut(out_channelNum, cur_channel_in,pixel_buffer);
    }
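
`loadPixel_buffer` is not shown in this post, so its exact addressing is unknown. Purely as an illustration of what loading a 3x3 window for one input channel might involve, the sketch below makes several assumptions: the image is stored channel-major in a flat array, the `(row, col)` arguments are the window center with the stride already applied, and pixels outside the image read as zero. `ImageModel` and `loadPixelBuffer` are hypothetical names, and the real function may instead read from a line buffer in BRAM.

```cpp
#include <cassert>
#include <vector>

// Hypothetical image model: channel-major (CHW) layout in a flat vector.
struct ImageModel {
    int channels;
    int height;
    int width;
    std::vector<float> data;  // size = channels * height * width

    // Returns 0 for coordinates outside the image (assumed zero padding).
    float at(int c, int y, int x) const {
        assert(c >= 0 && c < channels);
        if (y < 0 || y >= height || x < 0 || x >= width) {
            return 0.0f;
        }
        return data[(c * height + y) * width + x];
    }
};

// Sketch of filling the 3x3 window for input channel ci.
// Assumption: (row, col) is the window center, stride already applied.
void loadPixelBuffer(const ImageModel& img, int row, int col, int ci,
                     float pixel_buffer[9]) {
    for (int dy = 0; dy < 3; ++dy) {
        for (int dx = 0; dx < 3; ++dx) {
            // Row-major order inside the window, matching a flat weights[9]
            pixel_buffer[dy * 3 + dx] = img.at(ci, row + dy - 1, col + dx - 1);
        }
    }
}
```

Whatever the real addressing scheme is, the window must be laid out in the same element order as the 9 weights, since the multiply-accumulate pairs them index by index.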

## 3.2 Where It Is Nested in the Program ##

    for(cur_channel_in=0;cur_channel_in<ImageCache::in_ChannelNum;cur_channel_in++){//in_channel
        //process input channel(process output channel)
        ProcessingElement::processInputChannel(cur_row_out*stride,cur_col_out*stride,
                            cur_channel_in, MemoryController::out_channelNum);
    }//channel_in loop

At this point, the C code of our IP core has been fully restructured to match zynqNet. The next step is to optimize this code with HLS pragmas.