Note: detailed development documentation and reference guides are available in awnpu_model_zoo\docs.

This article follows the 《NPU开发环境部署参考指南》 to deploy the Ubuntu environment on a PC, using the Docker image environment as the example.

For a more detailed look at the deployment flow, see the 《NPU_模型部署_开发指南》.

Important: for models not included in awnpu_model_zoo, you can adapt the closest model family under awnpu_model_zoo\examples. This requires writing the model's C++ pre- and post-processing code yourself and modifying the related configuration files (such as model_config.h and CMakeLists.txt). If model export or quantization fails, the model may need to be pruned accordingly, or a different quantization scheme used. The CLIP model in this article has been verified with on-board inference.

The clip directory is structured as follows:

|-- CMakeLists.txt
|-- README.md
|-- clip-images_pre.cpp
|-- clip_post.cpp
|-- clip_tokenizer.cpp
|-- clip_tokenizer.h
|-- figures
|   |-- 1.png
|   `-- 2.png
|-- images_convert_model              # image encoder
|   |-- config_yml.py                         # config file for the image model
|   `-- convert_model_env.sh
|-- main.cpp
|-- model
|   |-- demo.png
|   |-- demo.txt
|   `-- merges.txt
|-- model_config.h
`-- text_convert_model                    # text encoder
    |-- config_yml.py                         # config file for the text model
    |-- convert_model_env.sh
    `-- python
        |-- onnx_fixed.py
        `-- truncated_onnx.py

Environment Setup

The environment setup is identical for all models and uses a Docker container. For details, see the earlier article on deploying YOLOX to the Allwinner T736 board (【端侧部署yolo系列】yolox部署至全志开发板T736): https://blog.csdn.net/troyteng/article/details/155444386?spm=1011.2124.3001.6209

Download the image file and AWNPU_Model_Zoo, then create your own container.

Model Preparation

Official original PyTorch model

https://huggingface.co/openai/clip-vit-base-patch32/tree/main. If you want to skip the model-preparation steps below, the official model has already been exported to ONNX here: clip-images and clip-text.

Model Export

Install the environment needed to export an ONNX model from the https://huggingface.co/openai/clip-vit-base-patch32 repository, then run the following commands to export the ONNX model (this ONNX contains both the image encoder and the text encoder; the two models are split apart later):

cd examples/clip
optimum-cli export onnx --model ./clip-vit-base-patch32/ --task feature-extraction --opset 18 ./clip_onnx/

# Place the weight files downloaded from the open-source project under clip-vit-base-patch32
# After the export command runs, the generated model.onnx is in the clip_onnx directory

Use truncated_onnx.py to split the image model (clip_images.onnx) and the text model (clip_text.onnx) out of model.onnx:

cd examples/clip/text_convert_model/python
mkdir model
python truncated_onnx.py --model ../../clip_onnx/model.onnx

# Produces the image model clip_images.onnx and the text model clip_text.onnx

Model Graph Optimization

The model input is fixed to (1, 20): batch_size = 1 and sequence length 20. With a fixed input length in which every token is valid, attention_mask is an all-ones tensor with no dynamic variation, so it can be removed. This simplifies NPU deployment and improves inference efficiency; the trade-off is that the model can only handle valid input sequences of that fixed length.

Use onnx-modifier (the link includes a detailed usage guide) to remove the attention_mask input from the clip-text model, i.e. remove the nodes shown in the figure: the Where nodes and the 12 Add nodes attached to their 12 outputs.

Install the library in the clip environment:

# Install the dependency
pip install onnx-modifier

# Run it
onnx-modifier

The output looks like this:

 * Serving Flask app 'onnx_modifier.app'
 * Debug mode: off
INFO:werkzeug:WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on http://127.0.0.1:5000
INFO:werkzeug:Press CTRL+C to quit

Open http://127.0.0.1:5000 in a browser.

Drag the exported text model (clip_text.onnx) in, locate the attention_mask input, then find a Where node, click it, and delete that single node.

Then delete all child nodes of attention_mask; handle the other nodes the same way.

Because the Where node has been deleted, the downstream Add operator becomes invalid, so the 12 Add nodes attached to the 12 outputs must be deleted as well. Concretely: find the Add node's Expand input and change the Expand node's output to the Add node's output (similar to unlinking a node from a linked list), i.e. change the name at step 2 in the figure to the Slice's name; handle the other Add nodes the same way.

Finally, tick clean up and click Download. The model is saved to the specified path (the modified_onnx folder under the downloaded onnx-modifier library's directory).

Fixing Input Shapes

# Fix the model input sequence length
cd examples/clip/text_convert_model/python/model
#clip_text
python -m onnxruntime.tools.make_dynamic_shape_fixed --input_name input_ids --input_shape 1,20 clip_text.onnx clip_text-fixed.onnx

#clip_images
python -m onnxruntime.tools.make_dynamic_shape_fixed --input_name pixel_values --input_shape 1,3,224,224 clip_images.onnx clip_images-fixed.onnx

# Model simplification
#clip_text
python -m onnxsim clip_text-fixed.onnx clip_text-sim.onnx --overwrite-input-shape=1,20

#clip_images
python -m onnxsim clip_images-fixed.onnx ../../../images_convert_model/clip-images.onnx --overwrite-input-shape=1,3,224,224

After fixing the shapes, use onnx_fixed.py to modify the Add input constant, changing -3.4028234663852886e+38 to -10000 to improve downstream quantization accuracy.

The motivation: extreme values easily cause precision loss during quantization. In a Transformer's self-attention, attention_mask works by adding a huge negative number (such as -3.4e38) to the attention scores at padding positions, so that after softmax the weights at those positions approach 0. Replacing -3.4e38 with -10000 avoids the quantization problem: -10000 is already more than enough for softmax to drive the padding weights to effectively zero, and it is far friendlier to quantization.

cd examples/clip/text_convert_model/python
python onnx_fixed.py
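The claim that -10000 still drives padding weights to essentially zero after softmax can be checked with a quick pure-Python computation:

```python
import math

def softmax(xs):
    # subtract the max first for numerical stability, as the C post-processing code does
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# two valid attention scores plus one padding position masked with -10000
scores = [2.0, 1.0, 1.0 - 10000.0]
weights = softmax(scores)

print(weights)   # the padding position's weight underflows to essentially 0
```

exp(-10000) underflows long before float precision matters, so the masked position contributes nothing to the softmax, exactly as with -3.4e38.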

Quantization Dataset

The dataset comes from OpenAI's CLIP blog page (https://openai.com/index/clip/). Ten images and their captions were picked from that page to serve as the quantization dataset; the captions used in this article are:

a photo of a motorcycle
a photo of guacamole
a photo of a television studio
a photo of a building
a photo of a airplane
a photo of a barn
a photo of a kangaroo
a photo of a beer bottle
a photo of a pill bottle
a photo of a horse

Feed the dataset through CLIP's Python inference code, capture the input_ids input tensors, and convert them to .npy files; this produces the clip-text quantization dataset.
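That conversion step can be sketched as follows. The token ids and directory layout here are illustrative assumptions; in practice the ids come from whatever tokenizer your Python inference code uses (BOS + token ids + EOS, padded to the fixed length of 20):

```python
import numpy as np
from pathlib import Path

# Illustrative ids only -- real values come from the CLIP tokenizer in the
# Python inference code (BOS=49406, EOS/PAD=49407, padded to length 20).
input_ids = [
    [49406, 320, 1125, 539, 320, 4558, 49407] + [49407] * 13,
]

out_dir = Path("dataset/clip_10/text")   # assumed layout, matching DATASET in config_yml.py
out_dir.mkdir(parents=True, exist_ok=True)

for i, ids in enumerate(input_ids):
    # dtype must match the model's input_ids type (int32 assumed here;
    # some toolchains expect int64 -- check your exported model)
    arr = np.array(ids, dtype=np.int32).reshape(1, 20)   # matches the fixed (1, 20) input
    np.save(out_dir / f"input_ids_{i}.npy", arr)
```

The dataset.txt referenced by DATASET in text_convert_model/config_yml.py would then list these .npy file paths, one per line (an assumption based on the TEXT dataset type).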

Model Configuration

clip-images

cd ./images_convert_model/

Modify the relevant parameter settings in config_yml.py:

# "database" allowed types: "TEXT, NPY, H5FS, SQLITE, LMDB, GENERATOR, ZIP"
DATASET = '../../dataset/clip_10/images/dataset.txt'
DATASET_TYPE = "TEXT"

# mean, scale
MEAN = [122.770938, 116.746013, 104.093736]
SCALE = [0.014598, 0.015007, 0.014220]

# reverse_channel: True bgr, False rgb
REVERSE_CHANNEL = False

# add_preproc_node, True or False
ADD_PREPROC_NODE = True
# "preproc_type" allowed types:"IMAGE_RGB, IMAGE_RGB888_PLANAR, IMAGE_RGB888_PLANAR_SEP, IMAGE_I420,
# IMAGE_NV12,IMAGE_NV21, IMAGE_YUV444, IMAGE_YUYV422, IMAGE_UYVY422, IMAGE_GRAY, IMAGE_BGRA, TENSOR"
PREPROC_TYPE = "IMAGE_RGB"

# add_postproc_node, quant output -> float32 output
ADD_POSTPROC_NODE = True

clip-text

cd ./text_convert_model/

Modify the relevant parameter settings in config_yml.py:

# "database" allowed types: "TEXT, NPY, H5FS, SQLITE, LMDB, GENERATOR, ZIP"
DATASET = '../../dataset/clip_10/text/dataset.txt'
DATASET_TYPE = "TEXT"

# mean, scale
MEAN = [0, 0, 0]
SCALE = [1, 1, 1]

# reverse_channel: True bgr, False rgb
REVERSE_CHANNEL = False

# add_preproc_node, True or False
ADD_PREPROC_NODE = False
# "preproc_type" allowed types:"IMAGE_RGB, IMAGE_RGB888_PLANAR, IMAGE_RGB888_PLANAR_SEP, IMAGE_I420,
# IMAGE_NV12,IMAGE_NV21, IMAGE_YUV444, IMAGE_YUYV422, IMAGE_UYVY422, IMAGE_GRAY, IMAGE_BGRA, TENSOR"
PREPROC_TYPE = "TENSOR"

# add_postproc_node, quant output -> float32 output
ADD_POSTPROC_NODE = True

Model Pre- and Post-Processing

Configuration files

model_config.h

/****************************************************************************
*  model config header file
****************************************************************************/
#ifndef _MODEL_CONFIG_H_
#define _MODEL_CONFIG_H_

#include <iostream>
#include <vector>


/* 224 x 224 */
#define LETTERBOX_ROWS 224
#define LETTERBOX_COLS 224

#define SEQ_LEN 20
#define FEAT_DIM 512
#define MAX_TEXT_LINE_LENGTH 1024
#define MAX_TEXT_COUNT 100

#define MERGES_TXT_PATH "./model/merges.txt"

#endif

clip_tokenizer.h

#ifndef _CLIP_TOKENIZER_H
#define _CLIP_TOKENIZER_H

#include <string>
#include <regex>
#include <set>
#include <codecvt>
#include <locale>
#include <map>



const int UNK_TOKEN_ID = 49407;
const int BOS_TOKEN_ID = 49406;
const int EOS_TOKEN_ID = 49407;
const int PAD_TOKEN_ID = 49407;

std::u32string utf8_to_utf32(const std::string& utf8_str);
std::string utf32_to_utf8(const std::u32string& utf32_str);
std::u32string unicode_value_to_utf32(int unicode_value);


class CLIPTokenizer {
private:
    std::map<int, std::u32string> byte_encoder;
    std::map<std::u32string, int> encoder;
    std::map<std::pair<std::u32string, std::u32string>, int> bpe_ranks;
    std::string merges_utf8_str;

public:
    CLIPTokenizer() {
    }

    void load_from_merges(const std::string& merges_utf8_str);
    bool load_from_file(const std::string& merges_file_path);

    std::u32string bpe(const std::u32string& token);

    std::vector<int> tokenize(std::string text, size_t max_length = 0, bool padding = false);

    std::vector<int> encode(std::string text);

};

#endif

Image preprocessing: clip-images_pre.cpp

#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <iostream>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
#include <chrono>

#include "model_config.h"

/* model_inputmeta.yml file param modify, eg:

    preproc_node_params:
      add_preproc_node: true
      preproc_type: IMAGE_RGB


demo model: model_rgb_xxx.nb.
*/

void get_input_data(const char* image_file, unsigned char* input_data, int letterbox_rows, int letterbox_cols)
{
    std::chrono::steady_clock::time_point Tbegin, Tend;
    Tbegin = std::chrono::steady_clock::now();

    cv::Mat sample = cv::imread(image_file, 1);
    if (sample.empty()) {
        fprintf(stderr, "cv::imread %s failed\n", image_file);
        return;
    }

    Tend = std::chrono::steady_clock::now();
    float f = std::chrono::duration_cast<std::chrono::milliseconds>(Tend - Tbegin).count();

//    std::cout << "preprocess cv::imread image file cost time : " << f << " ms" << std::endl;


    cv::Mat img;
    cv::cvtColor(sample, img, cv::COLOR_BGR2RGB);


    /* letterbox process to support different letterbox size */
    float scale_letterbox = 1.f;
    if ((letterbox_rows * 1.0 / img.rows) < (letterbox_cols * 1.0 / img.cols))
    {
        scale_letterbox = letterbox_rows * 1.0 / img.rows;
    }
    else
    {
        scale_letterbox = letterbox_cols * 1.0 / img.cols;
    }
    int resize_cols = int(round(scale_letterbox * img.cols));
    int resize_rows = int(round(scale_letterbox * img.rows));

    float dh = (float)(letterbox_rows - resize_rows);
    float dw = (float)(letterbox_cols - resize_cols);

    dh /= 2.0f;
    dw /= 2.0f;

    cv::resize(img, img, cv::Size(resize_cols, resize_rows));

    // create a mat with input_data ptr
    cv::Mat img_new(letterbox_rows, letterbox_cols, CV_8UC3, input_data);
    int top   = (int)(round(dh - 0.1));
    int bot   = (int)(round(dh + 0.1));
    int left  = (int)(round(dw - 0.1));
    int right = (int)(round(dw + 0.1));

    // Letterbox filling
    cv::copyMakeBorder(img, img_new, top, bot, left, right, cv::BORDER_CONSTANT, cv::Scalar(114, 114, 114));
}


int clip_images_preprocess(const char* imagepath, void* buff_ptr, unsigned int buff_size)
{
    int img_c = 3;

    // set default letterbox size
    int letterbox_rows = LETTERBOX_ROWS;
    int letterbox_cols = LETTERBOX_COLS;
    int img_size = letterbox_rows * letterbox_cols * img_c;

    unsigned int data_size = img_size * sizeof(uint8_t);

    if (data_size > buff_size) {
        printf("data size > buff size, please check code.\n");
        return -1;
    }

    get_input_data(imagepath, (unsigned char*)buff_ptr, letterbox_rows, letterbox_cols);

    return 0;
}
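As a sanity check, the letterbox arithmetic in get_input_data above (scale so the image fits inside the box, then split the padding with the ±0.1 rounding trick) can be reproduced in a few lines of Python:

```python
def letterbox_params(rows, cols, box_rows=224, box_cols=224):
    # pick the smaller ratio so the resized image fits inside the 224x224 box
    scale = min(box_rows / rows, box_cols / cols)
    resize_rows = int(round(scale * rows))
    resize_cols = int(round(scale * cols))
    dh = (box_rows - resize_rows) / 2.0
    dw = (box_cols - resize_cols) / 2.0
    # the +-0.1 rounding trick splits an odd total padding as (smaller, larger)
    top, bot = int(round(dh - 0.1)), int(round(dh + 0.1))
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
    return (resize_rows, resize_cols), (top, bot, left, right)

# a 640x480 (rows x cols) image is resized to 224x168 and padded 28px left and right
size, pads = letterbox_params(640, 480)
print(size, pads)   # -> (224, 168) (0, 0, 28, 28)
```

Note that Python's round uses banker's rounding while C's round rounds half away from zero; the ±0.1 offsets keep the padding values away from exact .5, so the two agree here.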

Text tokenizer: clip_tokenizer.cpp

#include "clip_tokenizer.h"
#include <regex>
#include <algorithm>
#include <set>
#include <map>
#include <iostream>
#include <fstream>
#include <climits>


std::u32string utf8_to_utf32(const std::string& utf8_str) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    try {
        return converter.from_bytes(utf8_str);
    } catch (const std::range_error& e) {
        std::cerr << "UTF8 to UTF32 conversion error: " << e.what() << std::endl;
        return U"";
    }
}

std::string utf32_to_utf8(const std::u32string& utf32_str) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    try {
        return converter.to_bytes(utf32_str);
    } catch (const std::range_error& e) {
        std::cerr << "UTF32 to UTF8 conversion error: " << e.what() << std::endl;
        return "";
    }
}

std::u32string unicode_value_to_utf32(int unicode_value) {
    std::u32string utf32_string = {static_cast<char32_t>(unicode_value)};
    return utf32_string;
}


std::vector<std::pair<int, std::u32string>> bytes_to_unicode() {
    std::vector<std::pair<int, std::u32string>> byte_unicode_pairs;
    std::set<int> byte_set;


    for (int b = static_cast<int>('!'); b <= static_cast<int>('~'); ++b) {
        byte_set.insert(b);
        byte_unicode_pairs.emplace_back(b, unicode_value_to_utf32(b));
    }

    for (int b = 161; b <= 172; ++b) {
        byte_set.insert(b);
        byte_unicode_pairs.emplace_back(b, unicode_value_to_utf32(b));
    }
    for (int b = 174; b <= 255; ++b) {
        byte_set.insert(b);
        byte_unicode_pairs.emplace_back(b, unicode_value_to_utf32(b));
    }

    int n = 0;
    for (int b = 0; b < 256; ++b) {
        if (byte_set.find(b) == byte_set.end()) {
            byte_unicode_pairs.emplace_back(b, unicode_value_to_utf32(n + 256));
            ++n;
        }
    }
    return byte_unicode_pairs;
}


static std::string strip(const std::string& str) {
    std::string::size_type start = str.find_first_not_of(" \t\n\r\v\f");
    std::string::size_type end   = str.find_last_not_of(" \t\n\r\v\f");

    if (start == std::string::npos) {
        return "";
    }
    return str.substr(start, end - start + 1);
}

static std::string whitespace_clean(std::string text) {
    text = std::regex_replace(text, std::regex(R"(\s+)"), " ");
    text = strip(text);
    return text;
}

static std::set<std::pair<std::u32string, std::u32string>> get_pairs(const std::vector<std::u32string>& subwords) {
    std::set<std::pair<std::u32string, std::u32string>> pairs;
    if (subwords.size() < 2) {
        return pairs;
    }
    std::u32string prev_subword = subwords[0];
    for (size_t i = 1; i < subwords.size(); i++) {
        std::u32string subword = subwords[i];
        pairs.emplace(prev_subword, subword);
        prev_subword = subword;
    }
    return pairs;
}


bool CLIPTokenizer::load_from_file(const std::string& merges_file_path) {
    std::ifstream file(merges_file_path);
    if (!file.is_open()) {
        std::cerr << "Failed to open merges file: " << merges_file_path << std::endl;
        return false;
    }
    std::string merges_str((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());
    file.close();
    load_from_merges(merges_str);

    return true;
}


void CLIPTokenizer::load_from_merges(const std::string& merges_utf8_str) {
    auto byte_unicode_pairs = bytes_to_unicode();
    byte_encoder = std::map<int, std::u32string>(byte_unicode_pairs.begin(), byte_unicode_pairs.end());

    std::vector<std::u32string> merges;
    size_t start = 0;
    size_t pos;
    std::u32string merges_utf32_str = utf8_to_utf32(merges_utf8_str);
    while ((pos = merges_utf32_str.find(U'\n', start)) != std::u32string::npos) {
        merges.push_back(merges_utf32_str.substr(start, pos - start));
        start = pos + 1;
    }

    if (!merges.empty()) {
        merges = std::vector<std::u32string>(merges.begin() + 1, merges.end());
    }

    std::vector<std::pair<std::u32string, std::u32string>> merge_pairs;
    for (const auto& merge : merges) {
        size_t space_pos = merge.find(U' ');
        if (space_pos == std::u32string::npos) continue;
        merge_pairs.emplace_back(merge.substr(0, space_pos), merge.substr(space_pos + 1));
    }


    std::vector<std::u32string> vocab;

    for (const auto& pair : byte_unicode_pairs) {
        vocab.push_back(pair.second);
    }

    for (const auto& pair : byte_unicode_pairs) {
        vocab.push_back(pair.second + U"</w>");
    }

    for (const auto& pair : merge_pairs) {
        vocab.push_back(pair.first + pair.second);
    }

    vocab.push_back(U"<|startoftext|>");
    vocab.push_back(U"<|endoftext|>");

    encoder.clear();
    for (size_t i = 0; i < vocab.size(); i++) {
        encoder[vocab[i]] = static_cast<int>(i);
    }


    bpe_ranks.clear();
    for (size_t i = 0; i < merge_pairs.size(); i++) {
        bpe_ranks[merge_pairs[i]] = static_cast<int>(i);
    }
}


std::u32string CLIPTokenizer::bpe(const std::u32string& token) {

    if (token.empty()) return U"";

    std::vector<std::u32string> word;
    for (char32_t c : token) {
        word.emplace_back(1, c);
    }

    if (!word.empty()) {
        word.back() += U"</w>";
    }

    std::set<std::pair<std::u32string, std::u32string>> pairs = get_pairs(word);
    if (pairs.empty()) {
        return token + U"</w>";
    }


    while (true) {

        auto min_pair_iter = pairs.end();
        int min_rank = INT_MAX;
        for (const auto& pair : pairs) {
            auto it = bpe_ranks.find(pair);
            if (it != bpe_ranks.end() && it->second < min_rank) {
                min_rank = it->second;
                min_pair_iter = pairs.find(pair);
            }
        }

        if (min_pair_iter == pairs.end()) break;

        const auto& bigram = *min_pair_iter;
        std::u32string first = bigram.first;
        std::u32string second = bigram.second;


        std::vector<std::u32string> new_word;
        size_t i = 0;
        while (i < word.size()) {
            auto it = std::find(word.begin() + i, word.end(), first);
            if (it == word.end()) {
                new_word.insert(new_word.end(), word.begin() + i, word.end());
                break;
            }
            new_word.insert(new_word.end(), word.begin() + i, it);
            i = std::distance(word.begin(), it);

            if (i < word.size() - 1 && word[i] == first && word[i+1] == second) {
                new_word.push_back(first + second);
                i += 2;
            } else {
                new_word.push_back(word[i]);
                i += 1;
            }
        }

        word = new_word;
        if (word.size() == 1) break;
        pairs = get_pairs(word);
    }


    std::u32string result;
    for (size_t i = 0; i < word.size(); i++) {
        if (i > 0) result += U" ";
        result += word[i];
    }
    return result;
}


std::vector<int> CLIPTokenizer::encode(std::string text) {

    std::vector<int> bpe_tokens;
    text = whitespace_clean(text);
    std::transform(text.begin(), text.end(), text.begin(), [](unsigned char c) {
        return std::tolower(c);
    });

    std::regex pat(R"(<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[a-zA-Z]+|[0-9]+|[^\s\w]+)");
    std::sregex_iterator it(text.begin(), text.end(), pat);
    std::sregex_iterator end;

    for (; it != end; ++it) {
        std::string token_str = it->str();
        std::u32string utf32_token;

        for (char b : token_str) {
            auto encoder_it = byte_encoder.find(static_cast<unsigned char>(b));
            if (encoder_it != byte_encoder.end()) {
                utf32_token += encoder_it->second;
            }
        }

        std::u32string bpe_strs = bpe(utf32_token);
        if (bpe_strs.empty()) continue;

        size_t start = 0;
        size_t pos;
        while ((pos = bpe_strs.find(U' ', start)) != std::u32string::npos) {
            auto bpe_str = bpe_strs.substr(start, pos - start);
            auto token_it = encoder.find(bpe_str);
            if (token_it != encoder.end()) {
                bpe_tokens.push_back(token_it->second);
            }
            start = pos + 1;
        }
        auto bpe_str = bpe_strs.substr(start);
        auto token_it = encoder.find(bpe_str);
        if (token_it != encoder.end()) {
            bpe_tokens.push_back(token_it->second);
        }
    }

    return bpe_tokens;
}


std::vector<int> CLIPTokenizer::tokenize(std::string text, size_t max_length, bool padding)
 {
    std::vector<int> tokens = encode(text);
    tokens.insert(tokens.begin(), BOS_TOKEN_ID);

    if (max_length > 0) {
        if (tokens.size() > max_length - 1) {
            tokens.resize(max_length - 1);
            tokens.push_back(EOS_TOKEN_ID);
        } else {
            tokens.push_back(EOS_TOKEN_ID);
            if (padding && tokens.size() < max_length) {
                tokens.insert(tokens.end(), max_length - tokens.size(), PAD_TOKEN_ID);
            }
        }
    }

    return tokens;
}
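The tokenize contract above (prepend BOS, append EOS, truncate to max_length, pad with the shared EOS/PAD id 49407) can be illustrated with a small Python mirror; the token ids in the example are hypothetical:

```python
BOS_TOKEN_ID = 49406
EOS_TOKEN_ID = 49407   # CLIP reuses this id for PAD and UNK as well
PAD_TOKEN_ID = 49407

def pad_to_fixed(ids, max_length=20, padding=True):
    # mirrors CLIPTokenizer::tokenize: prepend BOS, append EOS,
    # truncate to max_length, then pad with PAD up to max_length
    tokens = [BOS_TOKEN_ID] + ids
    if len(tokens) > max_length - 1:
        tokens = tokens[:max_length - 1] + [EOS_TOKEN_ID]
    else:
        tokens.append(EOS_TOKEN_ID)
        if padding and len(tokens) < max_length:
            tokens += [PAD_TOKEN_ID] * (max_length - len(tokens))
    return tokens

# hypothetical ids standing in for an encoded caption
toks = pad_to_fixed([320, 1125, 539, 320, 4558])
print(len(toks), toks[0], toks[-1])   # -> 20 49406 49407
```

This is why the fixed (1, 20) input shape works without an attention_mask: every sequence handed to the model is exactly 20 tokens long.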

Post-processing: clip_post.cpp

/*
 * Company:    AW
 * Author:     zhongzixins
 * Date:    2026/02/09
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include "model_config.h"

typedef struct {
    int img_index;
    int text_index;
    float score;
} clip_res;

typedef struct {
    float value;
    int index;
} element_t;

static void swap(element_t* a, element_t* b) {
    element_t temp = *a;
    *a = *b;
    *b = temp;
}

static int partition(element_t arr[], int low, int high) {
    float pivot = arr[high].value;
    int i = low - 1;

    for (int j = low; j <= high - 1; j++) {
        if (arr[j].value >= pivot) {
            i++;
            swap(&arr[i], &arr[j]);
        }
    }

    swap(&arr[i + 1], &arr[high]);
    return (i + 1);
}

static void quick_sort(element_t arr[], int low, int high) {
    if (low < high) {
        int pi = partition(arr, low, high);
        quick_sort(arr, low, pi - 1);
        quick_sort(arr, pi + 1, high);
    }
}

static void softmax(float* arr, int size) {
    float max_val = arr[0];
    for (int i = 1; i < size; i++) {
        if (arr[i] > max_val) {
            max_val = arr[i];
        }
    }

    for (int i = 0; i < size; i++) {
        arr[i] -= max_val;
    }

    float sum = 0.0;
    for (int i = 0; i < size; i++) {
        arr[i] = expf(arr[i]);
        sum += arr[i];
    }

    for (int i = 0; i < size; i++) {
        arr[i] /= sum;
    }
}

static void element_multiply(float* arr, int size, const float scale) {
    for (int i = 0; i < size; i++) {
        arr[i] *= scale;
    }
}

static void matmul_by_cpu(float* A, float* B, float* out, int A_rows, int A_B_cols, int B_rows) {
    float temp;
    for (int i = 0; i < A_rows; i++) {
        for (int j = 0; j < B_rows; j++) {
            temp = 0;
            for (int k = 0; k < A_B_cols; k++) {
                temp += A[i * A_B_cols + k] * B[j * A_B_cols + k];
            }
            out[i * B_rows + j] = temp;
        }
    }
}

static void get_result_with_index(float* arr, int size, int text_num, clip_res* res) {
    element_t* elements = (element_t*)malloc(size * sizeof(element_t));
    for (int i = 0; i < size; i++) {
        elements[i].value = arr[i];
        elements[i].index = i;
    }

    quick_sort(elements, 0, size - 1);

    res->img_index = elements[0].index / text_num;
    res->text_index = elements[0].index % text_num;
    res->score = elements[0].value;

    free(elements);
}

int post_process(float* img_output, float* text_output, int img_num, int text_num, clip_res* out_res) {
    int out_size = img_num * text_num;
    float* matmul_out = (float*)malloc(out_size * sizeof(float));
    float logit_scale = 4.605170249938965;

    matmul_by_cpu(img_output, text_output, matmul_out, img_num, FEAT_DIM, text_num);

    element_multiply(matmul_out, out_size, expf(logit_scale));

    softmax(matmul_out, out_size);

    get_result_with_index(matmul_out, out_size, text_num, out_res);

    if (matmul_out != NULL) {
        free(matmul_out);
    }


    return 0;
}
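For reference, the pipeline in post_process (dot products over the feature vectors, scaling by exp(logit_scale) ≈ 100, a softmax over the flattened image-text matrix, then taking the best index) can be mirrored in Python. The toy 2-d feature vectors below stand in for real FEAT_DIM=512 model outputs:

```python
import math

LOGIT_SCALE = 4.605170249938965   # same constant as the C code; exp(...) ~= 100

def post_process(img_feats, text_feats):
    """Mirror of the C post_process: returns (img_index, text_index, score)."""
    scale = math.exp(LOGIT_SCALE)
    # similarity matrix, flattened row-major like matmul_by_cpu's output
    logits = [scale * sum(a * b for a, b in zip(iv, tv))
              for iv in img_feats for tv in text_feats]
    # numerically stable softmax over the whole flattened matrix, as in the C code
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    probs = [e / s for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    text_num = len(text_feats)
    return best // text_num, best % text_num, probs[best]

# toy example: one image feature that clearly matches the second text feature
img = [[1.0, 0.0]]
txt = [[0.0, 1.0], [0.96, 0.28]]
print(post_process(img, txt))   # -> (0, 1, ~1.0)
```

Running both implementations on the same features is a quick way to validate the board-side C code against a known-good result.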

Main function: main()

/*
 * Company:    AW
 * Author:     zhuzhiyongs
 * Date:    2026/03/19
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#include <opencv2/core/core.hpp>

#include "npulib.h"
#include "model_config.h"

/*-------------------------------------------
        Macros and Variables
-------------------------------------------*/

extern int mobilesam_preprocess(const char* imagepath, void* buff_ptr, unsigned int buff_size);
extern int mobilesam_postprocess(const char *imagepath, float **output);
extern void calculate_scale_and_padding(const char* image_file, float& scale_letterbox, int& top, int& left);

const char *usage =
    "model_demo -encoder encoder_model_path -decoder decoder_model_path -i input_path -l loop_run_count -m malloc_mbyte \n"
    "-encoder encoder_model_path:    the encoder NBG file path.\n"
    "-decoder decoder_model_path:    the decoder NBG file path.\n"
    "-i input_path:     the input file path.\n"
    "-l loop_run_count: the number of loop run network.\n"
    "-m malloc_mbyte:   npu_unit init memory Mbytes.\n"
    "-h : help\n"
    "example: model_demo -encoder encoder.nb -decoder decoder.nb -i input.jpg -l 10 -m 20 \n";

enum time_idx_e {
    NPU_INIT = 0,
    NETWORK_CREATE,
    NETWORK_PREPARE,
    NETWORK_PREPROCESS,
    NETWORK_RUN,
    NETWORK_LOOP,
    TIME_IDX_MAX = 9
};

// float -> fp16 
uint16_t float_to_fp16(float value) {
    uint32_t f32 = *(uint32_t*)&value;
    uint32_t sign = (f32 >> 31) & 0x1;
    uint32_t exponent = (f32 >> 23) & 0xff;
    uint32_t mantissa = f32 & 0x7fffff;

    if (exponent == 0xff) {
        return (sign << 15) | 0x7c00 | (mantissa >> 13);
    } else if (exponent == 0) {
        if (mantissa == 0) {
            return sign << 15;
        } else {
            int shift = __builtin_clz(mantissa) - 8;
            mantissa <<= shift;
            exponent = 1 - shift;
            return (sign << 15) | ((exponent & 0x1f) << 10) | (mantissa >> 13);
        }
    } else {
        int new_exponent = exponent - 127 + 15;
        if (new_exponent > 0x1f) {
            return (sign << 15) | 0x7c00;
        } else if (new_exponent < 1) {
            int shift = 1 - new_exponent;
            mantissa = (0x800000 | mantissa) >> shift;
            return (sign << 15) | (mantissa >> 13);
        } else {
            return (sign << 15) | ((new_exponent & 0x1f) << 10) | (mantissa >> 13);
        }
    }
}

// float -> fp16
void float_array_to_fp16(const float* float_array, uint16_t* fp16_array, size_t size) {
    for (size_t i = 0; i < size; i++) {
        fp16_array[i] = float_to_fp16(float_array[i]);
    }
}

int main(int argc, char** argv)
{
    int status = 0;
    int i = 0;
    unsigned int count = 0;
    long long total_infer_time = 0;

    char *encoder_model_file = nullptr;
    char *decoder_model_file = nullptr;
    char *input_file = nullptr;
    unsigned int loop_count = 1;

    if (argc < 2) {
        printf("%s\n", usage);
        return -1;
    }

    for (i = 1; i < argc; i++) {
        if (!strcmp(argv[i], "-encoder")) {
            encoder_model_file = argv[++i];
        }
        else if (!strcmp(argv[i], "-decoder")) {
            decoder_model_file = argv[++i];
        }
        else if (!strcmp(argv[i], "-i")) {
            input_file = argv[++i];
        }
        else if (!strcmp(argv[i], "-l")) {
            loop_count = atoi(argv[++i]);
        }
        else if (!strcmp(argv[i], "-h")) {
            printf("%s\n", usage);
            return 0;
        }
    }
    printf("encoder_model_file=%s, decoder_model_file=%s, input=%s, loop_count=%d \n", encoder_model_file, decoder_model_file, input_file, loop_count);

    if (encoder_model_file == nullptr || decoder_model_file == nullptr)
        return -1;

    /* NPU init*/
    NpuUint npu_uint;

    int ret = npu_uint.npu_init();
    if (ret != 0) {
        return -1;
    }

    // encoder
    NetworkItem encoder;
    unsigned int encoder_id = 0;
    status = encoder.network_create(encoder_model_file, encoder_id);
    if (status != 0) {
        printf("encoder network create failed.\n");
        return -1;
    }

    status = encoder.network_prepare();
    if (status != 0) {
        printf("encoder network prepare fail, status=%d\n", status);
        return -1;
    }

    // decoder
    NetworkItem decoder;
    unsigned int decoder_id = 1;
    status = decoder.network_create(decoder_model_file, decoder_id);
    if (status != 0) {
        printf("decoder network create failed.\n");
        return -1;
    }

    status = decoder.network_prepare();
    if (status != 0) {
        printf("decoder network prepare fail, status=%d\n", status);
        return -1;
    }

    TimeBegin(NETWORK_PREPROCESS);
    // input jpg file, no copy way
    void *input_buffer_ptr = nullptr;
    unsigned int input_buffer_size = 0;
    encoder.get_network_input_buff_info(0, &input_buffer_ptr, &input_buffer_size);

    printf("encoder input buffer ptr: %p, buffer size: %d \n", input_buffer_ptr, input_buffer_size);

    mobilesam_preprocess(input_file, input_buffer_ptr, input_buffer_size);

    TimeEnd(NETWORK_PREPROCESS);
    printf("feed input cost: %lu us.\n", (unsigned long)TimeGet(NETWORK_PREPROCESS));

    // create encoder output buffer
    int encoder_output_cnt = encoder.get_output_cnt();
    float **encoder_output_data = new float*[encoder_output_cnt]();
    for (int i = 0; i < encoder_output_cnt; i++)
        encoder_output_data[i] = new float[encoder.m_output_data_len[i]];

    // create decoder output buffer
    int decoder_output_cnt = decoder.get_output_cnt();
    float **decoder_output_data = new float*[decoder_output_cnt]();
    for (int i = 0; i < decoder_output_cnt; i++) {
        decoder_output_data[i] = new float[decoder.m_output_data_len[i]];
    }

    i = 0;
    /* run network */
    TimeBegin(NETWORK_LOOP);
    while (count < loop_count) {
        count++;

        // run encoder
        status = encoder.network_input_output_set();
        if (status != 0) {
            printf("set encoder input/output failed.\n");
            return -1;
        }

        #if defined (__linux__)
        TimeBegin(NETWORK_RUN);
        #endif

        status = encoder.network_run();
        if (status != 0) {
            printf("fail to run encoder, status=%d\n", status);
            return -2;
        }

        #if defined (__linux__)
        TimeEnd(NETWORK_RUN);
        printf("encoder run time: %lu us.\n", (unsigned long)TimeGet(NETWORK_RUN));
        #endif

        total_infer_time += (unsigned long)TimeGet(NETWORK_RUN);

        // get encoder output
        encoder.get_output(encoder_output_data);
        // Use the encoder output as the decoder input (1, 256, 28, 28).
        void *decoder_input_ptr = nullptr;
        unsigned int decoder_input_size = 0;
        int ret = decoder.get_network_input_buff_info(0, &decoder_input_ptr, &decoder_input_size);
        if (ret == 0 && decoder_input_ptr != nullptr && decoder_input_size > 0) {
            // fp32 -> fp16
            float_array_to_fp16(encoder_output_data[0], (uint16_t*)decoder_input_ptr, encoder.m_output_data_len[0]);
        } else {
            printf("Error: Failed to get decoder input 0 buffer info\n");
            return -1;
        }

        // point_coords input (1, 2, 2)
        void *point_coords_ptr = nullptr;
        unsigned int point_coords_size = 0;
        ret = decoder.get_network_input_buff_info(1, &point_coords_ptr, &point_coords_size);
        if (ret == 0 && point_coords_ptr != nullptr && point_coords_size > 0) {

            float scale_letterbox;
            int top, left;
            calculate_scale_and_padding(input_file, scale_letterbox, top, left);

            float point_coords_float[4] = {
                TOP_LEFT_X * scale_letterbox + left,  // x1
                TOP_LEFT_Y * scale_letterbox + top,  // y1
                BOTTOM_RIGHT_X * scale_letterbox + left,  // x2
                BOTTOM_RIGHT_Y * scale_letterbox + top   // y2
            };
            for (int i = 0; i < 4; i++) {
                if (point_coords_float[i] < 0.0f) point_coords_float[i] = 0.0f;
                if (point_coords_float[i] > LETTERBOX_ROWS) point_coords_float[i] = LETTERBOX_ROWS;
            }
            // float32 -> fp16
            uint16_t point_coords_fp16[4];
            float_array_to_fp16(point_coords_float, point_coords_fp16, 4);
            memcpy(point_coords_ptr, point_coords_fp16, sizeof(point_coords_fp16));
        } else {
            printf("Error: Failed to get decoder input 1 buffer info\n");
            return -1;
        }
        // point_labels
        void *point_labels_ptr = nullptr;
        unsigned int point_labels_size = 0;
        ret = decoder.get_network_input_buff_info(2, &point_labels_ptr, &point_labels_size);
        if (ret == 0 && point_labels_ptr != nullptr && point_labels_size > 0) {
            float point_labels_float[2] = {2.0f, 3.0f};
            // float32 -> fp16
            uint16_t point_labels_fp16[2];
            float_array_to_fp16(point_labels_float, point_labels_fp16, 2);
            memcpy(point_labels_ptr, point_labels_fp16, sizeof(point_labels_fp16));
        } else {
            printf("Error: Failed to get decoder input 2 buffer info\n");
            return -1;
        }

        // mask_input (1, 1, 112, 112)
        void *mask_input_ptr = nullptr;
        unsigned int mask_input_size = 0;
        ret = decoder.get_network_input_buff_info(3, &mask_input_ptr, &mask_input_size);
        if (ret == 0 && mask_input_ptr != nullptr && mask_input_size > 0) {
            memset(mask_input_ptr, 0, mask_input_size);
        } else {
            printf("Error: Failed to get decoder input 3 buffer info\n");
            return -1;
        }
        // has_mask_input
        void *has_mask_input_ptr = nullptr;
        unsigned int has_mask_input_size = 0;
        ret = decoder.get_network_input_buff_info(4, &has_mask_input_ptr, &has_mask_input_size);
        if (ret == 0 && has_mask_input_ptr != nullptr && has_mask_input_size > 0) {
            uint8_t has_mask_input_uint8 = 0;
            memcpy(has_mask_input_ptr, &has_mask_input_uint8, sizeof(has_mask_input_uint8));
        } else {
            printf("Error: Failed to get decoder input 4 buffer info\n");
            return -1;
        }

        // run decoder
        status = decoder.network_input_output_set();
        if (status != 0) {
            printf("set decoder input/output failed.\n");
            return -1;
        }

        #if defined (__linux__)
        TimeBegin(NETWORK_RUN);
        #endif

        status = decoder.network_run();
        if (status != 0) {
            printf("fail to run decoder, status=%d\n", status);
            return -2;
        }

        #if defined (__linux__)
        TimeEnd(NETWORK_RUN);
        printf("decoder run time: %lu us.\n", (unsigned long)TimeGet(NETWORK_RUN));
        total_infer_time += (unsigned long)TimeGet(NETWORK_RUN);
        #endif

        // get decoder output
        decoder.get_output(decoder_output_data);

        // postprocess
        mobilesam_postprocess(input_file, decoder_output_data);

    }
    TimeEnd(NETWORK_LOOP);

    if (loop_count > 1) {
        printf("this network run avg inference time=%u us, total avg cost: %u us\n",
                (uint32_t)(total_infer_time / loop_count), (unsigned int)(TimeGet(NETWORK_LOOP) / loop_count));
    }

    // free output buffer
    for (int i = 0; i < encoder_output_cnt; i++) {
        delete[] encoder_output_data[i];
        encoder_output_data[i] = nullptr;
    }
    if (encoder_output_data != nullptr)
        delete[] encoder_output_data;

    for (int i = 0; i < decoder_output_cnt; i++) {
        delete[] decoder_output_data[i];
        decoder_output_data[i] = nullptr;
    }
    if (decoder_output_data != nullptr)
        delete[] decoder_output_data;

    return ret;
}
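The listing above packs float32 values into fp16 input buffers through `float_array_to_fp16`, whose implementation is not shown here. The following is a minimal sketch of one way such a conversion can work (mantissa truncated, denormal results flushed to signed zero, NaN collapsed to Inf); the actual helper in the repo may round differently:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Minimal float32 -> IEEE 754 binary16 conversion (round toward zero).
static uint16_t f32_to_fp16(float f) {
    uint32_t x;
    std::memcpy(&x, &f, sizeof(x));                           // bit-level view of the float
    uint16_t sign = (uint16_t)((x >> 16) & 0x8000u);          // sign bit
    int32_t  exp  = (int32_t)((x >> 23) & 0xFFu) - 127 + 15;  // re-bias exponent (127 -> 15)
    uint32_t mant = x & 0x7FFFFFu;                            // 23-bit mantissa
    if (exp >= 31) return (uint16_t)(sign | 0x7C00u);         // overflow / Inf / NaN -> Inf
    if (exp <= 0)  return sign;                               // underflow -> signed zero
    return (uint16_t)(sign | (exp << 10) | (mant >> 13));     // keep top 10 mantissa bits
}

// Array variant mirroring the float_array_to_fp16 calls in the listing.
static void float_array_to_fp16_sketch(const float *in, uint16_t *out, int n) {
    for (int i = 0; i < n; ++i) out[i] = f32_to_fp16(in[i]);
}
```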

Model Conversion

Note: enter the Allwinner Docker environment and change into awnpu_model_zoo\examples\clip\ for the following steps.

clip-images

# use xxx_env.sh to create the symlinks
./convert_model_env.sh

# Import
# pegasus_import.sh <model_name>
./pegasus_import.sh clip-images

# Quantize
# pegasus_quantize.sh <model_name> <quantize_type> <calibration_set_size>
./pegasus_quantize.sh clip-images uint8 10

# Simulation (optional)
# pegasus_inference.sh <model_name> <quantize_type>
./pegasus_inference.sh clip-images uint8

# Export the nb model
# pegasus_export_ovx_nbg.sh <model_name> <quantize_type> <platform>
./pegasus_export_ovx_nbg.sh clip-images uint8 mr536

# The exported model file is placed in the ../model directory,
# e.g. ../model/clip-images_uint8_mr536.nb

clip-text

# use xxx_env.sh to create the symlinks
./convert_model_env.sh

# Import
# pegasus_import.sh <model_name>
./pegasus_import.sh clip-text

# Quantize
# pegasus_quantize.sh <model_name> <quantize_type> <calibration_set_size>
./pegasus_quantize.sh clip-text int16 10

# Simulation (optional)
# pegasus_inference.sh <model_name> <quantize_type>
./pegasus_inference.sh clip-text int16

# Export the nb model
# pegasus_export_ovx_nbg.sh <model_name> <quantize_type> <platform>
./pegasus_export_ovx_nbg.sh clip-text int16 mr536

# The exported model file is placed in the ../model directory,
# e.g. ../model/clip-text_int16_mr536.nb

Cross Compilation

The build toolchain is set up the same way as for the other models; the build commands are given directly:

cd ../examples/clip/
./../build_linux.sh -t mr536

An install directory is generated under ./examples/clip/ with the following structure:

`-- clip_demo_linux_mr536
    |-- clip_demo_mr536
    `-- model
        |-- clip-images_uint8_mr536.nb
        |-- clip-text_int16_mr536.nb
        |-- demo.png
        |-- demo.txt
        `-- merges.txt

Model Inference

Push the generated files to the board (adb is one option; any transfer method works):

adb push .\install\clip_demo_linux_mr536 /mnt/UDISK/

Run inference:

chmod +x ./clip_demo_mr536
./clip_demo_mr536 -ib model/clip-images_uint8_mr536.nb -tb model/clip-text_int16_mr536.nb -i model/demo.png -t model/demo.txt
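The demo takes its model and input paths as paired flags (-ib, -tb, -i, -t). As a hypothetical sketch of how such flag/value pairs can be collected (the demo's actual argument parsing in main.cpp may differ):

```cpp
#include <cassert>
#include <map>
#include <string>

// Collect "-flag value" pairs from an argv-style array into a map.
// Flags without a following value are ignored in this sketch.
static std::map<std::string, std::string> parse_flags(int argc, const char *argv[]) {
    std::map<std::string, std::string> opts;
    for (int i = 1; i + 1 < argc; i += 2) {
        if (argv[i][0] == '-') opts[argv[i]] = argv[i + 1];
    }
    return opts;
}
```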

Output on the board:

images: model/demo.png
text  : a photo of a motorcycle
score : 0.998
destory npu finished.
~NpuUint.
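The score printed above is the image-text matching probability produced in post-processing. As a rough sketch of the usual CLIP-style scoring (not the repo's clip_post.cpp): compute cosine similarity between the image embedding and each text embedding, scale by a logit scale (roughly 100 for the released OpenAI CLIP checkpoints), then softmax over the texts:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Cosine similarity between two embedding vectors of equal length.
static float cosine_sim(const std::vector<float> &a, const std::vector<float> &b) {
    float dot = 0.f, na = 0.f, nb = 0.f;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-12f);  // guard against zero norm
}

// Softmax over similarity logits -> per-text matching probabilities.
static std::vector<float> softmax(std::vector<float> logits) {
    float m = logits[0];
    for (float v : logits) if (v > m) m = v;                // subtract max for stability
    float sum = 0.f;
    for (float &v : logits) { v = std::exp(v - m); sum += v; }
    for (float &v : logits) v /= sum;
    return logits;
}
```

With one image and several candidate prompts, the logit for each prompt would be `logit_scale * cosine_sim(image_emb, text_emb)`, and the printed score is the softmax probability of the best-matching prompt.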
