
Exporting a CLIP Model to ONNX


OpenAI CLIP repository:

GitHub - openai/CLIP: CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image

Model weights:

https://huggingface.co/openai/clip-vit-base-patch32 
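
The export script below loads the checkpoint from a local directory named clip-vit-base-patch32, so download the weights first (or pass "openai/clip-vit-base-patch32" directly to from_pretrained). A minimal download sketch, assuming the huggingface_hub package with local_dir support is installed:

# Sketch: fetch the checkpoint into a local folder so that
# from_pretrained("clip-vit-base-patch32") resolves offline.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="openai/clip-vit-base-patch32",
    local_dir="clip-vit-base-patch32",  # directory name used by the export script below
)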

from PIL import Image
import torch
import torch.nn as nn
from transformers import CLIPProcessor, CLIPModel

# Load the model and processor from the local clip-vit-base-patch32 directory
# (or use "openai/clip-vit-base-patch32" to download from the Hugging Face Hub).
model = CLIPModel.from_pretrained("clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("clip-vit-base-patch32")

# Run the original model once to obtain example inputs for the export.
image = Image.open("000000039769.jpg")  # local copy of the COCO demo image
inputs = processor(text=["a photo of a cat"], images=image, return_tensors="pt", padding=True)
print("inputs:", inputs.keys())

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)      # softmax to get the label probabilities


# Wrap the image and text towers so each one can be exported as its own ONNX graph.
class ImgModelWrapper(nn.Module):
    def __init__(self, model):
        super(ImgModelWrapper, self).__init__()
        self.model = model

    def forward(self, pixel_values):
        image_features = self.model.get_image_features(pixel_values=pixel_values)
        return image_features


class TxtModelWrapper(nn.Module):
    def __init__(self, model):
        super(TxtModelWrapper, self).__init__()
        self.model = model

    def forward(self, input_ids, attention_mask):
        text_features = self.model.get_text_features(input_ids=input_ids, attention_mask=attention_mask)
        return text_features


img_model = ImgModelWrapper(model)
txt_model = TxtModelWrapper(model)

torch.onnx.export(img_model,               # model being run
                  (inputs.pixel_values,),  # model input (or a tuple for multiple inputs)
                  "clip_img.onnx",         # where to save the model (can be a file or file-like object)
                  export_params=True,      # store the trained parameter weights inside the model file
                  opset_version=15,        # the ONNX opset version to export the model to
                  do_constant_folding=False,     # whether to execute constant folding for optimization
                  input_names=['pixel_values'],  # the model's input names
                  # output_names=['output'],     # the model's output names
                  # dynamic_axes={'pixel_values': {0: 'batch', 2: 'height', 3: 'width'}},
                  )

torch.onnx.export(txt_model,               # model being run
                  (inputs.input_ids, inputs.attention_mask),  # model input (or a tuple for multiple inputs)
                  "clip_txt.onnx",         # where to save the model (can be a file or file-like object)
                  export_params=True,      # store the trained parameter weights inside the model file
                  opset_version=15,        # the ONNX opset version to export the model to
                  do_constant_folding=False,                   # whether to execute constant folding for optimization
                  input_names=['input_ids', 'attention_mask'],  # the model's input names
                  # output_names=['output'],                    # the model's output names
                  dynamic_axes={'input_ids': {0: 'batch', 1: 'seq'},
                                'attention_mask': {0: 'batch', 1: 'seq'}},
                  )
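
After the export it is worth checking that the two ONNX graphs reproduce the PyTorch features. A minimal verification sketch, assuming onnxruntime is installed (the session and variable names here are illustrative, not part of the original script):

# Sketch: compare ONNX Runtime outputs with the PyTorch wrappers above.
import numpy as np
import onnxruntime as ort

img_sess = ort.InferenceSession("clip_img.onnx", providers=["CPUExecutionProvider"])
txt_sess = ort.InferenceSession("clip_txt.onnx", providers=["CPUExecutionProvider"])

# Feed the same example inputs that were used for the export.
onnx_img = img_sess.run(None, {"pixel_values": inputs.pixel_values.numpy()})[0]
onnx_txt = txt_sess.run(None, {"input_ids": inputs.input_ids.numpy(),
                               "attention_mask": inputs.attention_mask.numpy()})[0]

with torch.no_grad():
    torch_img = img_model(inputs.pixel_values).numpy()
    torch_txt = txt_model(inputs.input_ids, inputs.attention_mask).numpy()

# Differences should be at floating-point noise level (e.g. below 1e-4).
print("image feature max diff:", np.abs(onnx_img - torch_img).max())
print("text feature max diff:", np.abs(onnx_txt - torch_txt).max())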

Chinese-CLIP can be exported with essentially the same wrapper approach as above; a sketch follows the repository link below.

GitHub - OFA-Sys/Chinese-CLIP: Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
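
A sketch of the same export applied to Chinese-CLIP, assuming the ChineseCLIPModel / ChineseCLIPProcessor classes shipped with recent transformers versions and the OFA-Sys/chinese-clip-vit-base-patch16 checkpoint (the output file names are illustrative); it reuses the wrapper classes and the example image defined above:

# Sketch: apply the same wrappers to a Chinese-CLIP checkpoint via transformers.
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

cn_model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
cn_processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")

cn_inputs = cn_processor(text=["一只猫的照片"], images=image,
                         return_tensors="pt", padding=True)

torch.onnx.export(ImgModelWrapper(cn_model),
                  (cn_inputs.pixel_values,),
                  "chinese_clip_img.onnx",
                  export_params=True,
                  opset_version=15,
                  input_names=['pixel_values'])

torch.onnx.export(TxtModelWrapper(cn_model),
                  (cn_inputs.input_ids, cn_inputs.attention_mask),
                  "chinese_clip_txt.onnx",
                  export_params=True,
                  opset_version=15,
                  input_names=['input_ids', 'attention_mask'],
                  dynamic_axes={'input_ids': {0: 'batch', 1: 'seq'},
                                'attention_mask': {0: 'batch', 1: 'seq'}})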

 
