
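Building on the earlier tutorial (DashVector + ModelScope multimodal retrieval), this tutorial upgrades multimodal retrieval by combining ONE-PEACE, the general-purpose multimodal representation model newly released on DashScope, with the DashVector vector retrieval service. The sections below demonstrate a richer set of multimodal retrieval capabilities.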




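  1. Multimodal data embedding and ingestion. The Embedding API of the ONE-PEACE model service converts dataset items of various modalities into high-dimensional vectors, which are then stored in DashVector.

  2. Multimodal query retrieval. With the multimodal Embedding capability of ONE-PEACE, inputs of different modalities can be combined freely, for example text only, text + audio, or audio + image. The resulting Embedding vector is then used to retrieve similar results across modalities through DashVector.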


1. Prepare the API-KEY

2. Prepare the environment



Python 3.7 or later must be installed in advance; please check that your Python version meets this requirement.

    # Install the dashscope and dashvector SDKs
    pip3 install dashscope dashvector
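
The version requirement above can also be checked from inside Python before running the later scripts. This is a minimal sketch; the helper name `meets_minimum_version` is illustrative and not part of either SDK.

```python
import sys

# Minimum version stated in this tutorial.
MIN_VERSION = (3, 7)


def meets_minimum_version(version_info=None, minimum=MIN_VERSION):
    """Return True if the given (or running) interpreter is new enough."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor) so patch levels do not matter.
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if not meets_minimum_version():
        raise SystemExit("Python 3.7 or later is required for this tutorial.")
    print("Python version OK:", sys.version.split()[0])
```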


1. Data preparation


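The DashScope ONE-PEACE model service currently accepts image and audio input only as URLs, so the dataset must first be uploaded to publicly accessible network storage (for example OSS or S3), and a list of URLs for the images and audio files must be collected.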
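
Once the files are uploaded, the URL list can be generated rather than typed by hand. The sketch below assumes the objects are publicly readable under a single base URL; the bucket address and object keys are placeholders, and `build_url_list` and `write_url_file` are hypothetical helper names.

```python
from urllib.parse import quote


def build_url_list(base_url, object_keys):
    """Join a public base URL with object keys, percent-encoding each key."""
    base = base_url.rstrip("/")
    return [f"{base}/{quote(key.lstrip('/'))}" for key in object_keys]


def write_url_file(path, urls):
    """Write one URL per line."""
    with open(path, "w", encoding="utf-8") as f:
        for url in urls:
            f.write(url + "\n")


if __name__ == "__main__":
    # Hypothetical bucket address and object keys, for illustration only.
    urls = build_url_list(
        "https://your-bucket.example.com/imagenet1k-val",
        ["ILSVRC2012_val_00000001.JPEG", "ILSVRC2012_val_00000002.JPEG"],
    )
    write_url_file("imagenet1k-urls.txt", urls)
```

Percent-encoding the keys keeps file names with spaces or non-ASCII characters usable as URLs.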


2. Embed and index the data


Every your-xxx-api-key and your-xxx-cluster-endpoint in this tutorial must be replaced with your own API-KEY and CLUSTER_ENDPOINT before the code will run.
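
As an alternative to pasting keys into each script, the values can be read from environment variables. This is only a sketch: the three variable names and the helper `load_credentials` are assumptions made for this example, not names required by the tutorial.

```python
import os


def load_credentials(env=None):
    """Read the three values this tutorial needs from environment variables.

    The variable names below are an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    names = {
        "dashscope_api_key": "DASHSCOPE_API_KEY",
        "dashvector_api_key": "DASHVECTOR_API_KEY",
        "dashvector_endpoint": "DASHVECTOR_ENDPOINT",
    }
    # Fail early with a readable message instead of a later authentication error.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}
```

A script would then set `dashscope.api_key = creds["dashscope_api_key"]` and pass the other two values to `Client(...)`, where `creds = load_credentials()`.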


    import dashscope
    from dashscope import MultiModalEmbedding
    from dashvector import Client, Doc, DashVectorException

    dashscope.api_key = '{your-dashscope-api-key}'

    # The ONE-PEACE model service currently accepts image and audio input only as URLs,
    # so upload the dataset to public network storage (for example OSS or S3) first
    # and collect the URLs of the images and audio files.
    # This file stores one public image URL per line and sits in the same directory as this script.
    IMAGENET1K_URLS_FILE_PATH = "imagenet1k-urls.txt"


    def index_image():
        # Initialize the DashVector client
        client = Client(
            api_key='{your-dashvector-api-key}',
            endpoint='{your-dashvector-cluster-endpoint}'
        )

        # Create the collection: name and vector dimension (ONE-PEACE vectors always have 1536 dimensions)
        rsp = client.create('imagenet1k_val_embedding', 1536)
        if not rsp:
            raise DashVectorException(rsp.code, reason=rsp.message)

        # Generate image Embeddings with the DashScope ONE-PEACE model and insert them into DashVector
        collection = client.get('imagenet1k_val_embedding')
        with open(IMAGENET1K_URLS_FILE_PATH, 'r') as file:
            for i, line in enumerate(file):
                url = line.strip('\n')
                input = [{'image': url}]
                result = MultiModalEmbedding.call(model=MultiModalEmbedding.Models.multimodal_embedding_one_peace_v1,
                                                  input=input,
                                                  auto_truncation=True)
                if result.status_code != 200:
                    print(f"ONE-PEACE failed to generate embedding of {url}, result: {result}")
                    continue
                embedding = result.output["embedding"]
                collection.insert(
                    Doc(
                        id=str(i),
                        vector=embedding,
                        fields={'image_url': url}
                    )
                )
                if (i + 1) % 100 == 0:
                    print(f"---- Succeeded to insert {i + 1} image embeddings")


    if __name__ == '__main__':
        index_image()


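  1. The code above calls the DashScope ONE-PEACE multimodal Embedding model, so overall running time varies with the QPS quota of your subscription to that service.

  2. Image size affects whether ONE-PEACE can produce an Embedding, so after the code finishes, the number of records actually stored may be fewer than 50,000.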
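
Given the QPS limit and the occasional failed image, one option is to wrap the embedding call in a small retry helper with exponential backoff. The sketch below is generic Python; `call_with_retry` and its parameters are hypothetical names, and deciding which failures are worth retrying (a rate limit, say, rather than an oversized image) is left to the `is_success` callback.

```python
import time


def call_with_retry(func, is_success, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until is_success(result) is true or attempts run out.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    Returns the last result, whether or not it succeeded.
    """
    result = None
    for attempt in range(max_attempts):
        result = func()
        if is_success(result):
            return result
        # Do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return result
```

In the indexing loop, this could be used as `call_with_retry(lambda: MultiModalEmbedding.call(...), lambda r: r.status_code == 200)`, keeping the existing `continue` branch for results that still fail.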

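3. Retrieval by modality

3.1. Text retrieval

For retrieval with a single text modality, obtain the text Embedding vector from the ONE-PEACE model, then use the query interface of the DashVector vector retrieval service to quickly retrieve similar images from the indexed collection. Here the text query is "cat". Example code: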

    import dashscope
    from dashscope import MultiModalEmbedding
    from dashvector import Client
    from urllib.request import urlopen
    from PIL import Image

    dashscope.api_key = '{your-dashscope-api-key}'


    def show_image(image_list):
        for img in image_list:
            # Note: on a Linux server, show() may need an image viewer to be installed before it works.
            # Running this code on a server with Jupyter Notebook support is recommended.
            img.show()


    def text_search(input_text):
        # Initialize the DashVector client
        client = Client(
            api_key='{your-dashvector-api-key}',
            endpoint='{your-dashvector-cluster-endpoint}'
        )

        # Get the collection populated above
        collection = client.get('imagenet1k_val_embedding')

        # Get the Embedding vector of the text query
        input = [{'text': input_text}]
        result = MultiModalEmbedding.call(model=MultiModalEmbedding.Models.multimodal_embedding_one_peace_v1,
                                          input=input,
                                          auto_truncation=True)
        if result.status_code != 200:
            raise Exception(f"ONE-PEACE failed to generate embedding of {input}, result: {result}")
        text_vector = result.output["embedding"]

        # DashVector vector retrieval
        rsp = collection.query(text_vector, topk=3)
        image_list = list()
        for doc in rsp:
            img_url = doc.fields['image_url']
            img = Image.open(urlopen(img_url))
            image_list.append(img)
        return image_list


    if __name__ == '__main__':
        # Text retrieval with the query "cat"
        text_query = "cat"
        show_image(text_search(text_query))
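
To make the retrieval step less of a black box, here is a toy, in-memory version of what a top-k similarity query computes. It assumes cosine similarity as the metric, which may not match how the collection in this tutorial is configured, and it scans every vector where a vector database would use an index. All names and the two-dimensional vectors are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_topk(query, docs, topk=3):
    """Return the ids of the topk docs most similar to query.

    docs maps a document id to its vector.
    """
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in docs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:topk]]


if __name__ == "__main__":
    # Made-up 2-D "embeddings"; real ONE-PEACE vectors have 1536 dimensions.
    docs = {"cat": [0.9, 0.1], "dog": [0.1, 0.9], "kitten": [0.8, 0.3]}
    print(brute_force_topk([1.0, 0.0], docs, topk=2))
```

Because text, audio, and images are embedded into the same vector space, the same comparison works regardless of which modality produced the query vector.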





3.2. Audio retrieval


    import dashscope
    from dashscope import MultiModalEmbedding
    from dashvector import Client
    from urllib.request import urlopen
    from PIL import Image

    dashscope.api_key = '{your-dashscope-api-key}'


    def show_image(image_list):
        for img in image_list:
            # Note: on a Linux server, show() may need an image viewer to be installed before it works.
            # Running this code on a server with Jupyter Notebook support is recommended.
            img.show()


    def audio_search(input_audio):
        # Initialize the DashVector client
        client = Client(
            api_key='{your-dashvector-api-key}',
            endpoint='{your-dashvector-cluster-endpoint}'
        )

        # Get the collection populated above
        collection = client.get('imagenet1k_val_embedding')

        # Get the Embedding vector of the audio query
        input = [{'audio': input_audio}]
        result = MultiModalEmbedding.call(model=MultiModalEmbedding.Models.multimodal_embedding_one_peace_v1,
                                          input=input,
                                          auto_truncation=True)
        if result.status_code != 200:
            raise Exception(f"ONE-PEACE failed to generate embedding of {input}, result: {result}")
        audio_vector = result.output["embedding"]

        # DashVector vector retrieval
        rsp = collection.query(audio_vector, topk=3)
        image_list = list()
        for doc in rsp:
            img_url = doc.fields['image_url']
            img = Image.open(urlopen(img_url))
            image_list.append(img)
        return image_list


    if __name__ == '__main__':
        # Audio retrieval with a cat meowing clip
        audio_url = "http://proxima-internal.oss-cn-zhangjiakou.aliyuncs.com/audio-dataset/esc-50/1-47819-A-5.wav"
        show_image(audio_search(audio_url))
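
The search scripts in this section repeat the same query-and-collect loop. One way to factor it out is a helper that accepts any object exposing `query(vector, topk=...)`, as the collection does in the code above, and returns the stored `image_url` fields. `collect_image_urls` is a name invented for this sketch; downloading and displaying the images is left out so the function stays easy to test.

```python
def collect_image_urls(collection, vector, topk=3):
    """Query the collection and return the image_url field of each hit.

    Hits without an image_url field are skipped instead of raising KeyError.
    """
    urls = []
    for doc in collection.query(vector, topk=topk):
        url = doc.fields.get("image_url")
        if url:
            urls.append(url)
    return urls
```

Each search function could then reduce to computing the query vector and calling this helper, with `Image.open(urlopen(url))` applied to the returned URLs.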





3.3. Text + audio retrieval


    import dashscope
    from dashscope import MultiModalEmbedding
    from dashvector import Client
    from urllib.request import urlopen
    from PIL import Image

    dashscope.api_key = '{your-dashscope-api-key}'


    def show_image(image_list):
        for img in image_list:
            # Note: on a Linux server, show() may need an image viewer to be installed before it works.
            # Running this code on a server with Jupyter Notebook support is recommended.
            img.show()


    def text_audio_search(input_text, input_audio):
        # Initialize the DashVector client
        client = Client(
            api_key='{your-dashvector-api-key}',
            endpoint='{your-dashvector-cluster-endpoint}'
        )

        # Get the collection populated above
        collection = client.get('imagenet1k_val_embedding')

        # Get the Embedding vector of the text + audio query
        input = [
            {'text': input_text},
            {'audio': input_audio},
        ]
        result = MultiModalEmbedding.call(model=MultiModalEmbedding.Models.multimodal_embedding_one_peace_v1,
                                          input=input,
                                          auto_truncation=True)
        if result.status_code != 200:
            raise Exception(f"ONE-PEACE failed to generate embedding of {input}, result: {result}")
        text_audio_vector = result.output["embedding"]

        # DashVector vector retrieval
        rsp = collection.query(text_audio_vector, topk=3)
        image_list = list()
        for doc in rsp:
            img_url = doc.fields['image_url']
            img = Image.open(urlopen(img_url))
            image_list.append(img)
        return image_list


    if __name__ == '__main__':
        # Text + audio retrieval
        # Text query: grass
        text_query = "grass"
        # Audio query: a cat meowing
        audio_url = "http://proxima-internal.oss-cn-zhangjiakou.aliyuncs.com/audio-dataset/esc-50/1-47819-A-5.wav"
        show_image(text_audio_search(text_query, audio_url))





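3.4. Image + audio retrieval

Next, try joint "image + audio" retrieval. As with the "text + audio" retrieval above, the image is a picture of grass (it must first be uploaded to public network storage to obtain a URL), and the audio query is again the "cat meowing" clip from ESC-50. Example code: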

    import dashscope
    from dashscope import MultiModalEmbedding
    from dashvector import Client
    from urllib.request import urlopen
    from PIL import Image

    dashscope.api_key = '{your-dashscope-api-key}'


    def show_image(image_list):
        for img in image_list:
            # Note: on a Linux server, show() may need an image viewer to be installed before it works.
            # Running this code on a server with Jupyter Notebook support is recommended.
            img.show()


    def image_audio_search(input_image, input_audio):
        # Initialize the DashVector client
        client = Client(
            api_key='{your-dashvector-api-key}',
            endpoint='{your-dashvector-cluster-endpoint}'
        )

        # Get the collection populated above
        collection = client.get('imagenet1k_val_embedding')

        # Get the Embedding vector of the image + audio query.
        # Note: the weight parameter `factor` of the audio input is set to 2 here (the default is 1)
        # to increase the influence of the audio input (the cat meow) on the results.
        input = [
            {'factor': 1, 'image': input_image},
            {'factor': 2, 'audio': input_audio},
        ]
        result = MultiModalEmbedding.call(model=MultiModalEmbedding.Models.multimodal_embedding_one_peace_v1,
                                          input=input,
                                          auto_truncation=True)
        if result.status_code != 200:
            raise Exception(f"ONE-PEACE failed to generate embedding of {input}, result: {result}")
        image_audio_vector = result.output["embedding"]

        # DashVector vector retrieval
        rsp = collection.query(image_audio_vector, topk=3)
        image_list = list()
        for doc in rsp:
            img_url = doc.fields['image_url']
            img = Image.open(urlopen(img_url))
            image_list.append(img)
        return image_list


    if __name__ == '__main__':
        # Image + audio retrieval
        # Image query: grass
        image_url = "http://proxima-internal.oss-cn-zhangjiakou.aliyuncs.com/image-dataset/grass-field.jpeg"
        # Audio query: a cat meowing
        audio_url = "http://proxima-internal.oss-cn-zhangjiakou.aliyuncs.com/audio-dataset/esc-50/1-47819-A-5.wav"
        show_image(image_audio_search(image_url, audio_url))
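
This tutorial does not say how the service applies `factor` internally. Purely as intuition, the sketch below shows one simple way a weight can pull a combined vector toward one modality: a weighted sum of unit-normalized per-modality vectors. This is an illustrative assumption, not the ONE-PEACE implementation, and the two-dimensional vectors are made up.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def weighted_fusion(vectors, factors):
    """Weighted sum of unit vectors, re-normalized to unit length."""
    if len(vectors) != len(factors):
        raise ValueError("vectors and factors must have the same length")
    fused = [0.0] * len(vectors[0])
    for vec, factor in zip(vectors, factors):
        for i, x in enumerate(normalize(vec)):
            fused[i] += factor * x
    return normalize(fused)


if __name__ == "__main__":
    image_vec = [1.0, 0.0]  # stand-in for a "grass" image embedding
    audio_vec = [0.0, 1.0]  # stand-in for a "cat meow" audio embedding
    print(weighted_fusion([image_vec, audio_vec], [1, 1]))
    print(weighted_fusion([image_vec, audio_vec], [1, 2]))
```

With equal weights the fused vector sits midway between the two inputs; doubling the audio weight tilts it toward the audio direction, which is the qualitative effect the `factor` setting aims for.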









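The Microsoft COCO Captioning validation set contains 5,000 well-annotated images with corresponding captions. Here the DashScope ONE-PEACE model extracts an "image + text" Embedding vector for each item and stores it; to make displaying results easier later, the original image URL and caption text are stored alongside it. Example code: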

    import dashscope
    from dashscope import MultiModalEmbedding
    from dashvector import Client, Doc, DashVectorException

    dashscope.api_key = '{your-dashscope-api-key}'

    # The ONE-PEACE model service currently accepts image and audio input only as URLs,
    # so upload the dataset to public network storage (for example OSS or S3) first
    # and collect the URLs of the images and audio files.
    # Each line of this file stores one public image URL and its caption text, separated by `;`.
    COCO_CAPTIONING_URLS_FILE_PATH = "cocoval5k-urls-captions.txt"


    def index_image_text():
        # Initialize the DashVector client
        client = Client(
            api_key='{your-dashvector-api-key}',
            endpoint='{your-dashvector-cluster-endpoint}'
        )

        # Create the collection: name and vector dimension (ONE-PEACE vectors always have 1536 dimensions)
        rsp = client.create('coco_val_embedding', 1536)
        if not rsp:
            raise DashVectorException(rsp.code, reason=rsp.message)

        # Generate "image + text" Embeddings with the DashScope ONE-PEACE model and insert them into DashVector
        collection = client.get('coco_val_embedding')
        with open(COCO_CAPTIONING_URLS_FILE_PATH, 'r') as file:
            for i, line in enumerate(file):
                # Split on the first `;` only, so a caption may itself contain `;`
                url, caption = line.strip('\n').split(";", 1)
                input = [
                    {'text': caption},
                    {'image': url},
                ]
                result = MultiModalEmbedding.call(model=MultiModalEmbedding.Models.multimodal_embedding_one_peace_v1,
                                                  input=input,
                                                  auto_truncation=True)
                if result.status_code != 200:
                    print(f"ONE-PEACE failed to generate embedding of {url}, result: {result}")
                    continue
                embedding = result.output["embedding"]
                collection.insert(
                    Doc(
                        id=str(i),
                        vector=embedding,
                        fields={'image_url': url, 'image_caption': caption}
                    )
                )
                if (i + 1) % 20 == 0:
                    print(f"---- Succeeded to insert {i + 1} image embeddings")


    if __name__ == '__main__':
        index_image_text()
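
The `url;caption` line format described above is simple, but blank lines or captions that themselves contain `;` are easy to trip over. A defensive parser might look like the following sketch; `parse_url_caption` is a hypothetical helper, and it assumes the URL itself contains no `;`.

```python
def parse_url_caption(line, sep=";"):
    """Split one 'url;caption' line into (url, caption).

    Only the first separator is used, so captions may contain ';'.
    Returns None for blank or malformed lines.
    """
    line = line.strip()
    if not line or sep not in line:
        return None
    url, caption = line.split(sep, 1)
    url, caption = url.strip(), caption.strip()
    if not url or not caption:
        return None
    return url, caption
```

The indexing loop could call this helper and `continue` when it returns `None`, so one bad line does not stop the whole run.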



3. Retrieval by modality

3.1. Text retrieval


    import dashscope
    from dashscope import MultiModalEmbedding
    from dashvector import Client
    from urllib.request import urlopen
    from PIL import Image

    dashscope.api_key = '{your-dashscope-api-key}'


    def show_image_text(image_text_list):
        for img, cap in image_text_list:
            # Note: on a Linux server, show() may need an image viewer to be installed before it works.
            # Running this code on a server with Jupyter Notebook support is recommended.
            img.show()
            print(cap)


    def text_search(input_text):
        # Initialize the DashVector client
        client = Client(
            api_key='{your-dashvector-api-key}',
            endpoint='{your-dashvector-cluster-endpoint}'
        )

        # Get the collection populated above
        collection = client.get('coco_val_embedding')

        # Get the Embedding vector of the text query
        input = [{'text': input_text}]
        result = MultiModalEmbedding.call(model=MultiModalEmbedding.Models.multimodal_embedding_one_peace_v1,
                                          input=input,
                                          auto_truncation=True)
        if result.status_code != 200:
            raise Exception(f"ONE-PEACE failed to generate embedding of {input}, result: {result}")
        text_vector = result.output["embedding"]

        # DashVector vector retrieval
        rsp = collection.query(text_vector, topk=3)
        image_text_list = list()
        for doc in rsp:
            img_url = doc.fields['image_url']
            img_cap = doc.fields['image_caption']
            img = Image.open(urlopen(img_url))
            image_text_list.append((img, img_cap))
        return image_text_list


    if __name__ == '__main__':
        # Text retrieval with the query "dog"
        text_query = "dog"
        show_image_text(text_search(text_query))



The fur on this dog is long enough to cover his eyes.


A picture of a dog on a bed.


A dog going to the bathroom in the park.

3.2. Audio retrieval


    import dashscope
    from dashscope import MultiModalEmbedding
    from dashvector import Client
    from urllib.request import urlopen
    from PIL import Image

    dashscope.api_key = '{your-dashscope-api-key}'


    def show_image_text(image_text_list):
        for img, cap in image_text_list:
            # Note: on a Linux server, show() may need an image viewer to be installed before it works.
            # Running this code on a server with Jupyter Notebook support is recommended.
            img.show()
            print(cap)


    def audio_search(input_audio):
        # Initialize the DashVector client
        client = Client(
            api_key='{your-dashvector-api-key}',
            endpoint='{your-dashvector-cluster-endpoint}'
        )

        # Get the collection populated above
        collection = client.get('coco_val_embedding')

        # Get the Embedding vector of the audio query
        input = [{'audio': input_audio}]
        result = MultiModalEmbedding.call(model=MultiModalEmbedding.Models.multimodal_embedding_one_peace_v1,
                                          input=input,
                                          auto_truncation=True)
        if result.status_code != 200:
            raise Exception(f"ONE-PEACE failed to generate embedding of {input}, result: {result}")
        audio_vector = result.output["embedding"]

        # DashVector vector retrieval
        rsp = collection.query(audio_vector, topk=3)
        image_text_list = list()
        for doc in rsp:
            img_url = doc.fields['image_url']
            img_cap = doc.fields['image_caption']
            img = Image.open(urlopen(img_url))
            image_text_list.append((img, img_cap))
        return image_text_list


    if __name__ == '__main__':
        # Audio retrieval with a dog barking clip
        audio_url = "http://proxima-internal.oss-cn-zhangjiakou.aliyuncs.com/audio-dataset/esc-50/1-100032-A-0.wav"
        show_image_text(audio_search(audio_url))



The fur on this dog is long enough to cover his eyes.


A dog standing on a bed in a room.


A small black and white dog with the wind blowing through it's hair.

3.3. Text + audio retrieval


    import dashscope
    from dashscope import MultiModalEmbedding
    from dashvector import Client
    from urllib.request import urlopen
    from PIL import Image

    dashscope.api_key = '{your-dashscope-api-key}'


    def show_image_text(image_text_list):
        for img, cap in image_text_list:
            # Note: on a Linux server, show() may need an image viewer to be installed before it works.
            # Running this code on a server with Jupyter Notebook support is recommended.
            img.show()
            print(cap)


    def text_audio_search(input_text, input_audio):
        # Initialize the DashVector client
        client = Client(
            api_key='{your-dashvector-api-key}',
            endpoint='{your-dashvector-cluster-endpoint}'
        )

        # Get the collection populated above
        collection = client.get('coco_val_embedding')

        # Get the Embedding vector of the text + audio query
        input = [
            {'text': input_text},
            {'audio': input_audio},
        ]
        result = MultiModalEmbedding.call(model=MultiModalEmbedding.Models.multimodal_embedding_one_peace_v1,
                                          input=input,
                                          auto_truncation=True)
        if result.status_code != 200:
            raise Exception(f"ONE-PEACE failed to generate embedding of {input}, result: {result}")
        text_audio_vector = result.output["embedding"]

        # DashVector vector retrieval
        rsp = collection.query(text_audio_vector, topk=3)
        image_text_list = list()
        for doc in rsp:
            img_url = doc.fields['image_url']
            img_cap = doc.fields['image_caption']
            img = Image.open(urlopen(img_url))
            image_text_list.append((img, img_cap))
        return image_text_list


    if __name__ == '__main__':
        # Text + audio retrieval
        text_query = "beach"
        # Audio query: a dog barking
        audio_url = "http://proxima-internal.oss-cn-zhangjiakou.aliyuncs.com/audio-dataset/esc-50/1-100032-A-0.wav"
        show_image_text(text_audio_search(text_query, audio_url))



a couple of dogs stand on a beach next to some water.


A view of a beach that has some people sitting on it.


people enjoy swimming in the waves of the ocean on a sunny day at the beach.

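In the results above, the last two images mostly show the beach matching the "beach" text input, while the dog indicated by the "dog barking" audio input is hardly visible: in the second image the dog standing in the water can only be seen after zooming in, and the third image contains essentially no dog.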


    # The rest of the code is unchanged.
    # The `factor` parameter adjusts the weight of each modality's input; the default is 1, and audio is set to 2 here.
    input = [
        {'factor': 1, 'text': input_text},
        {'factor': 2, 'audio': input_audio},
    ]

After replacing input, run the code above again. The results are:


a couple of dogs stand on a beach next to some water.


A beautiful woman in a bikini surfing with her dog.


A small black and white dog with the wind blowing through it's hair.
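
Since every query in this tutorial builds the same kind of `input` list, a small builder can keep the modality weights in one place. The dictionary layout (`text`, `image`, and `audio` keys plus an optional `factor`) is taken from the snippets above; the function name `build_input` and its arguments are hypothetical.

```python
def build_input(text=None, image=None, audio=None, factors=None):
    """Build the input list for a multimodal embedding request.

    factors maps a modality name to its weight; modalities without an
    entry get no explicit 'factor' key and so fall back to the default.
    """
    factors = factors or {}
    items = []
    for modality, value in (("text", text), ("image", image), ("audio", audio)):
        if value is None:
            continue
        item = {modality: value}
        if modality in factors:
            item["factor"] = factors[modality]
        items.append(item)
    if not items:
        raise ValueError("at least one of text, image, audio is required")
    return items
```

For example, `build_input(text="beach", audio=audio_url, factors={"audio": 2})` produces a list equivalent to the weighted query shown above, with the text weight left at its default.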





