
AI Large Model Application Development in Practice: 1. A First Look at Embeddings


Preparation

1. Make sure your API key is set in your environment.

2. Install the dependencies.

!pip install tiktoken openai pandas matplotlib plotly scikit-learn numpy
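
To make step 1 concrete, here is a minimal sketch of a key check. It assumes the key lives in the `OPENAI_API_KEY` environment variable, which is the variable the `openai` client reads by default; the helper name `api_key_is_set` is made up for this illustration.

```python
import os


def api_key_is_set(env=None, name="OPENAI_API_KEY"):
    """Return True if the variable exists and is not blank.

    `env` defaults to the real process environment; passing a dict
    makes the check easy to try out without touching real settings.
    """
    if env is None:
        env = os.environ
    return bool(env.get(name, "").strip())


if not api_key_is_set():
    print("OPENAI_API_KEY is not set; export it before creating the client.")
```

Running this before any API call gives a clearer message than waiting for the first request to fail.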

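1. Generating Embeddings (with the text-embedding-ada-002 model)

Embeddings are useful for working with natural language and code because other machine learning models and algorithms, such as clustering or search, can readily consume and compare them.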
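
As a rough illustration of what "comparing" embeddings means, the sketch below computes cosine similarity with NumPy. The three 3-dimensional vectors are invented toy values, not real model output, and the names `cat`, `kitten`, and `car` are only labels for this example.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors standing in for embeddings (assumed values for illustration).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

print("cat vs kitten:", cosine_similarity(cat, kitten))
print("cat vs car:   ", cosine_similarity(cat, car))
```

Vectors pointing in similar directions score close to 1, which is the property that search and clustering algorithms rely on.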

Amazon Fine Food Reviews dataset (amazon-fine-food-reviews)

Source: Fine Food Reviews dataset

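The dataset contains a total of 568,454 food reviews left by Amazon users up to October 2012. For illustration, we use a subset of the 1,000 most recent reviews. The reviews are written in English and tend to be either positive or negative. Each review has a product ID, a user ID, a score, a title (summary), and a body.

We combine the review summary and body into a single text. The model encodes this combined text and outputs a single embedding vector.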

# Import pandas, a Python library for data processing and analysis.
# Its DataFrame structure makes it easy to read, process, and analyze data.
import pandas as pd
# Import tiktoken, OpenAI's tokenizer library, used here to count the tokens in a text.
import tiktoken
Load the dataset
input_datapath = "data/fine_food_reviews_1k.csv"
df = pd.read_csv(input_datapath, index_col=0)
df = df[["Time", "ProductId", "UserId", "Score", "Summary", "Text"]]
df = df.dropna()

# Combine the "Summary" and "Text" fields into a new field, "combined"
df["combined"] = (
    "Title: " + df.Summary.str.strip() + "; Content: " + df.Text.str.strip()
)
df.head(2)
  | Time | ProductId | UserId | Score | Summary | Text | combined
0 | 1351123200 | B003XPF9BO | A3R7JR3FMEBXQB | 5 | where does one start...and stop... with a tre... | Wanted to save some to bring to my Chicago fam... | Title: where does one start...and stop... wit...
1 | 1351123200 | B003JK537S | A3JBPC3WFUT5ZP | 1 | Arrived in pieces | Not pleased at all. When I opened the box, mos... | Title: Arrived in pieces; Content: Not pleased...
df["combined"]
0      Title: where does one  start...and stop... wit...
1      Title: Arrived in pieces; Content: Not pleased...
2      Title: It isn't blanc mange, but isn't bad . ....
3      Title: These also have SALT and it's not sea s...
4      Title: Happy with the product; Content: My dog...
                             ...                        
995    Title: Delicious!; Content: I have ordered the...
996    Title: Good Training Treat; Content: My dog wi...
997    Title: Jamica Me Crazy Coffee; Content: Wolfga...
998    Title: Party Peanuts; Content: Great product f...
999    Title: I love Maui Coffee!; Content: My first ...
Name: combined, Length: 1000, dtype: object
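Key parameters of the embedding model
# Model name
# Recommended: text-embedding-ada-002, the officially recommended second-generation embedding model
embedding_model = "text-embedding-ada-002"
# Tokenizer (encoding) that corresponds to text-embedding-ada-002
embedding_encoding = "cl100k_base"
# text-embedding-ada-002 accepts at most 8191 input tokens and returns 1536-dimensional vectors
# In this demo we filter out texts longer than 8000 tokens
max_tokens = 8000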
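
The demo drops over-long texts. A different option, not used in this article, is to truncate a text to the token limit instead. The sketch below shows that idea under stated assumptions: `truncate_to_max_tokens` is a hypothetical helper, and `WhitespaceEncoding` is a toy stand-in tokenizer so the example runs offline. Any object with `encode` and `decode` methods would work, including the one returned by `tiktoken.get_encoding("cl100k_base")`.

```python
class WhitespaceEncoding:
    """Toy stand-in tokenizer (assumption): one token per whitespace-separated word."""

    def encode(self, text):
        return text.split()

    def decode(self, tokens):
        return " ".join(tokens)


def truncate_to_max_tokens(text, encoding, max_tokens=8000):
    """Return `text` unchanged if it fits, else only its first `max_tokens` tokens."""
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return encoding.decode(tokens[:max_tokens])


enc = WhitespaceEncoding()
print(truncate_to_max_tokens("one two three four five", enc, max_tokens=3))
```

Truncation keeps every review in the sample at the cost of losing the tail of long ones; filtering, as done here, keeps each text intact but shrinks the sample.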
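Reduce the sample to the 1,000 most recent reviews and drop samples that are too long
# Number of reviews to keep
top_n = 1000
# Sort the DataFrame by the "Time" column and take the last 2,000 reviews.
# The assumption is that the most recent reviews are the most relevant, so we pre-select twice as many as we need.
df = df.sort_values("Time").tail(top_n * 2)
# Drop the "Time" column, since this analysis no longer needs it.
df.drop("Time", axis=1, inplace=True)
# Get the tokenizer named by 'embedding_encoding'
encoding = tiktoken.get_encoding(embedding_encoding)

# Count the tokens in each review with encoding.encode and store the result in a new 'n_tokens' column.
df["n_tokens"] = df.combined.apply(lambda x: len(encoding.encode(x)))

# Discard any review whose token count exceeds the maximum allowed,
# then use .tail to keep the last top_n (1,000) reviews that are within the limit.
df = df[df.n_tokens <= max_tokens].tail(top_n)

# Show the number of remaining reviews.
len(df)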
1000
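
The reason for taking `top_n * 2` rows first is easier to see on a tiny example. The sketch below repeats the same sort, oversample, filter, and trim steps on a made-up six-row DataFrame; the helper name `latest_within_limit` and all values in `toy` are assumptions for illustration only.

```python
import pandas as pd


def latest_within_limit(df, top_n, max_tokens):
    """Keep the most recent `top_n` rows whose n_tokens is within the limit."""
    # Oversample: take twice as many recent rows as needed, so that
    # dropping over-long rows still leaves enough candidates.
    recent = df.sort_values("Time").tail(top_n * 2)
    # Filter out rows over the token limit, then trim to top_n.
    return recent[recent.n_tokens <= max_tokens].tail(top_n)


toy = pd.DataFrame(
    {
        "Time": [5, 1, 4, 2, 3, 6],
        "combined": ["e", "a", "d", "b", "c", "f"],
        # The newest row ("f") is too long and must be dropped.
        "n_tokens": [10, 10, 10, 10, 10, 9000],
    }
)

kept = latest_within_limit(toy, top_n=2, max_tokens=8000)
print(kept["combined"].tolist())
```

Taking only the last two rows before filtering would leave a single review here, because the newest one exceeds the limit; oversampling first still yields two.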
Generate embeddings and save them
from openai import OpenAI
# Usage since the OpenAI Python SDK v1.0 update
client = OpenAI()
# How to create an embedding vector with the new SDK version
# Ref: https://community.openai.com/t/embeddings-api-documentation-needs-to-updated/475663
res = client.embeddings.create(input="abc", model=embedding_model)
print(res.data[0].embedding)
[0.002615210600197315, -0.011296011507511139, -0.00963275134563446, -0.039097219705581665, -0.03448256105184555, 0.012145334854722023, -0.021303880959749222, -0.02280435338616371, 0.018685130402445793, -0.00038153232890181243, 0.0033760634250938892, 0.019194725900888443, -0.0026983735151588917, -0.004526189994066954, -0.018840840086340904, 0.0029708649963140488, 0.027093440294265747, 0.010666095651686192, 0.010800572112202644, 0.007530673872679472, -0.014537598006427288, 0.01776503026485443, -0.0069786133244633675, -0.01562756486237049, -0.020199758931994438, -0.003558314172551036, 0.009993714280426502, -0.020058205351233482, 0.026272427290678024, -0.007325420621782541, 0.00774300517514348, 0.013787361793220043, -0.007459897082298994, -0.009944169782102108, -0.010397142730653286, -0.014219101518392563, -0.013645808212459087, -0.015924828127026558, 0.010382987558841705, -0.00026408673147670925, 0.024347292259335518, 0.0045155733823776245, 0.014098781161010265, -0.022960063070058823, -0.012782328762114048, -0.011444643139839172, -0.017241280525922775, -0.015287834219634533, -0.024262359365820885, 0.01572665199637413, -0.007389120291918516, 0.0043846359476447105, -0.025890231132507324, -0.023087460547685623, 0.006684888619929552, -0.007474052254110575, 0.005103022791445255, 0.035331886261701584, 0.0025249698664993048, -0.010602395981550217, 0.0074315862730145454, 0.0067521268501877785, -0.01320699043571949, 0.008068579249083996, 0.00899575836956501, -0.02015729248523712, 0.0003324307908769697, 0.004756215028464794, -0.008394153788685799, 0.007530673872679472, 0.01789242774248123, 0.015938982367515564, 0.008040268905460835, -0.005219804588705301, 0.03043411485850811, -0.011458798311650753, -0.022054117172956467, 0.0006246071425266564, -0.010885504074394703, 0.01484901737421751, 0.004172304645180702, -0.023724455386400223, -0.00904530193656683, 0.013978459872305393, 0.007247566245496273, 0.020822597667574883, -0.0004136031784582883, -0.005283503793179989, 
-0.0013960765209048986, 0.000543655944056809, 0.025607122108340263, 0.0047279042191803455, 0.015273679047822952, 0.002765611745417118, -0.01572665199637413, -0.006561029236763716, -0.000593199860304594, 0.03187796473503113, 0.0040767560712993145, -0.04521235451102257, 0.020199758931994438, 0.013376855291426182, -0.02045455574989319, -0.01793489418923855, -0.020468711853027344, -0.03289715573191643, -0.005605539306998253, 0.004565117415040731, 0.03216107562184334, -0.010411298833787441, 0.006008968222886324, 0.0042289262637495995, 0.007693461142480373, -0.039578504860401154, 0.005555995274335146, 0.01151541993021965, -0.0026028247084468603, -0.015740808099508286, 0.0034309157636016607, -0.005191493779420853, 0.004458951763808727, 0.028664689511060715, 0.02781536616384983, -0.0031248051673173904, 0.018670976161956787, -0.0018685130635276437, -0.012039169669151306, -0.015641719102859497, 0.014311111532151699, 0.01046084240078926, 0.0396917499601841, 0.01562756486237049, 0.01659013144671917, 0.01694401726126671, -0.026838643476366997, 0.01598144881427288, -0.021813474595546722, 0.024517156183719635, -0.04671282693743706, -0.024545468389987946, -0.0055524567142128944, 0.035360194742679596, -0.020879218354821205, 0.009569051675498486, -0.0008608254138380289, 0.009094846434891224, 0.03397296741604805, 0.002877085469663143, -0.008075657300651073, -0.004363402724266052, 0.01334854494780302, -0.0230308398604393, 0.014403121545910835, 0.021431278437376022, -0.010920893400907516, 0.016377801075577736, 0.0073608094826340675, 0.0023037916980683804, -0.006822904106229544, -0.016391955316066742, -0.007049390580505133, 0.021261414512991905, 0.021586988121271133, -0.016830774024128914, 0.002547972369939089, -0.004441257566213608, 0.02921675145626068, 0.008394153788685799, -0.01610884815454483, -0.025946851819753647, -0.017538543790578842, 3.782146450248547e-05, -0.007127244956791401, -0.001010341802611947, -0.016123004257678986, 0.024389758706092834, -0.006638883613049984, 
0.010333443991839886, -0.02593269757926464, -0.011989626102149487, -0.00022549113782588392, 0.01869928650557995, 0.013185757212340832, 0.02845235913991928, -0.01441727764904499, 0.006670733448117971, 0.01709972694516182, -0.020270535722374916, 0.022195670753717422, -0.032557424157857895, 0.0020330697298049927, 0.002670062705874443, 0.02083675190806389, -0.024998441338539124, -0.6858009099960327, -0.009774304926395416, 0.004055522847920656, -0.018133070319890976, 0.0073466538451612, 0.03201952204108238, 0.016193781048059464, 0.0005786020774394274, -0.021615300327539444, 0.004328014329075813, 0.010913815349340439, 0.00229671411216259, -0.0018313551554456353, -0.00036892518983222544, -0.00587803078815341, -0.01277525071054697, 0.010609474033117294, -0.015259523876011372, -0.0011262391926720738, 0.023950939998030663, -0.014905638992786407, 0.013822750188410282, -0.021261414512991905, -0.0056692385114729404, 0.019534455612301826, 0.0009581437916494906, 0.006872447673231363, -0.031056953594088554, -0.017779186367988586, -0.0031407298520207405, -0.012286889366805553, 0.0045509617775678635, 0.0011625124607235193, 0.023526279255747795, 0.045127421617507935, 0.012499220669269562, -0.015443543903529644, 0.020695198327302933, 0.021841786801815033, 0.028593912720680237, 0.0027939225547015667, -0.01510381419211626, 0.01671753078699112, 0.0034397628623992205, -0.013369777239859104, 0.00900991354137659, 0.013426398858428001, -0.0009028492495417595, 0.01108368020504713, -0.01847280003130436, 0.027631346136331558, -0.011833916418254375, -0.014933949336409569, -0.0021622376516461372, 0.024134961888194084, -0.009024068713188171, 0.024828575551509857, -0.0027709200512617826, -0.0012332894839346409, 0.008464930579066277, -0.009413342922925949, 0.007290032226592302, -0.004685438238084316, -0.040880803018808365, -0.018274623900651932, 0.0213746577501297, -0.015740808099508286, 0.01575496233999729, 0.033321816474199295, -0.017920739948749542, -0.011833916418254375, 0.007403275463730097, 
-0.0015199363697320223, -0.012223189696669579, 0.018515266478061676, 0.011642818339169025, 0.01926550269126892, 0.005453368648886681, -0.024587934836745262, 0.01051038596779108, 0.0006905182381160557, 0.012470909394323826, -0.008394153788685799, 0.01048207562416792, 0.022662799805402756, 0.013242378830909729, -0.027036817744374275, 0.004788064863532782, 0.013164523988962173, -0.01700063794851303, 0.02730577066540718, 0.028763778507709503, -0.005987734999507666, -0.011494186706840992, 0.014169557951390743, 0.01608053781092167, 3.502909021335654e-05, 0.008832971565425396, 0.00271960673853755, -0.03756843879818916, -0.009710606187582016, 0.0004341727471910417, 0.019959118217229843, -0.006419475190341473, 0.008592328988015652, -0.007077701389789581, 0.009894626215100288, 0.020723508670926094, 0.04778863862156868, -0.023823542520403862, -0.0076226843520998955, -0.0025302781723439693, -0.005315353628247976, -0.006493791006505489, 0.01605222560465336, -0.040965735912323, 0.008401230908930302, -0.008691417053341866, 0.01694401726126671, -0.015613408759236336, 0.011111990548670292, -0.0012288658181205392, 0.013723663054406643, 0.0039812070317566395, -0.010531619191169739, 0.010595318861305714, 0.004957929719239473, -0.0022825587075203657, -0.003404374234378338, -0.007417431101202965, -0.00263644359074533, 0.005237498786300421, 0.016802461817860603, -0.008691417053341866, 0.006642422638833523, 0.009944169782102108, 0.010290977545082569, -0.025196615606546402, 0.0057647875510156155, 0.0029956370126456022, -0.02656969055533409, -0.022790197283029556, -0.004766831640154123, -0.01150126475840807, -0.0022949445992708206, -0.026753710582852364, -0.02335641346871853, 0.000818359199911356, -0.011317243799567223, 0.008273832499980927, -0.023398879915475845, -0.005191493779420853, -0.007757160346955061, 0.032840535044670105, 0.0034397628623992205, -0.04229634255170822, 0.0076226843520998955, -0.022153204306960106, -0.003230970585718751, 0.025338171049952507, 0.024828575551509857, 
-0.0017853501485660672, -0.006164677906781435, 0.0197184756398201, 0.004628816619515419, -0.017793340608477592, 0.005149027798324823, -0.006341620348393917, -0.01809060387313366, -0.028282493352890015, 0.0012757556978613138, -0.03326519578695297, -0.0062673045322299, 0.024234049022197723, 0.0031283439602702856, 0.007856247946619987, -0.010772260837256908, 0.016774151474237442, 0.0150755038484931, 0.0001987285795621574, 0.0012589461402967572, 0.0057152435183525085, -0.007629761938005686, -0.009774304926395416, 0.025182461366057396, 0.009179778397083282, 0.011706518009305, 0.01613715849816799, -0.027319926768541336, 0.0024948897771537304, -0.007934102788567543, 0.010906737297773361, 0.010149423964321613, -0.0003963512717746198, -0.011055368930101395, -0.01051038596779108, 0.007842092774808407, 0.00587803078815341, 0.01684492826461792, 0.018543576821684837, 0.035388506948947906, 0.0038785801734775305, 0.01073687244206667, 0.01751023344695568, 0.011041213758289814, -0.023582899942994118, 0.00029040692606940866, -0.02450300194323063, 0.03436931595206261, 0.010524542070925236, -0.010198967531323433, -0.018826685845851898, -0.0041121444664895535, -0.01984587498009205, -0.01073687244206667, 0.03541681542992592, -0.024064183235168457, 0.004614660982042551, 0.022252293303608894, -0.006369931157678366, -0.012732784263789654, -0.00610097823664546, 0.002342719119042158, 0.027716277167201042, -0.00753775192424655, -0.0051454887725412846, -0.005828486755490303, -0.004869458265602589, -0.015259523876011372, -0.007934102788567543, -0.0001561517856316641, 0.007438663858920336, -0.000964336795732379, 0.01748192124068737, 0.006720277480781078, 0.0006967112421989441, 0.008875437080860138, -0.023526279255747795, 0.028225872665643692, -0.013475943356752396, 0.0062779211439192295, 0.02893364243209362, 0.02051117829978466, 0.0024842731654644012, 0.020426245406270027, -0.005559534300118685, 0.02315823920071125, 0.013249456882476807, -0.007311265449970961, 0.006638883613049984, 
-0.012916804291307926, 0.000399447773816064, -0.040371209383010864, 0.03017931804060936, 0.011027058586478233, 0.006953841540962458, 0.02015729248523712, 0.013730740174651146, 0.01595313847064972, 0.023214859887957573, 0.0070246183313429356, 0.014679152518510818, 0.0034786900505423546, 0.00303456443361938, 0.019449522718787193, 0.004282009322196245, 0.004412946756929159, -0.02335641346871853, -0.019605232402682304, -0.017340367659926414, 0.00567985512316227, -0.03221769630908966, -0.01913810335099697, -0.011090758256614208, 0.017496077343821526, -0.024290669709444046, -0.03606796637177467, 0.006090362090617418, 0.019378745928406715, -0.016759997233748436, -0.00437755836173892, -0.02191256359219551, 0.0038007255643606186, 0.01694401726126671, -0.0027479175478219986, -0.015146280638873577, -0.0003709157754201442, 0.004543884191662073, -0.03465242683887482, 0.01974678598344326, -0.007856247946619987, -0.008592328988015652, 0.0022878670133650303, 0.005594922695308924, 0.004306781105697155, -0.003977668005973101, 0.010694406926631927, -0.01020604558289051, -0.035105399787425995, -0.012223189696669579, -0.008733883500099182, -0.02651306800544262, -0.006012507248669863, -0.015401077456772327, 0.0432589091360569, 0.0037901089526712894, -0.004444796591997147, -0.020213915035128593, -0.02066688798367977, 0.0022595562040805817, -0.012272734194993973, 0.008259677328169346, -0.024234049022197723, 0.0016429113456979394, -0.004487262573093176, -0.008203055709600449, -0.01872759684920311, -0.009243478067219257, 0.034878913313150406, -0.008075657300651073, -0.02318654954433441, -0.003726409748196602, -0.014360656030476093, -0.003294670023024082, 0.05744262412190437, 0.013426398858428001, -0.006019584834575653, 0.024743642657995224, -0.009915859438478947, 0.018232157453894615, -0.008068579249083996, -0.0032522038090974092, 0.020015738904476166, -0.002224167576059699, 0.008797582238912582, -0.00468189921