当前位置:   article > 正文

使用python获取字段相似度匹配最接近的值_python列表里找一个最接近

python列表里找一个最接近

目录

一、前言

二、简单介绍fuzzywuzzy

三、mysql表数据对照,取相似度最高值

四、将代码打包成可执行文件


一、前言

对于做数据治理、数据集成平台都会涉及到数据标准化、mdm主数据管理,说白了就是字典对照,那么对于一些又多又杂的数据字典(如医疗中的诊断、检验项目)人工对照实在是耗时耗力耗眼睛。

针对上面的问题,本文介绍了以下几点内容:

1.python第三方库 fuzzywuzzy

2.python打包可执行文件

二、简单介绍fuzzywuzzy

fuzzywuzzy是一个Python库,用于模糊字符串匹配,可以帮助你在处理文本数据时,找到最相似的字符串。它使用Levenshtein距离算法来计算两个字符串之间的相似度。

fuzzywuzzy库提供了两个主要的函数:fuzz  和  process

fuzz 函数用于计算两个字符串之间的相似度,它返回一个介于0和100之间的整数,表示两个字符串的相似度,示例如下:

  1. from fuzzywuzzy import fuzz
  2. string1 = "apple"
  3. string2 = "appel"
  4. similarity = fuzz.ratio(string1, string2)
  5. print(similarity) # 输出: 91

process函数用于在一个字符串列表中查找与给定字符串最相似的字符,示例如下:

  1. from fuzzywuzzy import process
  2. string = "apple"
  3. choices = ["appel", "banana", "orange", "pear"]
  4. result = process.extractOne(string, choices)
  5. print(result) # 输出: ('appel', 91)

三、mysql表数据对照,取相似度最高值

在dic_map表中存有检验项目标准字典,我将源数据字典存储在了source_lab_item表中

1.连接数据库查询dic_map、source_lab_item表的数据

  1. import pymysql
  2. mydb = pymysql.connect(
  3. host="127.0.0.1",
  4. port=3306,
  5. user="user",
  6. password="pwd",
  7. database="database_name"
  8. )
  1. mycursor = mydb.cursor()
  2. mycursor.execute("SELECT lab_item FROM dic_map")
  3. result_a = mycursor.fetchall()
  4. mycursor.execute("SELECT source_lab_item FROM source_lab_item")
  5. result_b = mycursor.fetchall()

2.遍历表dic_map中的每一条数据,使用fuzzywuzzy库中的ratio函数计算表dic_map中的数据与表source_lab_item中的数据的相似度,取相似度最高的值存放在matched_item变量中。

  1. for item_a in result_a:
  2. max_ratio = 0
  3. matched_item = ""
  4. for item_b in result_b:
  5. ratio = fuzz.ratio(item_a[0], item_b[0])
  6. if ratio > max_ratio:
  7. max_ratio = ratio
  8. matched_item = item_b[0]

3.建个临时表mdm_lab_item(lab_item, source_lab_item) ,将匹配结果存放在该表中

  1. sql = "INSERT INTO mdm_lab_item (lab_item, source_lab_item) VALUES (%s, %s)"
  2. val = (item_a[0], matched_item)
  3. mycursor.execute(sql, val)
  4. mydb.commit()

4.上代码

  1. from fuzzywuzzy import fuzz
  2. import pandas as pd
  3. import pymysql
  4. mydb = pymysql.connect(
  5. host="127.0.0.1",
  6. port=3306,
  7. user="user",
  8. password="pwd",
  9. database="database_name"
  10. )
  11. # 查询表A
  12. mycursor = mydb.cursor()
  13. mycursor.execute("SELECT lab_item FROM dic_map where type = 'lab'")
  14. result_a = mycursor.fetchall()
  15. mycursor.execute("SELECT source_lab_item FROM source_lab_item")
  16. result_b = mycursor.fetchall()
  17. for item_a in result_a:
  18. max_ratio = 0
  19. matched_item = ""
  20. for item_b in result_b:
  21. ratio = fuzz.ratio(item_a[0], item_b[0])
  22. if ratio > max_ratio:
  23. max_ratio = ratio
  24. matched_item = item_b[0]
  25. # 将最接近的匹配结果存放在表C
  26. sql = "INSERT INTO mdm_lab_item (lab_item, source_lab_item) VALUES (%s, %s)"
  27. val = (item_a[0], matched_item)
  28. mycursor.execute(sql, val)
  29. mydb.commit()

四、将代码打包成可执行文件

1.安装pyinstaller库,win+R  cmd

pip install pyinstaller

2.将.py文件放在单独的文件夹下,我这里放在D盘的demo(在其他目录下太乱了怕你找不到)

同样的在winddows命令窗口切换到该路径下,执行

pyinstaller -F -p D:\python\venv\Lib\site-packages 检验项目相似度匹配.py

- p 后边的路径是需要用到的第三方库,例如本次使用的pymysql、fuzzywuzzy、pandas

 

3.执行后能看到生成了两个文件夹build、dist,在dist中有.exe可执行文件,将整个文件夹demo打包后,放在其他没有python环境的电脑也能执行了

4.自动对照的数据示例:

补充:我在后边完善了下,总的代码如下

  1. import sys
  2. from PyQt5.QtWidgets import QApplication, QWidget, QLabel, QLineEdit, QPushButton, QVBoxLayout,QGridLayout ,QMessageBox
  3. import pymysql
  4. from fuzzywuzzy import fuzz
  5. class MySQLConnector(QWidget):
  6. def __init__(self):
  7. super().__init__()
  8. self.initUI()
  9. def initUI(self):
  10. # 创建标签和文本框
  11. self.ip_label = QLabel('IP地址')
  12. self.ip_edit = QLineEdit()
  13. self.port_label = QLabel('端口号')
  14. self.port_edit = QLineEdit()
  15. self.username_label = QLabel('用户名')
  16. self.username_edit = QLineEdit()
  17. self.password_label = QLabel('密码')
  18. self.password_edit = QLineEdit()
  19. self.dbname_label = QLabel('数据库名称')
  20. self.dbname_edit = QLineEdit()
  21. self.sourcetabname_label = QLabel('源字典表名')
  22. self.sourcetabname_edit = QLineEdit()
  23. self.sourcecolumn_label = QLabel('源字典字段名')
  24. self.sourcecolumn_edit = QLineEdit()
  25. self.systabname_label = QLabel('本地字典表名')
  26. self.systabname_edit = QLineEdit()
  27. self.syscolumn_label = QLabel('本地字典字段名')
  28. self.syscolumn_edit = QLineEdit()
  29. # 创建连接按钮
  30. self.connect_button = QPushButton('连接')
  31. self.connect_button.clicked.connect(self.connectMySQL)
  32. self.mate_button = QPushButton('匹配')
  33. self.mate_button.clicked.connect(self.get_approx_score)
  34. # 创建布局
  35. vbox = QGridLayout()
  36. vbox.addWidget(self.ip_label,0,0)
  37. vbox.addWidget(self.ip_edit,0,1)
  38. vbox.addWidget(self.port_label)
  39. vbox.addWidget(self.port_edit)
  40. vbox.addWidget(self.username_label)
  41. vbox.addWidget(self.username_edit)
  42. vbox.addWidget(self.password_label)
  43. vbox.addWidget(self.password_edit)
  44. vbox.addWidget(self.dbname_label)
  45. vbox.addWidget(self.dbname_edit)
  46. vbox.addWidget(self.connect_button)
  47. vbox.addWidget(self.mate_button)
  48. vbox.addWidget(self.sourcetabname_label)
  49. vbox.addWidget(self.sourcetabname_edit)
  50. vbox.addWidget(self.sourcecolumn_label)
  51. vbox.addWidget(self.sourcecolumn_edit)
  52. vbox.addWidget(self.systabname_label)
  53. vbox.addWidget(self.systabname_edit)
  54. vbox.addWidget(self.syscolumn_label)
  55. vbox.addWidget(self.syscolumn_edit)
  56. self.setLayout(vbox)
  57. self.setWindowTitle('匹配近似值')
  58. self.show()
  59. def connectMySQL(self):
  60. # 获取输入框的值
  61. ip = self.ip_edit.text()
  62. port = self.port_edit.text()
  63. username = self.username_edit.text()
  64. password = self.password_edit.text()
  65. dbname = self.dbname_edit.text()
  66. # 连接MySQL
  67. try:
  68. conn = pymysql.connect(host=ip, port=int(port), user=username, password=password, db=dbname)
  69. print('MySQL连接成功')
  70. msg_box = QMessageBox(QMessageBox.Question, '确定', 'MySQL连接成功')
  71. msg_box.exec_()
  72. except Exception as e:
  73. print('MySQL连接失败:', e)
  74. msg_box = QMessageBox(QMessageBox.Question, '确定', 'MySQL连接失败')
  75. msg_box.exec_()
  76. def get_approx_score(self):
  77. ip = self.ip_edit.text()
  78. port = self.port_edit.text()
  79. username = self.username_edit.text()
  80. password = self.password_edit.text()
  81. dbname = self.dbname_edit.text()
  82. sourcetab = self.sourcetabname_edit.text()
  83. sourcecolumn = self.sourcecolumn_edit.text()
  84. systab = self.systabname_edit.text()
  85. syscolumn = self.syscolumn_edit.text()
  86. mydb = pymysql.connect(
  87. host=ip,
  88. port=int(port),
  89. user=username,
  90. password=password,
  91. database=dbname
  92. )
  93. # 查询表A
  94. mycursor = mydb.cursor()
  95. mycursor.execute("SELECT "+ sourcecolumn+" FROM "+sourcetab)
  96. result_a = mycursor.fetchall()
  97. mycursor.execute("SELECT "+ syscolumn+" FROM "+systab)
  98. result_b = mycursor.fetchall()
  99. for item_a in result_a:
  100. max_ratio = 0
  101. matched_item = ""
  102. for item_b in result_b:
  103. ratio = fuzz.ratio(item_a[0], item_b[0])
  104. if ratio > max_ratio:
  105. max_ratio = ratio
  106. matched_item = item_b[0]
  107. # 将最接近的匹配结果存放在表C
  108. sql_create_table ="CREATE TABLE if not exists `get_valueapporx_sorce` (`source_value` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,`value` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL, `sorce` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL)"
  109. mycursor.execute(sql_create_table)
  110. sql = "INSERT INTO get_valueapporx_sorce (value, source_value,sorce) VALUES (%s, %s, %s)"
  111. val = (item_a[0], matched_item, max_ratio)
  112. mycursor.execute(sql, val)
  113. mydb.commit()
  114. msg_box = QMessageBox(QMessageBox.Question, '确定', '生成完成!请在get_valueapporx_sorce表中查看')
  115. msg_box.exec_()
  116. if __name__ == '__main__':
  117. app = QApplication(sys.argv)
  118. ex = MySQLConnector()
  119. sys.exit(app.exec_())

声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/Monodyee/article/detail/424192
推荐阅读
相关标签
  

闽ICP备14008679号