
Converting Excel to HTML and Reading PDF as Text in Python


I. Excel -> HTML

Requirement:

    Classify the data in an existing Excel document and send the result by email.


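Analysis:

    The processing works on an Excel file, and its result has to be shown in the body of the email.

Screenshot approach: open the resulting Excel file, take a screenshot, and add the screenshot to the email. Abandoned, because the Excel file could not be displayed.

File conversion approach: convert the resulting Excel file to HTML, which is the source format of an email body, and display that.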
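The article does not show the sending step. As a rough sketch of how a converted HTML string could be placed in an email body, the following uses the standard library's `email.message.EmailMessage`. The function name `build_html_email`, the addresses, and the stand-in table are all made up for illustration.

```python
from email.message import EmailMessage


def build_html_email(html_body, subject, sender, recipient):
    """Wrap an HTML string in a MIME message with a plain-text fallback."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    # Plain-text part for mail clients that cannot render HTML.
    msg.set_content("This message contains an HTML table.")
    # HTML alternative; the converted Excel table would go here.
    msg.add_alternative(html_body, subtype="html")
    return msg


if __name__ == "__main__":
    # Hypothetical addresses and a stand-in table, for illustration only.
    html = "<table><tr><td>OU</td><td>AS1</td></tr></table>"
    message = build_html_email(
        html, "Invoice summary", "sender@example.com", "receiver@example.com"
    )
    print(message.get_content_type())
```

The resulting message object can be handed to `smtplib.SMTP.send_message`; the server address and login depend on your mail setup and are left out here.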

Implementation:

    import dominate
    import openpyxl
    from dominate.tags import div, p, table, td, tr
    from openpyxl.cell.cell import Cell

    MERGED = "merge"  # placeholder for cells covered by a merged range

    TABLE_STYLE = (
        "border-collapse: collapse; border-color: rgb(102, 102, 102); "
        "border-width: 1px; border-style: solid;"
    )
    CELL_STYLE = (
        "padding-top: 1px; padding-right: 1px; padding-left: 1px; "
        "color: rgb(0, 0, 0); font-size: 14.6667px; font-weight: 700; "
        "font-style: normal; text-decoration: none solid rgb(0, 0, 0); "
        "font-family: 宋体; border: 1px solid rgb(102, 102, 102); "
        "background-color: rgba(0, 0, 0, 0);"
    )


    def excel_html(file_path):
        """Convert the active worksheet of an Excel file to an HTML string."""
        wb = openpyxl.load_workbook(file_path)
        ws = wb.active
        return to_html(get_data(ws))


    def get_data(ws):
        """Return, row by row, the visible cells with their rowspan and colspan."""
        # First pass: read the values. Cells that are not regular Cell objects
        # (the covered part of a merged range) become the placeholder.
        ws_data = []
        for row in ws:
            ws_data.append([cell.value if isinstance(cell, Cell) else MERGED for cell in row])

        # Second pass: for each visible cell, count the placeholders to its
        # right (colspan) and below it (rowspan).
        data = []
        for row_index, row in enumerate(ws_data):
            row_data = []
            for cell_index, cell in enumerate(row):
                if cell == MERGED:
                    continue
                col_span = 1
                # Start at the next cell; starting at the cell itself stops at once.
                for cell_back in row[cell_index + 1:]:
                    if cell_back != MERGED:
                        break
                    col_span += 1
                row_span = 1
                for row_back in ws_data[row_index + 1:]:
                    if row_back[cell_index] != MERGED:
                        break
                    row_span += 1
                row_data.append({"value": cell, "rowspan": row_span, "colspan": col_span})
            data.append(row_data)
        return data


    def to_html(data):
        """Render the span-annotated rows as an HTML table."""
        doc = dominate.document(title="excel-to-html")
        with doc:
            with div(id="excel_table").add(table(style=TABLE_STYLE)):
                for row in data:
                    table_row = tr()
                    for cell in row:
                        with table_row.add(td(style=CELL_STYLE, rowspan=cell["rowspan"],
                                              align="center", colspan=cell["colspan"])):
                            # Empty cells come back as None; render them as empty text.
                            p("" if cell["value"] is None else str(cell["value"]))
        return str(doc)


    if __name__ == "__main__":
        file_path = r"/Users/Young/Downloads/AS1-发票分类明细.xlsx"
        ret = excel_html(file_path)
        print(ret)
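To make the span counting in `get_data` easier to follow, here is a small sketch that applies the same idea to a plain list of lists, without openpyxl. It reuses the "merge" placeholder from the code above; the name `compute_spans` and the sample grid are made up for illustration.

```python
MERGED = "merge"  # same placeholder the article uses for merged-away cells


def compute_spans(grid):
    """Return, row by row, the visible cells with their rowspan and colspan."""
    result = []
    for r, row in enumerate(grid):
        visible = []
        for c, value in enumerate(row):
            if value == MERGED:
                continue  # covered by another cell, so no <td> of its own
            colspan = 1
            for right in row[c + 1:]:  # placeholders directly to the right
                if right != MERGED:
                    break
                colspan += 1
            rowspan = 1
            for below in grid[r + 1:]:  # placeholders directly below
                if below[c] != MERGED:
                    break
                rowspan += 1
            visible.append({"value": value, "rowspan": rowspan, "colspan": colspan})
        result.append(visible)
    return result


if __name__ == "__main__":
    # Hypothetical grid: "A" is merged with the cell to its right,
    # "C" is merged with the cell below it.
    grid = [
        ["A", MERGED, "B"],
        ["C", "D", "E"],
        [MERGED, "F", "G"],
    ]
    for row in compute_spans(grid):
        print(row)
```

One limitation of the placeholder approach: a placeholder does not record which merged range it belongs to, so a visible cell that sits next to a placeholder from a different range would be counted as spanning it. If that matters for your sheets, the exact ranges are available from openpyxl through `ws.merged_cells.ranges`.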

[Figure: the original Excel file]

Generated HTML:

    <!DOCTYPE html>
    <html>
      <head>
        <title>excel-to-html</title>
      </head>
      <body>
        <div id="excel_table">
          <table style="border-collapse: collapse; border-color: rgb(102, 102, 102); border-width: 1px; border-style: solid;">
            <tr>
              <td align="center" colspan="1" rowspan="1" style="padding-top: 1px; padding-right: 1px; padding-left: 1px; color: rgb(0, 0, 0); font-size: 14.6667px; font-weight: 700; font-style: normal; text-decoration: none solid rgb(0, 0, 0); font-family: 宋体; border: 1px solid rgb(102, 102, 102); background-color: rgba(0, 0, 0, 0);">
                <p>OU</p>
              </td>
              <td align="center" colspan="1" rowspan="1" style="padding-top: 1px; padding-right: 1px; padding-left: 1px; color: rgb(0, 0, 0); font-size: 14.6667px; font-weight: 700; font-style: normal; text-decoration: none solid rgb(0, 0, 0); font-family: 宋体; border: 1px solid rgb(102, 102, 102); background-color: rgba(0, 0, 0, 0);">
                <p>公司名称</p>
              </td>
              <td align="center" colspan="1" rowspan="1" style="padding-top: 1px; padding-right: 1px; padding-left: 1px; color: rgb(0, 0, 0); font-size: 14.6667px; font-weight: 700; font-style: normal; text-decoration: none solid rgb(0, 0, 0); font-family: 宋体; border: 1px solid rgb(102, 102, 102); background-color: rgba(0, 0, 0, 0);">
                <p>发票类型</p>
              </td>
              <td align="center" colspan="1" rowspan="1" style="padding-top: 1px; padding-right: 1px; padding-left: 1px; color: rgb(0, 0, 0); font-size: 14.6667px; font-weight: 700; font-style: normal; text-decoration: none solid rgb(0, 0, 0); font-family: 宋体; border: 1px solid rgb(102, 102, 102); background-color: rgba(0, 0, 0, 0);">
                <p>份数</p>
              </td>
            </tr>
            <tr>
              <td align="center" colspan="1" rowspan="2" style="padding-top: 1px; padding-right: 1px; padding-left: 1px; color: rgb(0, 0, 0); font-size: 14.6667px; font-weight: 700; font-style: normal; text-decoration: none solid rgb(0, 0, 0); font-family: 宋体; border: 1px solid rgb(102, 102, 102); background-color: rgba(0, 0, 0, 0);">
                <p>AS1</p>
              </td>
              <td align="center" colspan="1" rowspan="2" style="padding-top: 1px; padding-right: 1px; padding-left: 1px; color: rgb(0, 0, 0); font-size: 14.6667px; font-weight: 700; font-style: normal; text-decoration: none solid rgb(0, 0, 0); font-family: 宋体; border: 1px solid rgb(102, 102, 102); background-color: rgba(0, 0, 0, 0);">
                <p>XXX技术有限公司</p>
              </td>
              <td align="center" colspan="1" rowspan="1" style="padding-top: 1px; padding-right: 1px; padding-left: 1px; color: rgb(0, 0, 0); font-size: 14.6667px; font-weight: 700; font-style: normal; text-decoration: none solid rgb(0, 0, 0); font-family: 宋体; border: 1px solid rgb(102, 102, 102); background-color: rgba(0, 0, 0, 0);">
                <p></p>
              </td>
              <td align="center" colspan="1" rowspan="1" style="padding-top: 1px; padding-right: 1px; padding-left: 1px; color: rgb(0, 0, 0); font-size: 14.6667px; font-weight: 700; font-style: normal; text-decoration: none solid rgb(0, 0, 0); font-family: 宋体; border: 1px solid rgb(102, 102, 102); background-color: rgba(0, 0, 0, 0);">
                <p>8</p>
              </td>
            </tr>
            <tr>
              <td align="center" colspan="1" rowspan="1" style="padding-top: 1px; padding-right: 1px; padding-left: 1px; color: rgb(0, 0, 0); font-size: 14.6667px; font-weight: 700; font-style: normal; text-decoration: none solid rgb(0, 0, 0); font-family: 宋体; border: 1px solid rgb(102, 102, 102); background-color: rgba(0, 0, 0, 0);">
                <p></p>
              </td>
              <td align="center" colspan="1" rowspan="1" style="padding-top: 1px; padding-right: 1px; padding-left: 1px; color: rgb(0, 0, 0); font-size: 14.6667px; font-weight: 700; font-style: normal; text-decoration: none solid rgb(0, 0, 0); font-family: 宋体; border: 1px solid rgb(102, 102, 102); background-color: rgba(0, 0, 0, 0);">
                <p>5</p>
              </td>
            </tr>
          </table>
        </div>
      </body>
    </html>

[Figure: the rendered result]

II. PDF -> Text

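Requirement:

    Extract part of the information in a PDF file.

Approach:

    Convert the PDF to text first, then pull out the key information with regular expressions.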
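The regular-expression step is not shown in the article. A minimal sketch of what it could look like follows; the sample text, the field names, and the patterns are all hypothetical, since the real ones depend on the layout of the PDF being processed.

```python
import re

# Hypothetical text, standing in for the output of the PDF-to-text step.
SAMPLE_TEXT = """
Invoice No: 12345678
Date: 2021-03-15
Total: 1,280.50
"""

# Hypothetical field patterns; adjust them to the actual document.
PATTERNS = {
    "invoice_no": re.compile(r"Invoice No:\s*(\d+)"),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total": re.compile(r"Total:\s*([\d,]+\.\d{2})"),
}


def extract_fields(text, patterns=PATTERNS):
    """Return the first match of each pattern, or None when a field is absent."""
    fields = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


if __name__ == "__main__":
    print(extract_fields(SAMPLE_TEXT))
```

Returning `None` for a missing field, rather than raising, lets the caller decide whether an incomplete document is an error.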


PDF-to-text code:

    import io

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage


    class PDFParser(object):
        """PDF parser."""

        def __init__(self, pdf_file_path):
            """
            :param pdf_file_path: path to the PDF file
            """
            self.pdf_file_path = pdf_file_path

        def to_text(self):
            """
            Convert the PDF to text.
            :return: the text content of the file
            """
            rsrcmgr = PDFResourceManager()
            retstr = io.StringIO()
            # No codec is passed because the output goes to a text buffer (StringIO).
            device = TextConverter(rsrcmgr, retstr, laparams=LAParams())
            # Create a PDF interpreter object.
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            # The with-block closes the file even if a page fails to parse.
            with open(self.pdf_file_path, "rb") as fp:
                # Process each page contained in the document.
                for page in PDFPage.get_pages(fp):
                    interpreter.process_page(page)
            return retstr.getvalue()


    if __name__ == "__main__":
        file_path = r"/Users/Young/Downloads/数据库和缓存.pdf"
        pdf = PDFParser(file_path)
        ret = pdf.to_text()
        print(ret)
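If the resource manager and interpreter objects are not needed, pdfminer.six also ships a high-level helper, `pdfminer.high_level.extract_text`, that covers the same steps. The sketch below is an alternative to the class above, not the article's own method; the wrapper name `pdf_to_text` is made up for illustration.

```python
def pdf_to_text(pdf_file_path):
    """Return the text of a PDF using pdfminer.six's high-level helper."""
    # Imported here so this module still loads where pdfminer.six is absent.
    from pdfminer.high_level import extract_text

    return extract_text(pdf_file_path)


if __name__ == "__main__":
    import sys

    # Usage: python pdf_to_text.py some.pdf
    if len(sys.argv) > 1:
        print(pdf_to_text(sys.argv[1]))
```

The explicit version in the article is still the better fit when pages need to be processed one at a time or the layout parameters tuned per document.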

[Figures: the extraction result]
