当前位置:   article > 正文

Python解析Word文档的自动编号_python 获取wps编号

python 获取wps编号

关于自动编号的知识可以参考《在 Open XML WordprocessingML 中使用编号列表

链接:https://learn.microsoft.com/zh-cn/previous-versions/office/ee922775(v=office.14)

python-docx库并不能直接解析出Word文档的自动编号,因为原理较为复杂,但我们希望python能够读取自动编号对应的文本。

基本解析原理

为了测试验证,我们创建一个带有编号的文档进行测试,例如:

image-20240612115513012

然后我们先看看主文档中,对应的xml存储:

from docx import Document

doc = Document(r"编号测试1.docx")
for paragraph in doc.paragraphs:
    print(paragraph._element.xml)
    break
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6

结果:

<w:p ...>
  <w:pPr>
    <w:numPr>
      <w:ilvl w:val="0"/>
      <w:numId w:val="1"/>
    </w:numPr>
    <w:bidi w:val="0"/>
    <w:ind w:left="0" w:leftChars="0" w:firstLine="0" w:firstLineChars="0"/>
    <w:rPr>
      <w:rFonts w:hint="eastAsia"/>
      <w:lang w:val="en-US" w:eastAsia="zh-CN"/>
    </w:rPr>
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:rFonts w:hint="eastAsia"/>
      <w:lang w:val="en-US" w:eastAsia="zh-CN"/>
    </w:rPr>
    <w:t>第一章</w:t>
  </w:r>
</w:p>
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21

在微软的文档中,说明了最重要的部分:

w:numPr 元素包含自动编号元素。w:ilvl 元素从零开始表示编号等级,w:numId 元素是编号部件的索引。

w:numId 为 0 值时 ,表示编号已经被删除段落不含列表项。

所以我们可以根据段落是否存在w:numPr并且w:numId的值不为0判断段落是否存在自动编号。

然后我们需要获取每个w:numId对应的自动编号状态,这个信息存储在zip压缩包的\word\numbering.xml文件中,可以参考微软文档的示例:

image-20240612125602707

w:numbering同时包含w:numw:abstractNum两种节点,其中w:num记录了 每个numId对应的abstractNumId,而w:abstractNum记录了每个abstractNumId对应的编号格式,包含了每个级别的编号样式信息。对于w:num,python-docx库已经帮我们解析好,可以直接读取,但w:abstractNum节点python-docx库却并未进行解析,只能我们自己进行xml解析。

可以通过如下代码获取每个numId对应的abstractNumId

from docx import Document

doc = Document(r"编号测试1.docx")
numbering_part = doc.part.numbering_part._element
numId2abstractId = {
    num.numId: num.abstractNumId.val for num in numbering_part.num_lst
}
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7

接下来我们需要解析w:abstractNum节点,查阅python-docx库的源码可以知道,它使用lxml的etree进行xml解析。

初步解析代码为:

from docx.oxml.ns import qn

abstractNumId2style = {}
for abstractNumIdTag in numbering_part.findall(qn("w:abstractNum")):
    abstractNumId = abstractNumIdTag.get(qn("w:abstractNumId"))
    for lvlTag in abstractNumIdTag.findall(qn("w:lvl")):
        ilvl = lvlTag.get(qn("w:ilvl"))
        style = {tag.tag[tag.tag.rfind("}") + 1:]: tag.get(qn("w:val"))
                 for tag in lvlTag.xpath("./*[@w:val]", namespaces=numbering_part.nsmap)}
        abstractNumId2style[(int(abstractNumId), int(ilvl))] = style
print(abstractNumId2style)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11

注意:docx.oxml.ns的qn函数可以将w:转换为对应的命名空间名称,但对于xpath表达式却无法正确处理,所以对于xpath表达式使用namespaces传入对应的命名空间。

除了上面的解析方法以外,还可以事先将节点的所有命名空间清除后再解析,清除代码如下:

def remove_namespace(node):
 node_tag = node.tag
 if '}' in node_tag:
     node.tag = node_tag[node_tag.rfind("}") + 1:]
 for attr_key in list(node.attrib):
     if '}' in attr_key:
         new_attr_key = attr_key[attr_key.rfind("}") + 1:]
         node.attrib[new_attr_key] = node.attrib.pop(attr_key)
 for child in node:
     remove_namespace(child)
 return node
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
'
运行

这样可以递归消除目标节点所有子节点的命名空间。

可以每个类别每个级别的自动编号的属性信息:

{(0, 0): {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.', 'lvlJc': 'left'}, (0, 1): {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.', 'lvlJc': 'left'}, (0, 2): {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.%3.', 'lvlJc': 'left'}, (0, 3): {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.%3.%4.', 'lvlJc': 'left'}, (0, 4): {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.%3.%4.%5.', 'lvlJc': 'left'}, (0, 5): {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.%3.%4.%5.%6.', 'lvlJc': 'left'}, (0, 6): {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.%3.%4.%5.%6.%7.', 'lvlJc': 'left'}, (0, 7): {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.%3.%4.%5.%6.%7.%8.', 'lvlJc': 'left'}, (0, 8): {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.%3.%4.%5.%6.%7.%8.%9.', 'lvlJc': 'left'}}
  • 1

当然我们只测试了最基本的数值型自动编号,有些自动编号对应的节点没有直接的w:numFmt节点,解析代码还需针对性调整。

微软的文档中提到,对多级列表的某一级列表进行特殊设定时,w:num内会出现w:lvlOverride节点,但本人使用wps反复测试过后并没有出现。估计这种格式的xml只会在老版的office中出现,而且我们也不会故意在多级列表的某一级进行特殊设定,所以我们不考虑这种情况。

还需要考虑 w:suff 元素控制的列表后缀,即列表项与段落之间的空白内容,有可能为制表符和空格,也可以什么都没有。处理代码为:

{"space": " ", "nothing": ""}.get(style.get("suff"), "\t")
  • 1

多级编号处理

首先尝试读取每个段落对应的自动编号样式:

for paragraph in doc.paragraphs:
    numpr = paragraph._element.pPr.numPr
    if numpr is not None and numpr.numId.val != 0:
        numId = numpr.numId.val
        ilvl = numpr.ilvl.val
        abstractId = numId2abstractId[numId]
        style = abstractNumId2style[(abstractId, ilvl)]
        print(style)
    print(paragraph.text)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9

结果:

{'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.', 'lvlJc': 'left'}
第一章
{'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.', 'lvlJc': 'left'}
第一节
{'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.', 'lvlJc': 'left'}
第二节
{'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.%3.', 'lvlJc': 'left'}
第一条
{'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.%3.', 'lvlJc': 'left'}
第二条
{'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.', 'lvlJc': 'left'}
第二章
{'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.', 'lvlJc': 'left'}
第三章
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14

我们需要一个计数器来记录每个样式出现的次数,从而生成其对应的编号。

cache = {}
for paragraph in doc.paragraphs:
    numpr = paragraph._element.pPr.numPr
    lvlText = ""
    if numpr is not None and numpr.numId.val != 0:
        numId = numpr.numId.val
        ilvl = numpr.ilvl.val
        abstractId = numId2abstractId[numId]
        style = abstractNumId2style[(abstractId, ilvl)]
        if (abstractId, ilvl) in cache:
            cache[(abstractId, ilvl)] += 1
        else:
            cache[(abstractId, ilvl)] = int(style["start"])
        lvlText = style.get("lvlText")
        for i in range(0, ilvl + 1):
            lvlText = lvlText.replace(f'%{i + 1}', str(cache[(abstractId, i)]))
        suff_text = {"space": " ", "nothing": ""}.get(style.get("suff"), "\t")
        lvlText += suff_text
    print(lvlText + paragraph.text)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19

结果:

1.	第一章
1.1.	第一节
1.2.	第二节
1.2.1.	第一条
1.2.2.	第二条
2.	第二章
3.	第三章
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7

各种其他类型的编号生成

为了尽量多的支持更多类型的编号,我创建了如下测试文件:

image-20240612161333048

我们没有必要获取对应的圆圈数字,圆圈就获取对应的整数。

除了三种日文编号,上面的示例几乎包含所有的编号类型。需要注意三位数以上的数字格式,其xml有些特殊,例如:

<w:lvl>
  <w:start w:val="1"/>
  <mc:AlternateContent>
    <mc:Choice Requires="w14">
      <w:numFmt w:val="custom" w:format="001, 002, 003, ..."/>
    </mc:Choice>
    <mc:Fallback>
      <w:numFmt w:val="decimal"/>
    </mc:Fallback>
  </mc:AlternateContent>
  <w:suff w:val="space"/>
  <w:lvlText w:val="%1"/>
  <w:lvlJc w:val="left"/>
  <w:pPr>
    <w:tabs>
      <w:tab w:val="left" w:pos="0"/>
    </w:tabs>
  </w:pPr>
  <w:rPr>
    <w:rFonts w:hint="default"/>
  </w:rPr>
</w:lvl>
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22

基于此,解析格式的代码也作出如下调整:

abstractNumId2style = {}
for abstractNumIdTag in numbering_part.findall(qn("w:abstractNum")):
    abstractNumId = abstractNumIdTag.get(qn("w:abstractNumId"))
    for lvlTag in abstractNumIdTag.findall(qn("w:lvl")):
        ilvl = lvlTag.get(qn("w:ilvl"))
        style = {tag.tag[tag.tag.rfind("}") + 1:]: tag.get(qn("w:val"))
                 for tag in lvlTag.xpath("./*[@w:val]", namespaces=numbering_part.nsmap)}
        if "numFmt" not in style:
            numFmtVal = lvlTag.xpath("./mc:AlternateContent/mc:Fallback/w:numFmt/@w:val",
                                     namespaces=numbering_part.nsmap)
            if numFmtVal and numFmtVal[0] == "decimal":
                numFmt_format = lvlTag.xpath("./mc:AlternateContent/mc:Choice/w:numFmt/@w:format",
                                             namespaces=numbering_part.nsmap)
                if numFmt_format:
                    style["numFmt"] = "decimal" + numFmt_format[0].split(",")[0]
        if style.get("numFmt") == "decimalZero":
            style["numFmt"] = "decimal01"
        abstractNumId2style[(int(abstractNumId), int(ilvl))] = style
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18

目前只发现这种基于decimal的格式,所以只针对这种自定义格式处理,其他类型的统一认为是没有自动编号。另外既然三位数的整数格式已经被我们命名为decimal001,那么也将二位数的decimalZero修改为decimal01

目前测试出这个文件有以下这些numFmt

bullet,cardinalText,chineseCounting,chineseLegalSimplified,decimal,decimalEnclosedCircleChinese,ideographTraditional,ideographZodiac,lowerLetter,lowerRoman,ordinal,ordinalText,upperLetter,upperRoman
  • 1

下面我们预先选择一些可能比较复杂的转换编写相应的函数:

正整数转换为大写字母

代码如下:

def int2upperLetter(num):
    result = []
    while num > 0:
        num -= 1
        remainder = num % 26
        result.append(chr(remainder + ord('A')))
        num //= 26
    return "".join(reversed(result))
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
'
运行

正整数转换为罗马数字

def int2upperRoman(num):
    t = [
        (1000, 'M'), (900, 'CM'), (500, 'D'),
        (400, 'CD'), (100, 'C'), (90, 'XC'),
        (50, 'L'), (40, 'XL'),  (10, 'X'),
        (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I')
    ]
    roman_num = ''
    i = 0
    while num > 0:
        val, syb = t[i]
        for _ in range(num // val):
            roman_num += syb
            num -= val
        i += 1
    return roman_num
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
'
运行

正整数转换为英文基数字

def int2cardinalText(num):
    if not isinstance(num, int) or num < 0 or num > 999999999999:
        raise ValueError(
            "Invalid number: must be a positive integer within four digits")
    base = ["Zero", "One", "Two", "Three", "Four", "Five", "Six",
            "Seven", "Eight", "Nine", "Ten", "Eleven", "Twelve", "Thirteen", "Fourteen",
            "Fifteen", "Sixteen", "Seventeen", "Eighteen", "Nineteen"]
    tens = ["", "", "Twenty", "Thirty", "Fourty",
            "Fifty", "Sixty", "Seventy", "Eighty", "Ninety"]
    thousands = ["", "Thousand", "Million", "Billion"]

    def two_digits(n):
        if n < 20:
            return base[n]
        ten, unit = divmod(n, 10)
        if unit == 0:
            return f"{tens[ten]}"
        else:
            return f"{tens[ten]}-{base[unit]}"

    def three_digits(n):
        hundred, rest = divmod(n, 100)
        if hundred == 0:
            return two_digits(rest)
        result = f"{base[hundred]} hundred "
        if rest > 0:
            result += two_digits(rest)
        return result.strip()
    if num < 99:
        return two_digits(num)
    chunks = []
    while num > 0:
        num, remainder = divmod(num, 1000)
        chunks.append(remainder)
    words = []
    for i in range(len(chunks) - 1, -1, -1):
        if chunks[i] == 0:
            continue
        chunk_word = three_digits(chunks[i])
        if thousands[i]:
            chunk_word += f" {thousands[i]}"
        words.append(chunk_word)
    words = " ".join(words).lower()
    return words[0].upper()+words[1:]
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
  • 41
  • 42
  • 43
  • 44
'
运行

正整数转换为英文序数字

def int2ordinalText(num):
    if not isinstance(num, int) or num < 0 or num > 999999:
        raise ValueError(
            "Invalid number: must be a positive integer within four digits")
    base = ["Zero", "One", "Two", "Three", "Four", "Five", "Six",
            "Seven", "Eight", "Nine", "Ten", "Eleven", "Twelve", "Thirteen", "Fourteen",
            "Fifteen", "Sixteen", "Seventeen", "Eighteen", "Nineteen"]
    baseth = ['Zeroth', 'First', 'Second', 'Third', 'Fourth', 'Fifth', 'Sixth', 'Seventh',
              'Eighth', 'Ninth', 'Tenth', 'Eleventh', 'Twelfth', 'Thirteenth', 'Fourteenth',
              'Fifteenth', 'Sixteenth', 'Seventeenth', 'Eighteenth', 'Nineteenth', 'Twentieth']
    tens = ["", "", "Twenty", "Thirty", "Fourty",
            "Fifty", "Sixty", "Seventy", "Eighty", "Ninety"]
    tensth = ["", "", "Twentieth", "Thirtieth", "Fortieth",
              "Fiftieth", "Sixtieth", "Seventieth", "Eightieth", "Ninetieth"]

    def two_digits(n):
        if n <= 20:
            return baseth[n]
        ten, unit = divmod(n, 10)
        result = tensth[ten]
        if unit != 0:
            result = f"{tens[ten]}-{baseth[unit]}"
        return result

    thousand, num = divmod(num, 1000)
    result = []
    if thousand > 0:
        if num == 0:
            return f"{int2cardinalText(thousand)} thousandth"
        result.append(f"{int2cardinalText(thousand)} thousand")
    hundred, num = divmod(num, 100)
    if hundred > 0:
        if num == 0:
            result.append(f"{base[hundred]} hundredth")
            return " ".join(result)
        result.append(f"{base[hundred]} hundred")
    result.append(two_digits(num))
    result = " ".join(result).lower()
    return result[0].upper() + result[1:]
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
'
运行

会复用前面的基数字转换规则。

正整数转换为中文数字

import re


def int2Chinese(num, ch_num, units):
    if not (0 <= num <= 99999999):
        raise ValueError("仅支持小于一亿以内的正整数")

    def int2Chinese_in(num, ch_num, units):
        if not (0 <= num <= 9999):
            raise ValueError("仅支持小于一万以内的正整数")
        result = [ch_num[int(i)] + unit for i, unit in zip(reversed(str(num).zfill(4)), units)]
        result = "".join(reversed(result))
        zero_char = ch_num[0]
        result = re.sub(f"(?:{zero_char}[{units}])+", zero_char, result)
        result = result.rstrip(units[0])
        if result != zero_char:
            result = result.rstrip(zero_char)
        if result.lstrip(zero_char).startswith("一十"):
            result = result.replace("一", "")
        return result

    if num < 10000:
        result = int2Chinese_in(num, ch_num, units)
    else:
        left = num // 10000
        right = num % 10000
        result = int2Chinese_in(left, ch_num, units) + "万" + int2Chinese_in(right, ch_num, units)
    if result != ch_num[0]:
        result = result.strip(ch_num[0])
    return result


def int2ChineseCounting(num):
    return int2Chinese(num, ch_num='〇一二三四五六七八九', units='个十百千')


def int2ChineseLegalSimplified(num):
    return int2Chinese(num, ch_num='零壹贰叁肆伍陆柒捌玖', units='个拾佰仟')
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
'
运行

整体封装并改进

最终封装成为一个类:

import logging
import re
from io import BytesIO

from PIL import Image
from docx import Document, ImagePart
from docx.oxml.ns import qn, nsmap
from docx.text.paragraph import Paragraph
from functools import lru_cache


class WithNumberDocxReader:
    ideographTraditional = "甲乙丙丁戊己庚辛壬癸"
    ideographZodiac = "子丑寅卯辰巳午未申酉戌亥"

    def __init__(self, docx, gap_text="\t"):
        self.parts = []
        self.docx = Document(docx)
        nsmap.update(self.docx.element.nsmap)
        self.numId2style = self.get_style_data()
        self.gap_text = gap_text
        self.cnt = {}
        self.cache = {}
        self.result = []

    @property
    def texts(self):
        if self.result:
            return self.result.copy()
        self.clear()
        for paragraph in self.paragraphs:
            number_text = self.get_number_text(paragraph)
            line = number_text + paragraph.text.strip()
            if not line:
                continue
            self.result.append(line)
        return self.result.copy()

    def clear(self):
        self.result.clear()
        self.cnt.clear()
        self.cache.clear()

    @property
    @lru_cache
    def paragraphs(self):
        body = self.docx.element.body
        result = []
        for p in body.xpath('w:p | w:sdt/w:sdtContent/w:p | w:p//v:textbox//w:p'):
            result.append(Paragraph(p, body))
        return result

    @property
    def images(self):
        if self.parts:
            return self.parts.copy()
        related_parts = self.docx.part.related_parts
        for i, paragraph in enumerate(self.paragraphs, 1):
            for run in paragraph.runs:
                for drawing in run.element.drawing_lst:
                    rid = drawing.xpath(".//a:blip/@r:embed")
                    if not rid or rid[0] not in related_parts:
                        continue
                    part = related_parts[rid[0]]
                    if isinstance(part, ImagePart):
                        self.parts.append((i, part.partname, part.blob))
        return self.parts

    def get_style_data(self):
        try:
            numbering_part = self.docx.part.numbering_part._element
            abstractId2numId = {num.abstractNumId.val: num.numId for num in numbering_part.num_lst}
            numId2style = {}
            for abstractNumIdTag in numbering_part.findall(qn("w:abstractNum")):
                abstractNumId = abstractNumIdTag.get(qn("w:abstractNumId"))
                numId = abstractId2numId[int(abstractNumId)]
                for lvlTag in abstractNumIdTag.findall(qn("w:lvl")):
                    ilvl = lvlTag.get(qn("w:ilvl"))
                    style = {tag.tag[tag.tag.rfind("}") + 1:]: tag.get(qn("w:val"))
                             for tag in lvlTag.xpath("./*[@w:val]", namespaces=nsmap)}
                    if "numFmt" not in style:
                        numFmtVal = lvlTag.xpath("./mc:AlternateContent/mc:Fallback/w:numFmt/@w:val",
                                                 namespaces=nsmap)
                        if numFmtVal and numFmtVal[0] == "decimal":
                            numFmt_format = lvlTag.xpath("./mc:AlternateContent/mc:Choice/w:numFmt/@w:format",
                                                         namespaces=nsmap)
                            if numFmt_format:
                                style["numFmt"] = "decimal" + numFmt_format[0].split(",")[0]
                    if style.get("numFmt") == "decimalZero":
                        style["numFmt"] = "decimal01"
                    numId2style[(numId, int(ilvl))] = style
            return numId2style
        except Exception as e:
            logging.warning("读取自动编号出错:" + e.__class__.__name__)

    @staticmethod
    def int2upperLetter(num):
        result = []
        while num > 0:
            num -= 1
            remainder = num % 26
            result.append(chr(remainder + ord('A')))
            num //= 26
        return "".join(reversed(result))

    @staticmethod
    def int2upperRoman(num):
        t = [
            (1000, 'M'), (900, 'CM'), (500, 'D'),
            (400, 'CD'), (100, 'C'), (90, 'XC'),
            (50, 'L'), (40, 'XL'), (10, 'X'),
            (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I')
        ]
        roman_num = ''
        i = 0
        while num > 0:
            val, syb = t[i]
            for _ in range(num // val):
                roman_num += syb
                num -= val
            i += 1
        return roman_num

    @staticmethod
    def int2cardinalText(num):
        if not isinstance(num, int) or num < 0 or num > 999999999:
            raise ValueError(
                "Invalid number: must be a positive integer within four digits")
        base = ["Zero", "One", "Two", "Three", "Four", "Five", "Six",
                "Seven", "Eight", "Nine", "Ten", "Eleven", "Twelve", "Thirteen", "Fourteen",
                "Fifteen", "Sixteen", "Seventeen", "Eighteen", "Nineteen"]
        tens = ["", "", "Twenty", "Thirty", "Fourty",
                "Fifty", "Sixty", "Seventy", "Eighty", "Ninety"]
        thousands = ["", "Thousand", "Million", "Billion"]

        def two_digits(n):
            if n < 20:
                return base[n]
            ten, unit = divmod(n, 10)
            if unit == 0:
                return f"{tens[ten]}"
            else:
                return f"{tens[ten]}-{base[unit]}"

        def three_digits(n):
            hundred, rest = divmod(n, 100)
            if hundred == 0:
                return two_digits(rest)
            result = f"{base[hundred]} hundred "
            if rest > 0:
                result += two_digits(rest)
            return result.strip()

        if num < 99:
            return two_digits(num)
        chunks = []
        while num > 0:
            num, remainder = divmod(num, 1000)
            chunks.append(remainder)
        words = []
        for i in range(len(chunks) - 1, -1, -1):
            if chunks[i] == 0:
                continue
            chunk_word = three_digits(chunks[i])
            if thousands[i]:
                chunk_word += f" {thousands[i]}"
            words.append(chunk_word)
        words = " ".join(words).lower()
        return words[0].upper() + words[1:]

    @staticmethod
    def int2ordinalText(num):
        if not isinstance(num, int) or num < 0 or num > 999999:
            raise ValueError(
                "Invalid number: must be a positive integer within four digits")
        base = ["Zero", "One", "Two", "Three", "Four", "Five", "Six",
                "Seven", "Eight", "Nine", "Ten", "Eleven", "Twelve", "Thirteen", "Fourteen",
                "Fifteen", "Sixteen", "Seventeen", "Eighteen", "Nineteen"]
        baseth = ['Zeroth', 'First', 'Second', 'Third', 'Fourth', 'Fifth', 'Sixth', 'Seventh',
                  'Eighth', 'Ninth', 'Tenth', 'Eleventh', 'Twelfth', 'Thirteenth', 'Fourteenth',
                  'Fifteenth', 'Sixteenth', 'Seventeenth', 'Eighteenth', 'Nineteenth', 'Twentieth']
        tens = ["", "", "Twenty", "Thirty", "Fourty",
                "Fifty", "Sixty", "Seventy", "Eighty", "Ninety"]
        tensth = ["", "", "Twentieth", "Thirtieth", "Fortieth",
                  "Fiftieth", "Sixtieth", "Seventieth", "Eightieth", "Ninetieth"]

        def two_digits(n):
            if n <= 20:
                return baseth[n]
            ten, unit = divmod(n, 10)
            result = tensth[ten]
            if unit != 0:
                result = f"{tens[ten]}-{baseth[unit]}"
            return result

        thousand, num = divmod(num, 1000)
        result = []
        if thousand > 0:
            if num == 0:
                return f"{WithNumberDocxReader.int2cardinalText(thousand)} thousandth"
            result.append(f"{WithNumberDocxReader.int2cardinalText(thousand)} thousand")
        hundred, num = divmod(num, 100)
        if hundred > 0:
            if num == 0:
                result.append(f"{base[hundred]} hundredth")
                return " ".join(result)
            result.append(f"{base[hundred]} hundred")
        result.append(two_digits(num))
        result = " ".join(result).lower()
        return result[0].upper() + result[1:]

    @staticmethod
    def int2Chinese(num, ch_num, units):
        if not (0 <= num <= 99999999):
            raise ValueError("仅支持小于一亿以内的正整数")

        def int2Chinese_in(num, ch_num, units):
            if not (0 <= num <= 9999):
                raise ValueError("仅支持小于一万以内的正整数")
            result = [ch_num[int(i)] + unit for i, unit in zip(reversed(str(num).zfill(4)), units)]
            result = "".join(reversed(result))
            zero_char = ch_num[0]
            result = re.sub(f"(?:{zero_char}[{units}])+", zero_char, result)
            result = result.rstrip(units[0])
            if result != zero_char:
                result = result.rstrip(zero_char)
            if result.lstrip(zero_char).startswith("一十"):
                result = result.replace("一", "")
            return result

        if num < 10000:
            result = int2Chinese_in(num, ch_num, units)
        else:
            left = num // 10000
            right = num % 10000
            result = int2Chinese_in(left, ch_num, units) + "万" + int2Chinese_in(right, ch_num, units)
        if result != ch_num[0]:
            result = result.strip(ch_num[0])
        return result

    @staticmethod
    def int2ChineseCounting(num):
        return WithNumberDocxReader.int2Chinese(num, ch_num='〇一二三四五六七八九', units='个十百千')

    @staticmethod
    def int2ChineseLegalSimplified(num):
        return WithNumberDocxReader.int2Chinese(num, ch_num='零壹贰叁肆伍陆柒捌玖', units='个拾佰仟')

    def get_number_text(self, paragraph):
        if self.numId2style is None:
            return ""
        pr = paragraph._element.pPr
        if pr is None:
            return ""
        numpr = pr.numPr
        if numpr is None or numpr.numId.val == 0:
            return ""
        numId = numpr.numId.val
        ilvl = numpr.ilvl.val
        style = self.numId2style[(numId, ilvl)]
        numFmt: str = style.get("numFmt")
        lvlText = style.get("lvlText")
        isTxbxContent = paragraph._element.getparent().tag.endswith("txbxContent")
        for a, b, c in list(self.cnt.keys()):
            if a == numId and c == isTxbxContent and b > ilvl:
                del self.cnt[(a, b, c)]
        pos_key = (numId, ilvl, isTxbxContent)
        if pos_key in self.cnt:
            self.cnt[pos_key] += 1
        else:
            self.cnt[pos_key] = int(style["start"])
        pos = self.cnt[pos_key]
        num_text = str(pos)
        if numFmt.startswith('decimal'):
            num_text = num_text.zfill(numFmt.count("0") + 1)
        elif numFmt == 'upperRoman':
            num_text = self.int2upperRoman(pos)
        elif numFmt == 'lowerRoman':
            num_text = self.int2upperRoman(pos).lower()
        elif numFmt == 'upperLetter':
            num_text = self.int2upperLetter(pos)
        elif numFmt == 'lowerLetter':
            num_text = self.int2upperLetter(pos).lower()
        elif numFmt == 'ordinal':
            num_text = f"{pos}{'th' if 11 <= pos <= 13 else {1: 'st', 2: 'nd', 3: 'rd'}.get(pos % 10, 'th')}"
        elif numFmt == 'cardinalText':
            num_text = self.int2cardinalText(pos)
        elif numFmt == 'ordinalText':
            num_text = self.int2ordinalText(pos)
        elif numFmt == 'ideographTraditional':
            if 1 <= pos <= 10:
                num_text = self.ideographTraditional[pos - 1]
        elif numFmt == 'ideographZodiac':
            if 1 <= pos <= 12:
                num_text = self.ideographZodiac[pos - 1]
        elif numFmt == 'chineseCounting':
            num_text = self.int2ChineseCounting(pos)
        elif numFmt == 'chineseLegalSimplified':
            num_text = self.int2ChineseLegalSimplified(pos)
        elif numFmt == 'decimalEnclosedCircleChinese':
            pass
        self.cache[pos_key] = num_text
        for i in range(0, ilvl + 1):
            lvlText = lvlText.replace(f'%{i + 1}', self.cache.get((numId, i, isTxbxContent), ""))
        suff_text = {"space": " ", "nothing": ""}.get(style.get("suff"), self.gap_text)
        lvlText += suff_text
        return lvlText


if __name__ == '__main__':
    doc = WithNumberDocxReader(r"编号测试1.docx", " ")
    for text in doc.texts:
        print(text)
    for i, name, image_bytes in doc.images:
        print(i, name)
        image = Image.open(BytesIO(image_bytes))
        image.show()

  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
  • 41
  • 42
  • 43
  • 44
  • 45
  • 46
  • 47
  • 48
  • 49
  • 50
  • 51
  • 52
  • 53
  • 54
  • 55
  • 56
  • 57
  • 58
  • 59
  • 60
  • 61
  • 62
  • 63
  • 64
  • 65
  • 66
  • 67
  • 68
  • 69
  • 70
  • 71
  • 72
  • 73
  • 74
  • 75
  • 76
  • 77
  • 78
  • 79
  • 80
  • 81
  • 82
  • 83
  • 84
  • 85
  • 86
  • 87
  • 88
  • 89
  • 90
  • 91
  • 92
  • 93
  • 94
  • 95
  • 96
  • 97
  • 98
  • 99
  • 100
  • 101
  • 102
  • 103
  • 104
  • 105
  • 106
  • 107
  • 108
  • 109
  • 110
  • 111
  • 112
  • 113
  • 114
  • 115
  • 116
  • 117
  • 118
  • 119
  • 120
  • 121
  • 122
  • 123
  • 124
  • 125
  • 126
  • 127
  • 128
  • 129
  • 130
  • 131
  • 132
  • 133
  • 134
  • 135
  • 136
  • 137
  • 138
  • 139
  • 140
  • 141
  • 142
  • 143
  • 144
  • 145
  • 146
  • 147
  • 148
  • 149
  • 150
  • 151
  • 152
  • 153
  • 154
  • 155
  • 156
  • 157
  • 158
  • 159
  • 160
  • 161
  • 162
  • 163
  • 164
  • 165
  • 166
  • 167
  • 168
  • 169
  • 170
  • 171
  • 172
  • 173
  • 174
  • 175
  • 176
  • 177
  • 178
  • 179
  • 180
  • 181
  • 182
  • 183
  • 184
  • 185
  • 186
  • 187
  • 188
  • 189
  • 190
  • 191
  • 192
  • 193
  • 194
  • 195
  • 196
  • 197
  • 198
  • 199
  • 200
  • 201
  • 202
  • 203
  • 204
  • 205
  • 206
  • 207
  • 208
  • 209
  • 210
  • 211
  • 212
  • 213
  • 214
  • 215
  • 216
  • 217
  • 218
  • 219
  • 220
  • 221
  • 222
  • 223
  • 224
  • 225
  • 226
  • 227
  • 228
  • 229
  • 230
  • 231
  • 232
  • 233
  • 234
  • 235
  • 236
  • 237
  • 238
  • 239
  • 240
  • 241
  • 242
  • 243
  • 244
  • 245
  • 246
  • 247
  • 248
  • 249
  • 250
  • 251
  • 252
  • 253
  • 254
  • 255
  • 256
  • 257
  • 258
  • 259
  • 260
  • 261
  • 262
  • 263
  • 264
  • 265
  • 266
  • 267
  • 268
  • 269
  • 270
  • 271
  • 272
  • 273
  • 274
  • 275
  • 276
  • 277
  • 278
  • 279
  • 280
  • 281
  • 282
  • 283
  • 284
  • 285
  • 286
  • 287
  • 288
  • 289
  • 290
  • 291
  • 292
  • 293
  • 294
  • 295
  • 296
  • 297
  • 298
  • 299
  • 300
  • 301
  • 302
  • 303
  • 304
  • 305
  • 306
  • 307
  • 308
  • 309
  • 310
  • 311
  • 312
  • 313
  • 314
  • 315
  • 316
  • 317
  • 318

调用测试:

if __name__ == '__main__':
    doc = WithNumberDocxReader(r"编号测试2.docx", "")
    for text in doc.texts:
        print(text)
  • 1
  • 2
  • 3
  • 4

顺利达到打印出对应的字符:

点符
1.十进制数
01.零加十进制数
001 零零加十进制数
0001 零零零加十进制数
I 大写罗马数字 (I)
II 大写罗马数字 (II)
i 小写罗马数字
A.大写字母A
a 小写字母 (a)
0th 序数 (1st, 2nd, 3rd)
Twelve 基数字 (One, Two Three)
First 序数字 (First, Second, Third)

癸 甲乙丙丁戊己庚辛壬癸
壹 中文大写数字
10 圆圈数字
子 子丑寅卯辰巳午未申酉戌亥

第一章 中文数字
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/码创造者/article/detail/1001642
推荐阅读
相关标签
  

闽ICP备14008679号