https://lxml.de/api/lxml-module.html
This series focuses on the etree module of lxml.
The lxml.etree module implements the extended ElementTree API for XML.
from lxml import etree
test = etree.Element('root', attrib={'Test': 'Try'})  # returns an Element object
print(test)
<Element root at 0x54dfb08>
a = etree.SubElement(test, 'a', attrib={'x': '123'})
print(a)
<Element a at 0x5041e88>
print(etree.tostring(test, pretty_print=True))  # pretty-printed output for readability
res = etree.tostring(test)
print(res)
print('type(res) = ', type(res))  # etree.tostring() returns a str in Python 2 and <class 'bytes'> in Python 3; call decode() to get a str
print(res.decode('utf-8'))
print("type(res.decode('utf-8')) = ", type(res.decode('utf-8')))
b'<root Test="Try">\n <a x="123"/>\n</root>\n'
b'<root Test="Try"><a x="123"/></root>'
type(res) = <class 'bytes'>
<root Test="Try"><a x="123"/></root>
type(res.decode('utf-8')) = <class 'str'>
etree.dump(test)
<root Test="Try">
<a x="123"/>
</root>
etree.dump(test, pretty_print=True)
<root Test="Try">
<a x="123"/>
</root>
etree.iselement(test)  # check whether it is an Element object
True
etree.get_default_parser()
<lxml.etree.XMLParser at 0x4fecb90>
set_default_parser(parser=None)
Sets the default parser.
Set a default parser for the current thread. This parser is used globally whenever no parser is supplied to the various parse functions of the lxml API. If this function is called without a parser (or if it is None), the default parser is reset to the original configuration.
Note that the pre-installed default parser is not thread-safe. Avoid the default parser in multi-threaded environments. You can create a separate parser for each thread explicitly or use a parser pool.
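One way to follow this advice is to keep one parser per thread and always pass it explicitly. The following is only a minimal sketch: the helper names get_parser and parse_fragment and the use of threading.local are my own assumptions, not part of lxml.

```python
import threading
from lxml import etree

_local = threading.local()  # hypothetical per-thread storage, not an lxml feature


def get_parser():
    """Return a parser owned by the calling thread, creating it on first use."""
    parser = getattr(_local, "parser", None)
    if parser is None:
        parser = etree.XMLParser(remove_blank_text=True)
        _local.parser = parser
    return parser


def parse_fragment(xml_text):
    # Pass the parser explicitly instead of relying on the default parser.
    return etree.fromstring(xml_text, parser=get_parser())


results = []


def worker(i):
    root = parse_fragment("<root><item id='%d'/></root>" % i)
    results.append(root[0].get("id"))


threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))
```

Each thread builds its own XMLParser the first time it parses something, so no parser object is shared between threads.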
xml_str = """
<root>
<a x='123'>aText
<b/>
<c/>
<b/>
</a>hello
<a y='3'>Text
<b/>
<c/>
<b/>
</a>
</root>
"""
root_xml = etree.fromstring(xml_str)  # returns the root node
print(root_xml)
print(type(root_xml))
print(etree.iselement(root_xml))  # check whether it is an Element object
<Element root at 0x54dfa88>
<class 'lxml.etree._Element'>
True
root_xml.tag
'root'
sub_elem = root_xml.find('a')
sub_elem
<Element a at 0x54f00c8>
sub_elem.text
'aText\n '
sub_elem.tail
'hello\n '
sub_elem.attrib
{'x': '123'}
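# pretty_print can only re-indent a tree that has no whitespace-only text nodes; for parsed input, use etree.XMLParser(remove_blank_text=True) to drop them.
parser = etree.XMLParser(remove_blank_text=True)
my_et = etree.ElementTree(element=test, parser=parser)  # as far as I know, the parser is only used when a file is passed as well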
my_et
<lxml.etree._ElementTree at 0x54f0548>
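To see what remove_blank_text changes, compare the same indented input parsed with and without it. This is a small sketch with a made-up input string (xml_text is my own example, not sample.xml).

```python
from lxml import etree

xml_text = "<root>\n  <a><b/></a>\n</root>"  # made-up input that already contains whitespace

# Default parser: the whitespace-only text nodes are kept in the tree.
plain_root = etree.fromstring(xml_text)

# Parser with remove_blank_text=True: whitespace-only text nodes are dropped.
parser = etree.XMLParser(remove_blank_text=True)
clean_root = etree.fromstring(xml_text, parser)

print(repr(plain_root.text))   # the newline and indentation before <a>
print(repr(clean_root.text))   # None

# With the blank text gone, pretty_print is free to re-indent the whole tree.
print(etree.tostring(clean_root, pretty_print=True).decode())
```

A tree built in code with Element and SubElement has no such whitespace nodes, which is why pretty_print already worked on test above without a special parser.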
html = etree.HTML(xml_str)
html
<Element html at 0x5036c08>
etree.dump(html)
<html>
<body><root>
<a x="123">aText
<b/>
<c/>
<b/>
</a>hello
<a y="3">Text
<b/>
<c/>
<b/>
</a>
</root>
</body>
</html>
etree.iselement(html)
True
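Reading the table:
Judging by return type, etree.HTML() and etree.fromstring() both return the same class, Element, which supports XPath. etree.tostring(), by contrast, returns bytes, which cannot be queried with XPath.
Looking at the root node, etree.HTML() parses the input as an HTML document, so the root node is the html tag (this is basic HTML knowledge; look it up if it is unfamiliar).
etree.fromstring(), however, keeps the root node of the original document, so parsing this way does not alter the document's overall structure. I recommend this approach, because it makes it easier to look things up with absolute XPath paths.
etree.tostring() has no root node at all, because its result is of type bytes. It helps to think of tostring as to_bytes.
As for encoding, etree.HTML() and etree.fromstring() accept either a str or bytes. If the content carries an XML encoding declaration, pass it as bytes (for example UTF-8 encoded) rather than as a str. The X in the table stands for the original document content as returned by read().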
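The points above can be checked with a short script. It is a sketch on a made-up document (doc and declared are my own strings); the ValueError branch reflects the behaviour of the lxml versions I know, so treat it as an assumption.

```python
from lxml import etree

doc = "<root><a x='1'>text</a></root>"  # made-up document

xml_root = etree.fromstring(doc)   # root stays <root>
html_root = etree.HTML(doc)        # content is wrapped in <html><body>
raw = etree.tostring(xml_root)     # bytes, not an Element

print(xml_root.tag, html_root.tag, type(raw).__name__)

# Absolute XPath differs because the root differs.
print(xml_root.xpath('/root/a/@x'))
print(html_root.xpath('/html/body/root/a/@x'))

# A str that carries an encoding declaration is expected to be rejected;
# the same content passed as bytes parses normally.
declared = '<?xml version="1.0" encoding="utf-8"?><root/>'
try:
    etree.fromstring(declared)
    print('str with declaration accepted')
except ValueError as exc:
    print('str with declaration rejected:', exc)

bytes_root = etree.fromstring(declared.encode('utf-8'))
print(bytes_root.tag)
```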
XML(text, parser=None, base_url=None)
Parses an XML document or fragment from a string constant. Returns the root node (or the result returned by a parser target). This function can be used to embed "XML literals" in Python code.
To override the parser with a different XMLParser you can pass it to the parser keyword argument.
The base_url keyword argument allows to set the original base URL of the document to support relative Paths when looking up external entities (DTD, XInclude, …).
xml_test = etree.XML("<root><test/></root>")
xml_test
<Element root at 0x54f0b08>
etree.dump(xml_test)
<root>
<test/>
</root>
etree.iselement(xml_test)
True
parse(source, parser=None, base_url=None)
Return an ElementTree object loaded with source elements. If no parser is provided as second argument, the default parser is used.
The source can be any of the following:
a file name/path
a file object
a file-like object
a URL using the HTTP or FTP protocol
To parse from a string, use the fromstring() function instead.
Note that it is generally faster to parse from a file path or URL than from an open file object or file-like object. Transparent decompression from gzip compressed sources is supported (unless explicitly disabled in libxml2).
The base_url keyword allows setting a URL for the document when parsing from a file-like object. This is needed when looking up external entities (DTD, XInclude, …) with relative paths.
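As a sketch of the file-like case, the snippet below parses from an in-memory io.BytesIO object and passes base_url. The data and the URL http://example.com/config.xml are placeholders I made up.

```python
import io
from lxml import etree

data = b"<config><item name='a'/><item name='b'/></config>"  # made-up content

# parse() accepts a file-like object; base_url gives the document a URL,
# which matters when relative external references have to be resolved.
tree = etree.parse(io.BytesIO(data), base_url="http://example.com/config.xml")
root = tree.getroot()

print(type(tree).__name__)   # the tree wrapper, not an Element
print(etree.iselement(tree), etree.iselement(root))
print(root.tag, len(root))
print(root.base)             # base URI known for the root element
```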
test_parse = etree.parse('./sample.xml')  # returns an ElementTree object
print(test_parse)
print(etree.iselement(test_parse))  # check whether it is an Element object
<lxml.etree._ElementTree object at 0x00000000054F0C88>
False
strip_attributes(tree_or_element, *attribute_names)
Delete all attributes with the provided attribute names from an Element (or ElementTree) and its descendants.
Attribute names can contain wildcards as in _Element.iter.
Example usage:
strip_attributes(root_element,
    'simpleattr',
    '{http://some/ns}attrname',
    '{http://other/ns}*')
root_elem = test_parse.getroot()
print(root_elem)
print(etree.iselement(root_elem))  # check whether it is an Element object
<Element TradingAccounts at 0x54fc1c8>
True
etree.dump(root_elem)
<TradingAccounts>
<Constants ProjectName="DOTA" path="/home/DOTA/Trade" cpu="10"/>
<Strategies>
<Strategy name="CTA01" trade="true" commission="flase"/>
<Strategy name="CTA02" trade="true" commission="flase"/>
<Strategy name="ALPHA"/>
</Strategies>
<Accounts>
<Account name="RB" max="25" diff="0.01" ip="192.168.1.1" path="/home/RB">
<Strategy name="CTA01" num="3" prior="1" id="997"/>first strategy
<Strategy name="CTA02" num="10" prior="2" id="998"/>
</Account>
<Account name="i" max="15" diff="0.02" ip="192.168.1.1" path="/home/i">
<Strategy name="CTA01" num="2" prior="1" id="999">this is text
<Type id="10" name="FOF"/>
same text
</Strategy>
<Strategy name="CTA02" num="5" prior="2" id="1000"/>
</Account>
<Account name="IC" max="3" diff="0.02" ip="192.168.1.2" path="/home/IC">
<Strategy name="CTA01" num="5" prior="1" id="1001">
<Commission id="20" rate="0.01"/>
<Slip param="1"/>
</Strategy>
<Strategy name="CTA02" num="6" prior="2" id="1002"/>last strategy
</Account>
</Accounts>
</TradingAccounts>
etree.strip_attributes(root_elem, 'commission', 'name')
etree.dump(root_elem)
<TradingAccounts>
<Constants ProjectName="DOTA" path="/home/DOTA/Trade" cpu="10"/>
<Strategies>
<Strategy trade="true"/>
<Strategy trade="true"/>
<Strategy/>
</Strategies>
<Accounts>
<Account max="25" diff="0.01" ip="192.168.1.1" path="/home/RB">
<Strategy num="3" prior="1" id="997"/>first strategy
<Strategy num="10" prior="2" id="998"/>
</Account>
<Account max="15" diff="0.02" ip="192.168.1.1" path="/home/i">
<Strategy num="2" prior="1" id="999">this is text
<Type id="10"/>
same text
</Strategy>
<Strategy num="5" prior="2" id="1000"/>
</Account>
<Account max="3" diff="0.02" ip="192.168.1.2" path="/home/IC">
<Strategy num="5" prior="1" id="1001">
<Commission id="20" rate="0.01"/>
<Slip param="1"/>
</Strategy>
<Strategy num="6" prior="2" id="1002"/>last strategy
</Account>
</Accounts>
</TradingAccounts>
# no error is raised if the attribute name to delete is not found
etree.strip_attributes(root_elem, 'xxyyzz')
etree.dump(root_elem)
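The wildcard form from the usage example can be tried on a small namespaced document. This is a sketch; the namespace http://example.com/meta and the attribute names are invented for illustration.

```python
from lxml import etree

root = etree.fromstring(
    '<root xmlns:m="http://example.com/meta" id="1" m:author="x" m:rev="2">'
    '<child id="2" m:rev="3" keep="yes"/>'
    '</root>'
)

# Remove the plain "id" attribute and every attribute in the meta namespace,
# on the element itself and on all of its descendants.
etree.strip_attributes(root, 'id', '{http://example.com/meta}*')

print(dict(root.attrib))
print(dict(root[0].attrib))
print(etree.tostring(root).decode())
```

Only the attributes are removed; the xmlns:m declaration itself stays on the root element.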
strip_elements(tree_or_element, with_tail=True, *tag_names)
Delete all elements with the provided tag names from a tree or subtree. This will remove the elements and their entire subtree, including all their attributes, text content and descendants. It will also remove the tail text of the element unless you explicitly set the with_tail keyword argument option to False.
Tag names can contain wildcards as in _Element.iter.
Note that this will not delete the element (or ElementTree root element) that you passed even if it matches. It will only treat its descendants. If you want to include the root element, check its tag name directly before even calling this function.
Example usage:
strip_elements(some_element,
    'simpletagname',             # non-namespaced tag
    '{http://some/ns}tagname',   # namespaced tag
    '{http://some/other/ns}*',   # any tag from a namespace
    lxml.etree.Comment           # comments
    )
root_elem1 = test_parse.getroot()
etree.strip_elements(root_elem1, 'Strategy')
etree.dump(root_elem1)
<TradingAccounts>
<Constants ProjectName="DOTA" path="/home/DOTA/Trade" cpu="10"/>
<Strategies>
</Strategies>
<Accounts>
<Account max="25" diff="0.01" ip="192.168.1.1" path="/home/RB">
</Account>
<Account max="15" diff="0.02" ip="192.168.1.1" path="/home/i">
</Account>
<Account max="3" diff="0.02" ip="192.168.1.2" path="/home/IC">
</Account>
</Accounts>
</TradingAccounts>
# no error is raised if the given tag does not exist either
etree.strip_elements(root_elem1, 'hahaha')
etree.dump(root_elem1)
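The with_tail option is easiest to see on mixed content. A small sketch with a made-up paragraph (xml_text is my own example):

```python
from lxml import etree

xml_text = "<p>keep <b>bold</b> tail one <i>it</i> tail two</p>"  # made-up input

# Default (with_tail=True): <b>, its content and its tail text all disappear.
tail_removed = etree.fromstring(xml_text)
etree.strip_elements(tail_removed, 'b')
print(etree.tostring(tail_removed))

# with_tail=False: <b> and its content disappear, the tail text stays in <p>.
tail_kept = etree.fromstring(xml_text)
etree.strip_elements(tail_kept, 'b', with_tail=False)
print(etree.tostring(tail_kept))
```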
strip_tags(tree_or_element, *tag_names)
Delete all elements with the provided tag names from a tree or subtree. This will remove the elements and their attributes, but not their text/tail content or descendants. Instead, it will merge the text content and children of the element into its parent.
Tag names can contain wildcards as in _Element.iter.
Note that this will not delete the element (or ElementTree root element) that you passed even if it matches. It will only treat its descendants.
Example usage:
strip_tags(some_element,
    'simpletagname',             # non-namespaced tag
    '{http://some/ns}tagname',   # namespaced tag
    '{http://some/other/ns}*',   # any tag from a namespace
    Comment                      # comments (including their text!)
    )
root_elem2 = test_parse.getroot()
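etree.strip_tags(root_elem2, 'Strategy')  # the Strategy elements were already removed from this tree by strip_elements earlier, so nothing is left to unwrap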
etree.dump(root_elem2)
<TradingAccounts>
<Constants ProjectName="DOTA" path="/home/DOTA/Trade" cpu="10"/>
<Strategies>
</Strategies>
<Accounts>
<Account max="25" diff="0.01" ip="192.168.1.1" path="/home/RB">
</Account>
<Account max="15" diff="0.02" ip="192.168.1.1" path="/home/i">
</Account>
<Account max="3" diff="0.02" ip="192.168.1.2" path="/home/IC">
</Account>
</Accounts>
</TradingAccounts>
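Because the shared tree no longer contains any Strategy elements at this point, the difference between strip_tags and strip_elements is clearer on a fresh document. The following sketch uses a made-up fragment (xml_text is my own example):

```python
from lxml import etree

xml_text = "<p>Hello <b>bold <i>nested</i></b> world</p>"  # made-up input

# strip_tags: the <b> tag is removed, but its text, children and tail
# are merged into the parent <p>.
unwrapped = etree.fromstring(xml_text)
etree.strip_tags(unwrapped, 'b')
print(etree.tostring(unwrapped))

# strip_elements: <b> is removed together with its whole subtree and its tail.
removed = etree.fromstring(xml_text)
etree.strip_elements(removed, 'b')
print(etree.tostring(removed))
```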
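Element is the core class for XML processing. An Element object can be thought of as an XML node, and most XML node handling revolves around this class. This part covers three topics: operating on nodes, operating on node attributes, and operating on the text inside nodes. The sections below walk through them by creating, deleting, modifying and querying XML.
Attributes: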
attrib
Element attribute dictionary. Where possible, use get(), set(), keys(), values() and items() to access element attributes.
base
The base URI of the Element (xml:base or HTML base URL). None if the base URI is unknown.
nsmap
Namespace prefix->URI mapping known in the context of this Element. This includes all namespace declarations of the parents.
prefix
Namespace prefix or None.
sourceline
Original line number as found by the parser or None if unknown.
tag
Element tag
tail
Text after this element’s end tag, but before the next sibling element’s start tag. This is either a string or the value None, if there was no text.
text
Text before the first subelement. This is either a string or the value None, if there was no text.
Methods:
__contains__(self, element)
__copy__(self)
__deepcopy__(self, memo)
__delitem__(self, x)
Deletes the given subelement or a slice.
__getitem__(...)
Returns the subelement at the given position or the requested slice.
__iter__(self)
__len__(self)
Returns the number of subelements.
__new__(T, S, ...)
__nonzero__(x)
x != 0
__repr__(self)
repr(x)
__reversed__(self)
__setitem__(self, x, value)
Replaces the given subelement index or slice.
_init(self)
Called after object initialisation. Custom subclasses may override this if they recursively call _init() in the superclasses.
addnext(self, element)
Adds the element as a following sibling directly after this element.
addprevious(self, element)
Adds the element as a preceding sibling directly before this element.
append(self, element)
Adds a subelement to the end of this element.
clear(self, keep_tail=False)
Resets an element. This function removes all subelements, clears all attributes and sets the text and tail properties to None.
cssselect(…)
Run the CSS expression on this element and its children, returning a list of the results.
extend(self, elements)
Extends the current children by the elements in the iterable.
find(self, path, namespaces=None)
Finds the first matching subelement, by tag name or path.
findall(self, path, namespaces=None)
Finds all matching subelements, by tag name or path.
findtext(self, path, default=None, namespaces=None)
Finds text for the first matching subelement, by tag name or path.
get(self, key, default=None)
Gets an element attribute.
getchildren(self)
Returns all direct children. The elements are returned in document order.
getiterator(self, tag=None, *tags)
Returns a sequence or iterator of all elements in the subtree in document order (depth first pre-order), starting with this element.
getnext(self)
Returns the following sibling of this element or None.
getparent(self)
Returns the parent of this element or None for the root element.
getprevious(self)
Returns the preceding sibling of this element or None.
getroottree(self)
Return an ElementTree for the root node of the document that contains this element.
index(self, child, start=None, stop=None)
Find the position of the child within the parent.
insert(self, index, element)
Inserts a subelement at the given position in this element
items(self)
Gets element attributes, as a sequence. The attributes are returned in an arbitrary order.
iter(self, tag=None, *tags)
Iterate over all elements in the subtree in document order (depth first pre-order), starting with this element.
iterancestors(self, tag=None, *tags)
Iterate over the ancestors of this element (from parent to parent).
iterchildren(self, tag=None, reversed=False, *tags)
Iterate over the children of this element.
iterdescendants(self, tag=None, *tags)
Iterate over the descendants of this element in document order.
iterfind(self, path, namespaces=None)
Iterates over all matching subelements, by tag name or path.
itersiblings(self, tag=None, preceding=False, *tags)
Iterate over the following or preceding siblings of this element.
itertext(self, tag=None, with_tail=True, *tags)
Iterates over the text content of a subtree.
keys(self)
Gets a list of attribute names. The names are returned in an arbitrary order (just like for an ordinary Python dictionary).
makeelement(self, _tag, attrib=None, nsmap=None, **_extra)
Creates a new element associated with the same document.
remove(self, element)
Removes a matching subelement. Unlike the find methods, this method compares elements based on identity, not on tag value or contents.
replace(self, old_element, new_element)
Replaces a subelement with the element passed as second argument.
set(self, key, value)
Sets an element attribute.
values(self)
Gets element attribute values as a sequence of strings. The attributes are returned in an arbitrary order.
xpath(self, _path, namespaces=None, extensions=None, smart_strings=True, **_variables)
Evaluate an xpath expression using the element as context node.
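A few of the lookup methods from this list can be tried on a tiny document. This is only a sketch; the shop/item document is made up for illustration.

```python
from lxml import etree

root = etree.fromstring(
    "<shop><item id='1'><name>pen</name></item>"
    "<item id='2'><name>ink</name></item><note>open</note></shop>"
)

first = root.find('item')                             # first matching child
ids = [e.get('id') for e in root.findall('item')]     # all matching children
names = [e.text for e in root.iter('name')]           # whole subtree, filtered by tag
note = root.findtext('note')                          # text of the first match
missing = root.findtext('absent', default='n/a')      # default when nothing matches
all_text = ''.join(root.itertext())                   # every text node in document order
next_tag = first.getnext().tag                        # following sibling of the first item

print(ids, names, note, missing, all_text, next_tag)
```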
The parse(source, parser=None, base_url=None) function introduced above returns an ElementTree object. ElementTree objects share many methods with Element objects.
Details are as follows:
ElementTree object methods:
find(self, path, namespaces=None)
Finds the first toplevel element with given tag. Same as tree.getroot().find(path).
findall(self, path, namespaces=None)
Finds all elements matching the ElementPath expression. Same as getroot().findall(path).
findtext(self, path, default=None, namespaces=None)
Finds the text for the first element matching the ElementPath expression. Same as getroot().findtext(path)
getelementpath(self, element)
Returns a structural, absolute ElementPath expression to find the element. This path can be used in the .find() method to look up the element, provided that the elements along the path and their list of immediate children were not modified in between.
getiterator(self, tag=None, *tags)
Returns a sequence or iterator of all elements in document order (depth first pre-order), starting with the root element.
getpath(self, element)
Returns a structural, absolute XPath expression to find the element.
getroot(self)
Gets the root element for this tree.
iter(self, tag=None, *tags)
Creates an iterator for the root element. The iterator loops over all elements in this tree, in document order. Note that siblings of the root element (comments or processing instructions) are not returned by the iterator.
iterfind(self, path, namespaces=None)
Iterates over all elements matching the ElementPath expression. Same as getroot().iterfind(path).
parse(self, source, parser=None, base_url=None)
Updates self with the content of source and returns its root.
relaxng(self, relaxng)
Validate this document using other document.
write(self, file, encoding=None, method="xml", pretty_print=False, xml_declaration=None, with_tail=True, standalone=None, doctype=None, compression=0, exclusive=False, inclusive_ns_prefixes=None, with_comments=True, strip_text=False)
Write the tree to a filename, file or file-like object.
This method is specific to ElementTree: it writes the ElementTree to a file, a file-like object, or a URL (via FTP PUT or HTTP POST). Its optional parameters are similar to those of etree.tostring(), with some differences.
write_c14n(self, file, exclusive=False, with_comments=True, compression=0, inclusive_ns_prefixes=None)
C14N write of document. Always writes UTF-8.
xinclude(self)
Process the XInclude nodes in this document and include the referenced XML fragments.
xmlschema(self, xmlschema)
Validate this document using other document.
xpath(self, _path, namespaces=None, extensions=None, smart_strings=True, **_variables)
XPath evaluate in context of document.
xslt(self, _xslt, extensions=None, access_control=None, **_kw)
Transform this document using other document.
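getpath() and getelementpath() are easy to confuse, so here is a small round-trip sketch on a made-up tree: the first returns an XPath expression for xpath(), the second an ElementPath expression for find().

```python
from lxml import etree

tree = etree.ElementTree(
    etree.fromstring("<root><a><b/></a><a><b/><b/></a></root>")  # made-up document
)
target = tree.getroot()[1][1]  # second <b> inside the second <a>

xpath_expr = tree.getpath(target)              # absolute XPath expression
elementpath_expr = tree.getelementpath(target) # ElementPath expression for find()
print(xpath_expr)
print(elementpath_expr)

# Both expressions lead back to the same element object.
found_by_xpath = tree.xpath(xpath_expr)[0]
found_by_find = tree.find(elementpath_expr)
print(found_by_xpath is target, found_by_find is target)
```

As the description of getelementpath() says, the paths stay valid only as long as the elements along the way are not modified.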
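The code below combines the functions and classes introduced above to show them working together.
Create an element with the Element function; the argument is the node name.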
from __future__ import print_function
from lxml import etree
root = etree.Element('root')  # create an Element object with the Element function; the Element class's methods and attributes can then be used to add, delete, modify and query it
root
<Element root at 0x55a3a08>
Use the tag attribute to get the node's name.
root.tag
'root'
Use the SubElement function to create a child node. The first argument is the parent node (an Element object) and the second is the child node's name.
child1 = etree.SubElement(root, 'child1')
child2 = etree.SubElement(root, 'child2')
root.extend([etree.Element('child3'), etree.Element('child4')])
etree.dump(root)
<root>
<child1/>
<child2/>
<child3/>
<child4/>
</root>
print(root.getparent())
None
child1.getparent()
<Element root at 0x55a3a08>
root.index(child2)
1
all_direct_children = root.getchildren()
print(all_direct_children)
print(type(all_direct_children))
[<Element child1 at 0x55a36c8>, <Element child2 at 0x558ac48>, <Element child3 at 0x558a1c8>, <Element child4 at 0x558ab08>]
<class 'list'>
# index access
child = root[0]  # the same element as root.find('child1')
child.tag
'child1'
root.insert(0, etree.Element('child0', attrib={'name': 'ch1'}))  # insert at position 0 among root's direct children
child0 = root[0]
child0.insert(0, etree.Element('grandson0', attrib={'name': 'gson', 'age': '3', 'type': 'insert'}))
etree.dump(root)
<root>
<child0 name="ch1">
<grandson0 age="3" name="gson" type="insert"/>
</child0>
<child1/>
<child2/>
<child3/>
<child4/>
</root>
root.append(etree.Element('append_child', attrib={'id': '1'}))  # append at the end
root.append(etree.Element('append_child', attrib={'id': '2'}))  # append at the end
child0 = root[0]
child0.append(etree.Element('append_grandson', attrib={'name': 'gson', 'age': '5', 'type': 'append'}))
etree.dump(root)
<root>
<child0 name="ch1">
<grandson0 age="3" name="gson" type="insert"/>
<append_grandson age="5" name="gson" type="append"/>
</child0>
<child1/>
<child2/>
<child3/>
<child4/>
<append_child id="1"/>
<append_child id="2"/>
</root>
add_elem = root.find('child4')  # or add_elem = root[4]
add_elem.addnext(etree.Element('add_cute_child', attrib={'name': 'add', 'kind': 'cute'}))
etree.dump(root)
<root>
<child0 name="ch1">
<grandson0 age="3" name="gson" type="insert"/>
<append_grandson age="5" name="gson" type="append"/>
</child0>
<child1/>
<child2/>
<child3/>
<child4/>
<add_cute_child kind="cute" name="add"/>
<append_child id="1"/>
<append_child id="2"/>
</root>
add_sibling = root.find('child4')
add_sibling.addprevious(etree.Element('add_preceding_sibling', attrib={'name': 'add', 'kind': 'sibling', 'site': 'preceding'}))
etree.dump(root)
<root>
<child0 name="ch1">
<grandson0 age="3" name="gson" type="insert"/>
<append_grandson age="5" name="gson" type="append"/>
</child0>
<child1/>
<child2/>
<child3/>
<add_preceding_sibling kind="sibling" name="add" site="preceding"/>
<child4/>
<add_cute_child kind="cute" name="add"/>
<append_child id="1"/>
<append_child id="2"/>
</root>
Getting element attributes
get(self, key, default=None)
Gets an element attribute.
Note: this is covered further in section 3.2.
r1 = root.find('add_preceding_sibling').get('kind')  # get the value of the kind attribute of the add_preceding_sibling element
r1
'sibling'
root.find('child0')
<Element child0 at 0x504d108>
root.find('child0').find('grandson0')
<Element grandson0 at 0x55a3788>
root.find('child0/grandson0')
<Element grandson0 at 0x55a3788>
root.findall('child0')
[<Element child0 at 0x504d108>]
root.findall('append_child')
[<Element append_child at 0x55a1e48>, <Element append_child at 0x55a17c8>]
child1 = root[1]
print('child1 = ', child1)
print(child1.getprevious())
child1 = <Element child1 at 0x55a36c8>
<Element child0 at 0x504d108>
child1.getnext()
<Element child2 at 0x558ac48>
child1.getparent().tag
'root'
root.getchildren()
[<Element child0 at 0x504d108>,
<Element child1 at 0x55a36c8>,
<Element child2 at 0x558ac48>,
<Element child3 at 0x558a1c8>,
<Element add_preceding_sibling at 0x558a188>,
<Element child4 at 0x558ab08>,
<Element add_cute_child at 0x55a16c8>,
<Element append_child at 0x55a1e48>,
<Element append_child at 0x55a17c8>]
root_iterator = root.getiterator()
root_iterator
<lxml.etree.ElementDepthFirstIterator at 0x5574dc8>
for i in root_iterator:
    print(i)
<Element root at 0x55a3a08>
<Element child0 at 0x504d108>
<Element grandson0 at 0x55a3788>
<Element append_grandson at 0x54e9608>
<Element child1 at 0x55a36c8>
<Element child2 at 0x558ac48>
<Element child3 at 0x558a1c8>
<Element add_preceding_sibling at 0x558a188>
<Element child4 at 0x558ab08>
<Element add_cute_child at 0x55a16c8>
<Element append_child at 0x55a1e48>
<Element append_child at 0x55a17c8>
root_iter = root.iter()
root_iter
<lxml.etree.ElementDepthFirstIterator at 0x55a5558>
for i in root_iter:
    print(i)
<Element root at 0x55a3a08>
<Element child0 at 0x504d108>
<Element grandson0 at 0x55a3788>
<Element append_grandson at 0x55a1808>
<Element child1 at 0x55a36c8>
<Element child2 at 0x558ac48>
<Element child3 at 0x558a1c8>
<Element add_preceding_sibling at 0x558a188>
<Element child4 at 0x558ab08>
<Element add_cute_child at 0x55a16c8>
<Element append_child at 0x55a1e48>
<Element append_child at 0x55a17c8>
iterancestors = root.find('child0').find('grandson0').iterancestors()  # iterator over all ancestors of grandson0
print(type(iterancestors))
iterancestors
<class 'lxml.etree.AncestorsIterator'>
<lxml.etree.AncestorsIterator at 0x55a5828>
for i in iterancestors:
    print(i)
<Element child0 at 0x504d108>
<Element root at 0x55a3a08>
iterchildren = root.find('child0').iterchildren()  # iterator over all direct children of child0
iterchildren
<lxml.etree.ElementChildIterator at 0x55a59d8>
for i in iterchildren:
    print(i)
<Element grandson0 at 0x55a3788>
<Element append_grandson at 0x5593cc8>
iterdescendants = root.iterdescendants()  # iterator over all descendants of this element, in document order
iterdescendants
<lxml.etree.ElementDepthFirstIterator at 0x55a5c60>
for i in iterdescendants:
    print(i)
<Element child0 at 0x504d108>
<Element grandson0 at 0x55a3788>
<Element append_grandson at 0x5593cc8>
<Element child1 at 0x55a36c8>
<Element child2 at 0x558ac48>
<Element child3 at 0x558a1c8>
<Element add_preceding_sibling at 0x558a188>
<Element child4 at 0x558ab08>
<Element add_cute_child at 0x55a16c8>
<Element append_child at 0x55a1e48>
<Element append_child at 0x55a17c8>
itersiblings = root.find('child0').find('grandson0').itersiblings()
itersiblings
<lxml.etree.SiblingsIterator at 0x55a5d38>
for i in itersiblings:
    print(i)
<Element append_grandson at 0x559dfc8>
iterfind = root.iterfind('child0/')
child0_child = [i for i in iterfind]
child0_child
[<Element grandson0 at 0x516a7c8>, <Element append_grandson at 0x4eb1f48>]
root.getroottree()
<lxml.etree._ElementTree at 0x55a7fc8>
len(root)  # number of child nodes
9
root.index(child2)  # get the index
2
for child in root:  # iterate over the children
    print(child.tag)
child0
child1
child2
child3
add_preceding_sibling
child4
add_cute_child
append_child
append_child
start = root[1:]  # slicing
start[0].tag
'child1'
end = root[-1:]
end[0].tag
'append_child'
root.replace(root.find('child2'), etree.Element('replace_child2', attrib={'type': 'replace'}))
etree.dump(root)
<root>
<child0 name="ch1">
<grandson0 age="3" name="gson" type="insert"/>
<append_grandson age="5" name="gson" type="append"/>
</child0>
<child1/>
<replace_child2 type="replace"/>
<child3/>
<add_preceding_sibling kind="sibling" name="add" site="preceding"/>
<child4/>
<add_cute_child kind="cute" name="add"/>
<append_child id="1"/>
<append_child id="2"/>
</root>
root.remove(child1)  # remove the given child node
etree.dump(root)
<root>
<child0 name="ch1">
<grandson0 age="3" name="gson" type="insert"/>
<append_grandson age="5" name="gson" type="append"/>
</child0>
<replace_child2 type="replace"/>
<child3/>
<add_preceding_sibling kind="sibling" name="add" site="preceding"/>
<child4/>
<add_cute_child kind="cute" name="add"/>
<append_child id="1"/>
<append_child id="2"/>
</root>
root.clear()  # remove all child nodes, and also clear the attributes, text and tail
etree.dump(root)
<root/>
Attributes are stored as key-value pairs, just like a dictionary.
root = etree.Element('root', interesting='totally')
etree.dump(root)
<root interesting="totally"/>
root.set('hello', 'Huhu')
etree.dump(root)
<root interesting="totally" hello="Huhu"/>
root.items()
[('interesting', 'totally'), ('hello', 'Huhu')]
xxx = root.makeelement('make_element', attrib={'att': 'make'})
xxx
<Element make_element at 0x559d5c8>
etree.dump(xxx)
<make_element att="make"/>
etree.dump(root)
<root interesting="totally" hello="Huhu"/>
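# the get method returns the value of a single attribute
root.get('interesting')
'totally'
root.get('xyz', default='123')
'123'
If the requested attribute does not exist, no error is raised and None is returned, just as dict.get does for a missing key.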
root.get('xyz')
my_dic = {'a': 1, 'b': 2}
my_dic.get('xxx')
The tag of any node, root or not, can be reassigned. Assigning to tag renames the element in place and does not add a new node.
root.tag = 'rootxuy'
etree.dump(root)
<rootxuy interesting="totally" hello="Huhu"/>
child = etree.SubElement(root, 'child', attrib={"a": '123'})
child.tag = 'great_child'
etree.dump(root)
<rootxuy interesting="totally" hello="Huhu">
<great_child a="123"/>
</rootxuy>
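What assigning to tag does to a child can be checked in isolation on a fresh tree. This is a minimal sketch that uses a new root so that earlier state does not interfere.

```python
from lxml import etree

root = etree.Element('root')
child = etree.SubElement(root, 'child', attrib={'a': '123'})

child.tag = 'great_child'  # rename the existing child element

# Still exactly one child, and it is the very same object, now with a new tag.
print(len(root))
print(root[0] is child)
print(etree.tostring(root))
```

If a dump ever shows both the old and the new tag, the usual cause is that SubElement was called more than once, for example by re-running a notebook cell.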
root.tag
'rootxuy'
sorted(root.keys())
['hello', 'interesting']
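# the items method returns all key-value pairs
for name, value in sorted(root.items()):
    print('%s = %r' % (name, value))
hello = 'Huhu'
interesting = 'totally'
The attrib attribute also gives dictionary-style access to all attributes and their values at once.
attributes = root.attrib
attributes
{'hello': 'Huhu', 'interesting': 'totally'}
attributes['good'] = 'Bye'  # changes made through the dictionary affect the node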
root.get('good')
'Bye'
root.values()
['totally', 'Huhu', 'Bye']
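Since attrib is a live view of the element rather than a copy, it helps to know how to take an independent snapshot. A small sketch, where the names live and snapshot are mine:

```python
from lxml import etree

root = etree.Element('root', interesting='totally')

live = root.attrib            # live view: reflects and causes changes on the element
snapshot = dict(root.attrib)  # plain dict: an independent copy

root.set('hello', 'Huhu')     # change through the element
live['good'] = 'Bye'          # change through the live view
del live['interesting']       # deletion through the live view

print(dict(live))
print(snapshot)
print(root.get('good'), root.get('interesting'))
```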
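That covers tags and their attributes; what remains is the text inside tags. Text content can be accessed through the text and tail attributes or through XPath.
In the common case, use an Element's text attribute to access a tag's text.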
text
Text before the first subelement. This is either a string or the value None, if there was no text.
tail
Text after this element’s end tag, but before the next sibling element’s start tag. This is either a string or the value None, if there was no text.
root = etree.parse('./sample.xml')
xml_root = root.getroot()
etree.dump(xml_root)
<TradingAccounts>
<Constants ProjectName="DOTA" path="/home/DOTA/Trade" cpu="10"/>
<Strategies>
<Strategy name="CTA01" trade="true" commission="flase"/>
<Strategy name="CTA02" trade="true" commission="flase"/>
<Strategy name="ALPHA"/>
</Strategies>
<Accounts>
<Account name="RB" max="25" diff="0.01" ip="192.168.1.1" path="/home/RB">
<Strategy name="CTA01" num="3" prior="1" id="997"/>first strategy
<Strategy name="CTA02" num="10" prior="2" id="998"/>
</Account>
<Account name="i" max="15" diff="0.02" ip="192.168.1.1" path="/home/i">
<Strategy name="CTA01" num="2" prior="1" id="999">this is text
<Type id="10" name="FOF"/>
same text
</Strategy>
<Strategy name="CTA02" num="5" prior="2" id="1000"/>
</Account>
<Account name="IC" max="3" diff="0.02" ip="192.168.1.2" path="/home/IC">
<Strategy name="CTA01" num="5" prior="1" id="1001">
<Commission id="20" rate="0.01"/>
<Slip param="1"/>
</Strategy>
<Strategy name="CTA02" num="6" prior="2" id="1002"/>last strategy
</Account>
</Accounts>
</TradingAccounts>
xml_root.text = 'Hello, World!\n'
xml_root.find('Constants').text = 'this is Constants'
xml_root.text
'Hello, World!\n'
etree.dump(xml_root)
<TradingAccounts>Hello, World!
<Constants ProjectName="DOTA" path="/home/DOTA/Trade" cpu="10">this is Constants</Constants>
<Strategies>
<Strategy name="CTA01" trade="true" commission="flase"/>
<Strategy name="CTA02" trade="true" commission="flase"/>
<Strategy name="ALPHA"/>
</Strategies>
<Accounts>
<Account name="RB" max="25" diff="0.01" ip="192.168.1.1" path="/home/RB">
<Strategy name="CTA01" num="3" prior="1" id="997"/>first strategy
<Strategy name="CTA02" num="10" prior="2" id="998"/>
</Account>
<Account name="i" max="15" diff="0.02" ip="192.168.1.1" path="/home/i">
<Strategy name="CTA01" num="2" prior="1" id="999">this is text
<Type id="10" name="FOF"/>
same text
</Strategy>
<Strategy name="CTA02" num="5" prior="2" id="1000"/>
</Account>
<Account name="IC" max="3" diff="0.02" ip="192.168.1.2" path="/home/IC">
<Strategy name="CTA01" num="5" prior="1" id="1001">
<Commission id="20" rate="0.01"/>
<Slip param="1"/>
</Strategy>
<Strategy name="CTA02" num="6" prior="2" id="1002"/>last strategy
</Account>
</Accounts>
</TradingAccounts>
itertext = xml_root.itertext()
itertext
<lxml.etree.ElementTextIterator at 0x559af60>
for i in itertext:
    if str.strip(i):
        print('str.strip(i) = ', str.strip(i), '---------->', len(str.strip(i)))
str.strip(i) = Hello, World! ----------> 13
str.strip(i) = this is Constants ----------> 17
str.strip(i) = first strategy ----------> 14
str.strip(i) = this is text ----------> 12
str.strip(i) = same text ----------> 9
str.strip(i) = last strategy ----------> 13
text = xml_root.findtext('Accounts')  # text of the first element matching Accounts
print('text = ', text)
print('len(text)= ', len(text))
print(type(text))
text =
len(text)= 9
<class 'str'>
text = xml_root.findtext('Accounts/Account/Strategy')
print('text = ', text)
print('len(text)= ', len(text))
print(type(text))
text =
len(text)= 0
<class 'str'>
text = xml_root.findtext('Constants')
print('text = ', text)
print('len(text)= ', len(text))
print(type(text))
text = this is Constants
len(text)= 17
<class 'str'>
print(xml_root.xpath('Accounts/Account/Strategy//text()'))
['this is text\n ', '\n same text\n ', '\n ', '\n ', '\n ']
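The text() result above is mostly indentation. Two common ways to clean it up are filtering in Python or letting XPath's normalize-space() collapse the whitespace. A sketch on a made-up fragment, not on sample.xml:

```python
from lxml import etree

root = etree.fromstring(
    "<r><s>this is text\n  <t/>\n  same text\n</s><s>\n</s></r>"  # made-up input
)

# Option 1: collect the text nodes, then drop the whitespace-only ones in Python.
texts = [t.strip() for t in root.xpath('s//text()') if t.strip()]
print(texts)

# Option 2: let XPath collapse whitespace in the string value of one element.
collapsed = root.xpath('normalize-space(s[1])')
print(collapsed)
```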
<html><body>Text<br/>Tail</body></html>
html = etree.Element('html')
body = etree.SubElement(html, 'body')
body.text = 'Text'
etree.dump(html)
<html>
<body>Text</body>
</html>
br = etree.SubElement(body, 'br')
etree.dump(html)
<html>
<body>Text<br/></body>
</html>
# tail is the text that directly follows this tag
br.tail = 'Tail'
etree.dump(br)
<br/>Tail
etree.tostring(html)
b'<html><body>Text<br/>Tail</body></html>'
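# passing method='text' to tostring drops the tags and outputs all the text
etree.tostring(html, method='text')  # method defaults to 'xml'
b'TextTail'
# Option 1: drop the tags and return the text as one string
html.xpath('string()')
'TextTail'
# Option 2: return a list, split at the tags
html.xpath('//text()')
['Text', 'Tail']
# Each item in the list from option 2 also records which node it belongs to and what kind of text it is: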
texts = html.xpath('//text()')
texts[0]
'Text'
type(texts[0])
lxml.etree._ElementUnicodeResult
etree.iselement(texts[0])  # check whether it is an Element object
False
# owning node
parent = texts[0].getparent()
parent.tag
'body'
print(texts[1], texts[1].getparent().tag)
Tail br
# kind of text: regular text or tail text
print(texts[0].is_text)
True
print(texts[1].is_text)
False
print(texts[1].is_tail)
True
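These extra attributes come from lxml's "smart strings". When only the plain text is needed, the smart_strings parameter listed in the xpath() signature can turn them off. A short sketch:

```python
from lxml import etree

html = etree.fromstring('<html><body>Text<br/>Tail</body></html>')

smart = html.xpath('//text()')                        # smart strings (default)
plain = html.xpath('//text()', smart_strings=False)   # plain Python strings

print(smart[1].is_tail, smart[1].getparent().tag)
print(type(plain[0]) is str, hasattr(plain[0], 'getparent'))
```

Plain strings hold no reference back to the tree, which can matter when many results are kept around for a long time.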
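This part describes how to parse XML into Element objects and how to output Element objects as XML.
The three functions commonly used for parsing are fromstring, XML and HTML. All of them take a string as their argument.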
xml_data = '<root>data</root>'
root1 = etree.fromstring(xml_data)
root1.tag
'root'
etree.tostring(root1)
b'<root>data</root>'
root2 = etree.XML(xml_data)
print(root2.tag)
root
print(etree.tostring(root2))
b'<root>data</root>'
root3 = etree.HTML(xml_data)
print(root3.tag)
html
print(etree.tostring(root3))
b'<html><body><root>data</root></body></html>'
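The practical difference shows up with malformed input: the XML functions raise an error, while etree.HTML uses the forgiving HTML parser and wraps whatever it recovers in html and body elements. A small sketch, assuming a deliberately broken string (the item tag is made up for the example):

```python
from lxml import etree

broken = "<root><item>data</root>"  # <item> is never closed

try:
    etree.fromstring(broken)
    xml_ok = True
except etree.XMLSyntaxError as exc:
    xml_ok = False
    print("XML parser rejected the input:", exc)

# The HTML parser recovers from the error and returns a tree rooted at <html>.
html_root = etree.HTML(broken)
print(html_root.tag)
print(etree.tostring(html_root))
```

This suggests using fromstring or XML for well-formed XML, and HTML only when the input really is (possibly messy) HTML.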
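Output is handled by the tostring function used throughout this article. Two more parameters are worth adding here: xml_declaration, which emits the XML declaration, and encoding, which sets the output encoding.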
root = etree.XML('<root><a><b/></a></root>')
print(etree.tostring(root))
b'<root><a><b/></a></root>'
# XML declaration
print(etree.tostring(root, xml_declaration=True))
b"<?xml version='1.0' encoding='ASCII'?>\n<root><a><b/></a></root>"
# Specify the encoding
print(etree.tostring(root, encoding='iso-8859-1'))
b"<?xml version='1.0' encoding='iso-8859-1'?>\n<root><a><b/></a></root>"
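One more value of encoding is handy in Python 3: passing encoding='unicode' makes tostring return a str directly, so no decode step is needed. A brief sketch on the same small tree (the names as_text and as_utf8 are only illustrative):

```python
from lxml import etree

root = etree.XML("<root><a><b/></a></root>")

# encoding='unicode' returns str instead of bytes (and no XML declaration).
as_text = etree.tostring(root, encoding="unicode")

# A byte encoding combined with an explicit declaration, e.g. for writing to disk.
as_utf8 = etree.tostring(root, encoding="UTF-8", xml_declaration=True)

print(type(as_text), as_text)
print(type(as_utf8), as_utf8)
```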
et = etree.parse('./sample.xml')
# The parse() method of an ElementTree instance also loads the file, but it returns the root Element rather than the tree:
# root = etree.ElementTree().parse('./sample.xml')
print(type(et))
et.getroot().set('add_root_attrib', 'attrib_value') # add or update an attribute on the root element
etree.dump(et.getroot())
<class 'lxml.etree._ElementTree'>
<TradingAccounts add_root_attrib="attrib_value">
<Constants ProjectName="DOTA" path="/home/DOTA/Trade" cpu="10"/>
<Strategies>
<Strategy name="CTA01" trade="true" commission="flase"/>
<Strategy name="CTA02" trade="true" commission="flase"/>
<Strategy name="ALPHA"/>
</Strategies>
<Accounts>
<Account name="RB" max="25" diff="0.01" ip="192.168.1.1" path="/home/RB">
<Strategy name="CTA01" num="3" prior="1" id="997"/>first strategy
<Strategy name="CTA02" num="10" prior="2" id="998"/>
</Account>
<Account name="i" max="15" diff="0.02" ip="192.168.1.1" path="/home/i">
<Strategy name="CTA01" num="2" prior="1" id="999">this is text
<Type id="10" name="FOF"/>
same text
</Strategy>
<Strategy name="CTA02" num="5" prior="2" id="1000"/>
</Account>
<Account name="IC" max="3" diff="0.02" ip="192.168.1.2" path="/home/IC">
<Strategy name="CTA01" num="5" prior="1" id="1001">
<Commission id="20" rate="0.01"/>
<Slip param="1"/>
</Strategy>
<Strategy name="CTA02" num="6" prior="2" id="1002"/>last strategy
</Account>
</Accounts>
</TradingAccounts>
print(type(et))
<class 'lxml.etree._ElementTree'>
et.write('./update_XML.xml') # write the tree to a new XML file
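write() accepts the same serialization options as tostring, so the declaration, encoding and pretty-printing can be set when saving. The round trip below is a sketch under stated assumptions: it uses a cut-down stand-in for sample.xml and a temporary directory, so it does not touch any real file.

```python
import os
import tempfile

from lxml import etree

# Cut-down stand-in for sample.xml, for illustration only.
root = etree.XML("<TradingAccounts><Constants cpu='10'/></TradingAccounts>")
tree = etree.ElementTree(root)
root.set("add_root_attrib", "attrib_value")

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "update_XML.xml")
    tree.write(out_path, encoding="UTF-8", xml_declaration=True, pretty_print=True)
    reloaded = etree.parse(out_path)  # read the file back before the directory is removed

print(reloaded.getroot().get("add_root_attrib"))
```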
xml = etree.parse('./sample.xml') # parse the XML file; returns an ElementTree object
print(xml)
print(type(xml))
<lxml.etree._ElementTree object at 0x000000000559D3C8>
<class 'lxml.etree._ElementTree'>
# Get the root element
print(xml.getroot())
print(xml.getroot().tag)
print(xml.find('TradingAccounts')) # find() on the ElementTree searches below the root, so the root itself is not found this way
print(xml.getroot()) # this is how to get the root element
<Element TradingAccounts at 0x5510c48>
TradingAccounts
None
<Element TradingAccounts at 0x5510c48>
# The following two are equivalent
print(xml.find('Constants'))
print(xml.getroot().find('Constants'))
print(xml.find('Constants').tag)
print(xml.getroot().find('Constants').tag)
# Difference between xml and xml.getroot():
print(type(xml), ' <-----VS-----> ', type(xml.getroot()))
# Both ElementTree and Element objects provide find and findall
<Element Constants at 0x55b8ac8>
<Element Constants at 0x55b8ac8>
Constants
Constants
<class 'lxml.etree._ElementTree'> <-----VS-----> <class 'lxml.etree._Element'>
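Both lookups land on the same node because a path given to the ElementTree is resolved starting from the root element. A small sketch, assuming a cut-down stand-in for sample.xml; getpath is the ElementTree method that turns an element into an absolute XPath:

```python
from lxml import etree

# Cut-down stand-in for sample.xml, for illustration only.
root = etree.XML("<TradingAccounts><Constants cpu='10'/></TradingAccounts>")
tree = etree.ElementTree(root)

via_tree = tree.find("Constants")
via_root = tree.getroot().find("Constants")

root_lookup = tree.find("TradingAccounts")  # None: the root is not its own child
path_a = tree.getpath(via_tree)
path_b = tree.getpath(via_root)

print(root_lookup, path_a, path_b)
```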
# attrib returns the attributes as a dict-like key-value mapping
print(xml.find('Constants').attrib)
print(xml.getroot().find('Constants').attrib)
{'path': '/home/DOTA/Trade', 'cpu': '10', 'ProjectName': 'DOTA'}
{'path': '/home/DOTA/Trade', 'cpu': '10', 'ProjectName': 'DOTA'}
# find(): returns the first matching element; a bare tag name is matched against direct children only
first_elem = xml.find('Constants')
print('first_elem= ', first_elem)
print(first_elem.tag)
first_elem = xml.find('Strategy') # no direct child is named Strategy, so None is returned
print('first_elem= ', first_elem)
first_elem= <Element Constants at 0x55b8208>
Constants
first_elem= None
search_first_elem = xml.find('.//Strategy') # first Strategy element found anywhere in the tree
print('search_first_elem= ', search_first_elem)
print(search_first_elem.tag)
print('search_first_elem.attrib = ', search_first_elem.attrib) # attrib is a dict-like mapping
search_first_elem= <Element Strategy at 0x55b8a48>
Strategy
search_first_elem.attrib = {'name': 'CTA01', 'trade': 'true', 'commission': 'flase'}
# First Strategy element anywhere below Accounts; // selects descendants of the current node, / selects its direct children
search_elem = xml.find('./Accounts//Strategy')
print('search_elem= ', search_elem)
print(search_elem.tag)
print('search_elem.attrib = ', search_elem.attrib)
search_elem= <Element Strategy at 0x55b8ac8>
Strategy
search_elem.attrib = {'id': '997', 'name': 'CTA01', 'num': '3', 'prior': '1'}
# Value of the name attribute of the first Strategy under the direct child Strategies
print(xml.find('Strategies').find('Strategy').attrib.get('name'))
CTA01
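Chaining find() calls like this raises AttributeError as soon as one step returns None. One way to guard against that is a small wrapper; first_attr below is a hypothetical helper written for this example, not part of lxml, and the tree is a cut-down stand-in for sample.xml.

```python
from lxml import etree

root = etree.XML(
    "<TradingAccounts><Strategies><Strategy name='CTA01'/></Strategies></TradingAccounts>"
)

def first_attr(node, path, attr, default=None):
    """Return attr of the first element matching path, or default if none matches."""
    match = node.find(path)
    return default if match is None else match.get(attr, default)

present = first_attr(root, "Strategies/Strategy", "name")
absent = first_attr(root, "Missing/Strategy", "name", default="n/a")
print(present, absent)
```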
# findall(): returns a list of all matching elements
strategies = xml.findall('.//Strategy') # every Strategy element in the tree
print('all= ', strategies)
print('len(all)=', len(strategies))
all_names_1 = [i.get('name') for i in strategies] # i is an Element object
all_names_2 = [i.attrib.get('name') for i in strategies] # i.attrib.get('name') is equivalent to i.get('name')
print('all_names_1 = ', all_names_1)
print('all_names_2 = ', all_names_2)
all= [<Element Strategy at 0x55b8a48>, <Element Strategy at 0x55b8cc8>, <Element Strategy at 0x55b8fc8>, <Element Strategy at 0x55b8ac8>, <Element Strategy at 0x55b8ec8>, <Element Strategy at 0x55b8f48>, <Element Strategy at 0x55b8dc8>, <Element Strategy at 0x55b8f08>, <Element Strategy at 0x55b8d08>]
len(all)= 9
all_names_1 = ['CTA01', 'CTA02', 'ALPHA', 'CTA01', 'CTA02', 'CTA01', 'CTA02', 'CTA01', 'CTA02']
all_names_2 = ['CTA01', 'CTA02', 'ALPHA', 'CTA01', 'CTA02', 'CTA01', 'CTA02', 'CTA01', 'CTA02']
# All child elements of the Strategy elements under Accounts/Account (the trailing '/' is treated as '/*')
child_all_elem = xml.findall('Accounts/Account/Strategy/')
print('child_all_elem = ', child_all_elem)
child_all_elem_tags = [i.tag for i in child_all_elem]
print('child_all_elem_tags = ', child_all_elem_tags)
child_all_elem = [<Element Type at 0x558ae48>, <Element Commission at 0x558ab88>, <Element Slip at 0x558a9c8>]
child_all_elem_tags = ['Type', 'Commission', 'Slip']
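findall also accepts simple predicates in square brackets, which saves filtering the list afterwards. The sketch below assumes a shortened version of the sample data and shows an attribute-value test, an attribute-exists test, and iterfind, the lazy counterpart of findall:

```python
from lxml import etree

# Shortened version of the sample data, for illustration only.
root = etree.XML(
    "<TradingAccounts>"
    "<Strategies><Strategy name='CTA01'/><Strategy name='ALPHA'/></Strategies>"
    "<Accounts><Account name='RB'>"
    "<Strategy name='CTA01' num='3'/><Strategy name='CTA02' num='10'/>"
    "</Account></Accounts>"
    "</TradingAccounts>"
)

cta01 = root.findall(".//Strategy[@name='CTA01']")  # attribute equals a value
with_num = root.findall(".//Strategy[@num]")        # attribute is present

# iterfind() yields matches one by one instead of building a list first.
nums = [int(s.get("num")) for s in root.iterfind("Accounts/Account/Strategy")]

print(len(cta01), len(with_num), nums)
```

For anything beyond these simple predicates, the xpath() method shown earlier is the more capable tool.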