
Parsing Large XML Files with the StAX Parser


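Preface

Most of this article is excerpted from IBM developerWorks (mainly the theory); see the three articles listed below. I copied it out to deepen my own understanding and to keep as notes for the next time I need it. The excerpts are not complete, and the original articles are far richer, so please refer to them for details.

References:

Parsing XML with StAX, Part 1: Introduction to the Streaming API for XML (StAX): http://www.ibm.com/developerworks/cn/xml/x-stax1.html
Parsing XML with StAX, Part 2: Pull parsing and events: http://www.ibm.com/developerworks/cn/xml/x-stax2.html
Parsing XML with StAX, Part 3: Using custom events and writing XML: http://www.ibm.com/developerworks/cn/xml/x-stax3.html

Original link: https://blog.csdn.net/zhyh1986/article/details/8528649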

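I won't describe StAX itself any further. Instead, here is the problem I ran into while parsing an XML file.

Requirement:
Parse every element named entity in a 4 GB XML file, including its nested child elements and their content, and write the extracted entity data evenly into 7 new XML files.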

Broadly speaking, there are four ways to parse XML:

  • DOM parsing
  • SAX parsing
  • DOM4J parsing
  • JDOM parsing

The pros and cons of these four approaches:


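1. SAX parsing (Simple API for XML)

How SAX parses: it scans the document sequentially and parses as it goes, reporting what it finds through event callbacks. Unlike DOM, SAX lets you stop parsing at any point in the document, and it is a faster, more efficient approach.

Pros: the whole document does not have to be loaded first, so it uses few resources. Parsing can start immediately, it is fast, and there is no memory pressure.

Cons: nodes cannot be modified.

Best for: reading XML files.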
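To make the event-callback model concrete, here is a minimal sketch that uses the JDK's built-in SAX parser to count elements with a given name. The class name `SaxCountSketch`, the method `countElements`, and the tiny in-memory document are illustrative assumptions, not part of the original article.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: counts elements by reacting to SAX callbacks.
public class SaxCountSketch {

    /** Returns how many elements named {@code name} occur in the XML text. */
    public static int countElements(String xml, String name) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                // The parser pushes one callback per start tag; nothing is kept in memory.
                if (name.equals(qName)) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String xml = "<entities><entity id=\"1\"/><entity id=\"2\"/></entities>";
        System.out.println(countElements(xml, "entity")); // prints 2
    }
}
```

Because the handler only increments a counter, memory use stays flat however many elements the document contains. Memory grows only if the handler itself accumulates data.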
 

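2. DOM parsing (Document Object Model)

How DOM parses: DOM defines a set of interfaces for parsing XML documents. The parser reads in the whole document and builds a tree structure in memory, which you can then manipulate through the DOM interfaces.

Pros: the whole document tree is in memory, which makes it easy to work with; it supports deleting, modifying, rearranging, and other operations.

Cons: with large files, memory comes under pressure and parsing takes longer. Loading the entire document into memory (including nodes you do not need) wastes time and space.

Best for: modifying XML data.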
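As a contrast with streaming parsers, the following sketch loads a small document into a DOM tree with the JDK's `DocumentBuilder` and then removes a node, which is only possible because the whole tree is in memory. The class `DomEditSketch` and its helper methods are hypothetical names chosen for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class: edits an in-memory DOM tree.
public class DomEditSketch {

    /** Parses the complete XML text into an in-memory tree. */
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Removes the first element named tagName and returns how many remain. */
    public static int removeFirst(Document doc, String tagName) {
        NodeList matches = doc.getElementsByTagName(tagName);
        if (matches.getLength() > 0) {
            Node first = matches.item(0);
            first.getParentNode().removeChild(first);
        }
        return doc.getElementsByTagName(tagName).getLength();
    }

    public static void main(String[] args) {
        Document doc = parse("<entities><entity id=\"1\"/><entity id=\"2\"/></entities>");
        System.out.println(removeFirst(doc, "entity")); // prints 1
        Element remaining = (Element) doc.getElementsByTagName("entity").item(0);
        System.out.println(remaining.getAttribute("id")); // prints 2
    }
}
```

The same convenience is what makes DOM unsuitable for multi-gigabyte input: `parse` does not return until every node of the document has been materialized.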
 


 

3.JDOM

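JDOM is a pure Java API for processing XML. It uses concrete classes rather than interfaces. JDOM offers tree traversal and also follows the same Java conventions as SAX. JDOM differs from DOM in two main ways.
First, JDOM uses only concrete classes and no interfaces. This simplifies the API in some respects, but it also limits flexibility.
Second, the API makes heavy use of the Collections classes, which makes it easier for Java developers who already know those classes.

JDOM does not include a parser of its own. It usually uses a SAX2 parser to parse and validate the input XML document (although it can also take a previously constructed DOM representation as input). It includes converters that output the JDOM representation as a SAX2 event stream, a DOM model, or an XML text document.

 
Pros: 1. It is a tree-based Java API for processing XML that loads the tree into memory.

2. It has no backward-compatibility constraints, so it is simpler than DOM.

3. It is fast.

4. It follows the same Java conventions as SAX.

Cons: 1. It cannot handle documents larger than memory.

2. JDOM represents the logical model of an XML document, so it does not guarantee that every byte is preserved when the document is transformed.

3. It provides no actual model of the DTD or schema for an instance document.

4. It does not support the corresponding traversal package found in DOM.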
 

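4. DOM4J

DOM4J has a more complex API, so it offers more flexibility than JDOM. DOM4J has the best performance; even Sun's JAXM uses DOM4J. Many open-source projects use DOM4J heavily; the well-known Hibernate, for example, also uses DOM4J to read its XML configuration files. If portability is not a concern, use DOM4J.

Pros: the most flexible, easy to use, powerful, and excellent performance.

Cons: complex API, poor portability.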

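I tried essentially all four of these approaches on the requirement above.

The first one I used was DOM, but it can only handle fairly small XML files; a very large file causes an out-of-memory error because DOM loads the whole document at once.

After that I tried DOM4J and SAX, but because of the memory limits of my machine, both still ended with a JVM out-of-memory error.

With no other option, I eventually found that StAX can also parse large XML files.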
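For readers new to StAX, the sketch below shows the basic pull loop with the JDK's `XMLStreamReader`: the application asks for the next event with `next()` and inspects only the events it cares about. The class `StaxPullSketch`, the method `entityIds`, and the in-memory sample are assumptions made for illustration; they are not taken from the original article.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class: collects the id attribute of every <entity>.
public class StaxPullSketch {

    /** Pulls events one at a time and records the id of each entity element. */
    public static List<String> entityIds(String xml) {
        List<String> ids = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    int event = reader.next(); // the application decides when to advance
                    if (event == XMLStreamConstants.START_ELEMENT
                            && "entity".equals(reader.getLocalName())) {
                        // A null namespace means the attribute namespace is not checked.
                        ids.add(reader.getAttributeValue(null, "id"));
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return ids;
    }

    public static void main(String[] args) {
        String xml = "<entities><entity id=\"1\"/><entity id=\"2\"/></entities>";
        System.out.println(entityIds(xml)); // prints [1, 2]
    }
}
```

Only the current event is held by the reader, so the same loop works unchanged when the source is a multi-gigabyte `FileInputStream` instead of a string.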

An excerpt of the XML file to be parsed:

<?xml version='1.0' encoding='UTF-8'?>
<gwl>
  <version>20230417084108</version>
  <entities>
    <entity id="1123831" version="20230414163503">
      <name>ALMOND, LINCOLN CARTER</name>
      <listId>1021</listId>
      <listCode>USP</listCode>
      <entityType>03</entityType>
      <createdDate>09/02/2004</createdDate>
      <lastUpdateDate>04/14/2023</lastUpdateDate>
      <source>USP</source>
      <OriginalSource>PEP</OriginalSource>
      <dobs>
        <dob Y="1936">06/16/1936</dob>
      </dobs>
      <pobs>
        <pob>Pawtucket, Rhode Island, United States</pob>
      </pobs>
      <titles>
        <title>FORMER GOVERNOR OF RHODE ISLAND (JANUARY 3, 1995 - JANUARY 7, 2003). DECEASED JANUARY 02, 2023.</title>
      </titles>
      <sdfs>
        <sdf name="OtherInformation">Career: Governor of Rhode Island (January 03, 1995 - January 07, 2003); United State Attorney for the District of Rhode Island (October 09, 1981 - January 20, 1993); United State Attorney for the District of Rhode Island (1969 - 1978).</sdf>
        <sdf name="DirectID">https://accuity.worldcompliance.com/signin.aspx?ent=d14d930f-7943-4363-b4d0-aa2c59437e1b</sdf>
        <sdf name="EffectiveDate">1981</sdf>
        <sdf name="EntityLevel">State</sdf>
        <sdf name="ExpirationDate">1993</sdf>
        <sdf name="Gender">MALE</sdf>
        <sdf name="NameSource">Website</sdf>
        <sdf name="Org_PID">1706394</sdf>
        <sdf name="OriginalID">7031</sdf>
        <sdf name="Relationship">Father</sdf>
        <sdf name="SubCategory">Former PEP</sdf>
      </sdfs>
      <addresses>
        <address>
          <country>US</country>
          <countryName>UNITED STATES</countryName>
        </address>
      </addresses>
    </entity>
    <entity id="1124766" version="20230414163503">
      <name>BAUCUS, MAX SIEBEN</name>
      <listId>1021</listId>
      <listCode>USP</listCode>
      <entityType>03</entityType>
      <createdDate>09/02/2004</createdDate>
      <lastUpdateDate>04/14/2023</lastUpdateDate>
      <source>USP</source>
      <OriginalSource>PEP</OriginalSource>
      <dobs>
        <dob Y="1941">12/11/1941</dob>
      </dobs>
      <pobs>
        <pob>Helena, Montana, United States</pob>
      </pobs>
      <aliases>
        <alias type="Alias">ENKE, MAX SIEBEN</alias>
      </aliases>
      <titles>
        <title>FORMER AMBASSADOR OF THE UNITED STATES TO CHINA (MARCH 20, 2014 - JANUARY 16, 2017).</title>
      </titles>
      <sdfs>
        <sdf name="OtherInformation">Political Party: Democratic. Career: Ambassador Extraordinary and Plenipotentiary of the United States to China, (March 20, 2014 - January 16, 2017); Member of the United States Congress, Senate from Montana (December 15, 1978 - February 06, 2014);</sdf>
        <sdf name="DirectID">https://accuity.worldcompliance.com/signin.aspx?ent=945fd382-f5b7-42c4-ad1f-a40c4bf0e285</sdf>
        <sdf name="EffectiveDate">1978</sdf>
        <sdf name="EntityLevel">National</sdf>
        <sdf name="ExpirationDate">2014</sdf>
        <sdf name="Gender">MALE</sdf>
        <sdf name="NameSource">Website</sdf>
        <sdf name="Org_PID">548118</sdf>
        <sdf name="OriginalID">7542</sdf>
        <sdf name="Relationship">Brother</sdf>
        <sdf name="SubCategory">Former PEP</sdf>
      </sdfs>
      <addresses>
        <address>
          <country>US</country>
          <countryName>UNITED STATES</countryName>
          <province>WASHINGTON, DC</province>
          <postalCode>20515</postalCode>
        </address>
        <address>
          <country>US</country>
          <countryName>UNITED STATES</countryName>
          <province>WASHINGTON, D.C.</province>
          <postalCode>20510</postalCode>
        </address>
        <address>
          <address1>55 ANJIALOU RD</address1>
          <city>BEIJING</city>
          <country>CN</country>
          <countryName>CHINA</countryName>
          <postalCode>100600</postalCode>
        </address>
      </addresses>
    </entity>
    <entity id="1124842" version="20230414163503">
      <name>THOMAS, CRAIG LYLE</name>
      <listId>1021</listId>
      <listCode>USP</listCode>
      <entityType>03</entityType>
      <createdDate>09/02/2004</createdDate>
      <lastUpdateDate>04/14/2023</lastUpdateDate>
      <source>USP</source>
      <OriginalSource>PEP</OriginalSource>
      <dobs>
        <dob Y="1933">02/17/1933</dob>
      </dobs>
      <pobs>
        <pob>Cody, Wyoming, United States</pob>
      </pobs>
      <titles>
        <title>FORMER MEMBER OF THE UNITED STATES CONGRESS (JANUARY 03, 1995 - JUNE 04, 2007). DECEASED JUNE 04, 2007.</title>
      </titles>
      <sdfs>
        <sdf name="OtherInformation">Political Party: Republican. Career: Member of the United States Congress, Senate, Class I (January 03, 1995 - June 04, 2007); Member of the United States Congress, House of Representatives, At-Large (April 27, 1989 - January 03, 1995). Member of the</sdf>
        <sdf name="DirectID">https://accuity.worldcompliance.com/signin.aspx?ent=4e7b1050-36b5-4b1c-9037-c2349c519d40</sdf>
        <sdf name="EffectiveDate">1989</sdf>
        <sdf name="EntityLevel">National</sdf>
        <sdf name="ExpirationDate">1995</sdf>
        <sdf name="Gender">MALE</sdf>
        <sdf name="NameSource">Website</sdf>
        <sdf name="Org_PID">1817490</sdf>
        <sdf name="OriginalID">7629</sdf>
        <sdf name="Relationship">Father</sdf>
        <sdf name="SubCategory">Former PEP</sdf>
      </sdfs>
      <addresses>
        <address>
          <country>US</country>
          <countryName>UNITED STATES</countryName>
          <province>WASHINGTON D.C.</province>
          <postalCode>20510</postalCode>
        </address>
        <address>
          <address1>200 WEST 24TH STREET</address1>
          <city>CHEYENNE</city>
          <state>WY</state>
          <stateName>WYOMING</stateName>
          <country>US</country>
          <countryName>UNITED STATES</countryName>
          <postalCode>82002</postalCode>
        </address>
      </addresses>
    </entity>
    <entity id="1125230" version="20230414163051">
      <name>PATRIAT, FRANCOIS</name>
      <listId>1020</listId>
      <listCode>PEP</listCode>
      <entityType>03</entityType>
      <createdDate>09/02/2004</createdDate>
      <lastUpdateDate>04/14/2023</lastUpdateDate>
      <source>PEP</source>
      <OriginalSource>PEP</OriginalSource>
      <dobs>
        <dob Y="1943">03/21/1943</dob>
      </dobs>
      <pobs>
        <pob>Semur-en-Auxois, , France</pob>
      </pobs>
      <titles>
        <title>MEMBER OF THE FRENCH PARLIAMENT (OCTOBER 01, 2008 - 2026).</title>
      </titles>
      <sdfs>
        <sdf name="OtherInformation">Political party: La Republique en marche (LREM) (currently known as Renaissance). Career: Member of the Executive Bureau of La Republique en Marche (LREM), The Republic on the Move (currently known as Renaissance), effective from November 18, 2017;</sdf>
        <sdf name="DirectID">https://accuity.worldcompliance.com/signin.aspx?ent=a4ffd4f3-5c75-440b-aeca-4e3a7d2ef642</sdf>
        <sdf name="EffectiveDate">2008</sdf>
        <sdf name="EntityLevel">National</sdf>
        <sdf name="ExpirationDate">2026</sdf>
        <sdf name="Gender">MALE</sdf>
        <sdf name="NameSource">Website</sdf>
        <sdf name="Org_PID">3759009</sdf>
        <sdf name="OriginalID">8117</sdf>
        <sdf name="Relationship">Associate</sdf>
        <sdf name="SubCategory">Govt Branch Member</sdf>
      </sdfs>
      <addresses>
        <address>
          <address1>15, RUE DE VAUGIRARD</address1>
          <city>PARIS</city>
          <country>FR</country>
          <countryName>FRANCE</countryName>
          <postalCode>75291</postalCode>
        </address>
      </addresses>
    </entity>
    <entity id="1125282" version="20230414163052">
      <name>BENOUTIQ, ABDELKRIM</name>
      <listId>1020</listId>
      <listCode>PEP</listCode>
      <entityType>03</entityType>
      <createdDate>09/02/2004</createdDate>
      <lastUpdateDate>04/14/2023</lastUpdateDate>
      <source>PEP</source>
      <OriginalSource>PEP</OriginalSource>
      <dobs>
        <dob Y="1959">08/19/1959</dob>
      </dobs>
      <pobs>
        <pob>Rabat, Rabat-Sale-Kenitra Region, Morocco</pob>
      </pobs>
      <aliases>
        <alias type="Alias">BEN ATIQ, ABDELKRIM</alias>
        <alias type="Alias">BENATIQ, ABDELKRIM</alias>
      </aliases>
      <nativeCharNames>
        <nativeCharName charSet="" latinCharName="BEN ATIQ, ABDELKRIM" type="Alias">??? ?????? ?? ????</nativeCharName>
        <nativeCharName charSet="" latinCharName="BENATIQ, ABDELKRIM" type="Alias">??? ?????? ??????</nativeCharName>
        <nativeCharName charSet="" latinCharName="BENOUTIQ, ABDELKRIM" type="Primary">??? ?????? ??????</nativeCharName>
      </nativeCharNames>
      <titles>
        <title>FORMER MEMBER OF THE POLITICAL BUREAU OF SOCIALIST UNION OF POPULAR FORCES PARTY, MOROCCO, ELECTED JUNE 10, 2017, EFFECTIVE UNTIL APRIL 24, 2022.</title>
      </titles>
      <sdfs>
        <sdf name="OtherInformation">Political Party: Union Socialiste Des Forces Populaires (USFP) Career: Member of the Political Bureau of Union Socialiste Des Forces Populaires (USFP), Socialist Union of Popular Forces Party, elected June 10, 2017, effective until April 24, 2022;</sdf>
        <sdf name="DirectID">https://accuity.worldcompliance.com/signin.aspx?ent=35f8bcea-6169-4a8f-9715-81de730d1c17</sdf>
        <sdf name="EffectiveDate">2000</sdf>
        <sdf name="EntityLevel">National</sdf>
        <sdf name="ExpirationDate">2001</sdf>
        <sdf name="Gender">MALE</sdf>
        <sdf name="NameSource">Website</sdf>
        <sdf name="OriginalID">8181</sdf>
        <sdf name="SubCategory">Former PEP</sdf>
      </sdfs>
      <addresses>
        <address>
          <address1>9, AVENUE AL ARAAR</address1>
          <city>RABAT</city>
          <country>MA</country>
          <countryName>MOROCCO</countryName>
          <province>RABAT-SALE-KENITRA REGION</province>
        </address>
        <address>
          <address1>AVENUE F.ROOSEVELT</address1>
          <city>RABAT</city>
          <country>MA</country>
          <countryName>MOROCCO</countryName>
          <province>RABAT-SALE-KENITRA REGION</province>
        </address>
        <address>
          <address1>NO. 9 ARAR STREET</address1>
          <city>RABAT</city>
          <country>MA</country>
          <countryName>MOROCCO</countryName>
          <province>RABAT-SALE-KENITRA REGION</province>
        </address>
      </addresses>
    </entity>
    <entity id="1125443" version="20230414163053">
      <name>OLLING, SVEND</name>
      <listId>1020</listId>
      <listCode>PEP</listCode>
      <entityType>03</entityType>
      <createdDate>09/02/2004</createdDate>
      <lastUpdateDate>04/14/2023</lastUpdateDate>
      <source>PEP</source>
      <OriginalSource>PEP</OriginalSource>
      <dobs>
        <dob Y="1967">11/09/1967</dob>
      </dobs>
      <pobs>
        <pob>Glostrup, , Denmark</pob>
      </pobs>
      <titles>
        <title>AMBASSADOR OF DENMARK TO SOUTH KOREA, AS OF MARCH 30, 2023.</title>
      </titles>
      <sdfs>
        <sdf name="OtherInformation">Career: Ambassador of Denmark to South Korea, as of March 30, 2023; Ambassador of Denmark to Egypt, as of May 28, 2020, expiration reported March 20, 2023; Non-Resident Ambassador of Denmark to Azerbaijan, effective from March 26, 2017, expiration</sdf>
        <sdf name="DirectID">https://accuity.worldcompliance.com/signin.aspx?ent=ef160921-f06b-4942-9527-0ee7565467c0</sdf>
        <sdf name="EffectiveDate">2023</sdf>
        <sdf name="EntityLevel">International</sdf>
        <sdf name="Gender">MALE</sdf>
        <sdf name="NameSource">Website</sdf>
        <sdf name="Org_PID">8698914</sdf>
        <sdf name="OriginalID">8384</sdf>
        <sdf name="Relationship">Father</sdf>
        <sdf name="SubCategory">Diplomat</sdf>
      </sdfs>
      <addresses>
        <address>
          <address1>416, HANGANG-DAERO, JUNG-GU</address1>
          <city>SEOUL</city>
          <country>KR</country>
          <countryName>KOREA, REPUBLIC OF</countryName>
          <postalCode>04637</postalCode>
        </address>
        <address>
          <address1>TURAN GUENES BULVARI 106</address1>
          <city>ANKARA</city>
          <country>TR</country>
          <countryName>TURKEY</countryName>
          <postalCode>06550</postalCode>
        </address>
        <address>
          <address1>ASIATISK PLADS 2</address1>
          <city>COPENHAGEN</city>
          <country>DK</country>
          <countryName>DENMARK</countryName>
          <postalCode>1448</postalCode>
        </address>
        <address>
          <address1>NORTH AVENUE</address1>
          <city>DHAKA</city>
          <country>BD</country>
          <countryName>BANGLADESH</countryName>
          <postalCode>1212</postalCode>
        </address>
        <address>
          <city>CAIRO</city>
          <country>EG</country>
          <countryName>EGYPT</countryName>
        </address>
      </addresses>
    </entity>
    <entity id="1125610" version="20230414163054">
      <name>TAKAHASHI, KOICHI</name>
      <listId>1020</listId>
      <listCode>PEP</listCode>
      <entityType>03</entityType>
      <createdDate>09/02/2004</createdDate>
      <lastUpdateDate>04/14/2023</lastUpdateDate>
      <source>PEP</source>
      <OriginalSource>PEP</OriginalSource>
      <dobs>
        <dob Y="1944">1944</dob>
      </dobs>
      <nativeCharNames>
        <nativeCharName charSet="" latinCharName="TAKAHASHI, KOICHI" type="Primary">たかはし こういち</nativeCharName>
        <nativeCharName charSet="" latinCharName="TAKAHASHI, KOICHI" type="Primary">高橋 恒一</nativeCharName>
      </nativeCharNames>
      <titles>
        <title>FORMER AMBASSADOR OF JAPAN TO THE CZECH REPUBLIC (FEBRUARY 03, 2003 - 2005).</title>
      </titles>
      <sdfs>
        <sdf name="OtherInformation">Career: Ambassador of Japan to the Czech Republic (February 03, 2003 - 2005); Deputy Vice-Minister in charge of Immigration Bureau, Ministry of Justice (1999 - 2001); Consul-General of Japan to Berlin City, Germany (1995 - 1997); Minister of Japan to</sdf>
        <sdf name="DirectID">https://accuity.worldcompliance.com/signin.aspx?ent=9b2a063e-8d55-4806-b2f2-f2c79d815a33</sdf>
        <sdf name="EffectiveDate">1999</sdf>
        <sdf name="EntityLevel">National</sdf>
        <sdf name="ExpirationDate">2001</sdf>
        <sdf name="Gender">MALE</sdf>
        <sdf name="NameSource">Website</sdf>
        <sdf name="OriginalID">8483</sdf>
        <sdf name="SubCategory">Former PEP</sdf>
      </sdfs>
      <addresses>
        <address>
          <country>JP</country>
          <countryName>JAPAN</countryName>
        </address>
      </addresses>
    </entity>
    <entity id="1125925" version="20230414163054">
      <name>PINTER, SANDOR</name>
      <listId>1020</listId>
      <listCode>PEP</listCode>
      <entityType>03</entityType>
      <createdDate>09/02/2004</createdDate>
      <lastUpdateDate>04/14/2023</lastUpdateDate>
      <source>PEP</source>
      <OriginalSource>PEP</OriginalSource>
      <dobs>
        <dob Y="1948">07/03/1948</dob>
      </dobs>
      <pobs>
        <pob>Budapest, , Hungary</pob>
      </pobs>
      <titles>
        <title>DEPUTY PRIME MINISTER OF HUNGARY, EFFECTIVE FROM MAY 04, 2018.</title>
      </titles>
      <sdfs>
        <sdf name="OtherInformation">Career: Deputy Prime Minister, effective from May 04, 2018; Minister of Interior, effective from May 29, 2010; Minister of Interior (July 08, 1998 - May 27, 2002); Chief of the Hungarian National Police (September 18, 1991 - 1996).</sdf>
        <sdf name="DirectID">https://accuity.worldcompliance.com/signin.aspx?ent=cd135a22-6242-4999-bc6f-5aae5b0f92e2</sdf>
        <sdf name="EffectiveDate">2018</sdf>
        <sdf name="EntityLevel">National</sdf>
        <sdf name="Gender">MALE</sdf>
        <sdf name="NameSource">Website</sdf>
        <sdf name="Org_PID">2544374</sdf>
        <sdf name="OriginalID">11549</sdf>
        <sdf name="Relationship">Father</sdf>
        <sdf name="SubCategory">Govt Branch Member</sdf>
      </sdfs>
      <addresses>
        <address>
          <address1>TEVE U. 4-6.</address1>
          <city>BUDAPEST</city>
          <country>HU</country>
          <countryName>HUNGARY</countryName>
          <postalCode>1139</postalCode>
        </address>
        <address>
          <address1>JOZSEF ATTILA U. 2-4.</address1>
          <city>BUDAPEST</city>
          <country>HU</country>
          <countryName>HUNGARY</countryName>
          <postalCode>1051</postalCode>
        </address>
      </addresses>
    </entity>
  </entities>
</gwl>

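The code below uses StAX to extract everything inside the entity elements of the XML file above and write it evenly into 7 new XML files, each of which follows the same fixed, custom layout: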

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

public class StAXParserTest {

    public static void main(String[] args) {
        String inputFile = "D:\\Desktop\\PEP\\ENTITY.XML"; // path of the input XML file
        String outputPrefix = "D:\\Desktop\\PEP\\";        // prefix for the output XML files
        int numFiles = 7;                                   // number of new files

        // try-with-resources closes the input stream even if parsing fails
        try (InputStream inputStream = new FileInputStream(inputFile)) {
            // Create the XML input factory and use it to create the XMLStreamReader
            XMLInputFactory inputFactory = XMLInputFactory.newInstance();
            XMLStreamReader reader = inputFactory.createXMLStreamReader(inputStream);

            // Create the XML output factory, plus one output stream and one writer per file
            XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
            OutputStream[] outputStreams = new OutputStream[numFiles];
            XMLStreamWriter[] writers = new XMLStreamWriter[numFiles];
            for (int i = 0; i < numFiles; i++) {
                String outputFileName = outputPrefix + (i + 1) + ".xml";
                outputStreams[i] = new FileOutputStream(outputFileName);
                // Pass the encoding explicitly so the bytes written match the XML declaration
                writers[i] = outputFactory.createXMLStreamWriter(outputStreams[i], "UTF-8");
                // Write the XML declaration, e.g. <?xml version="1.0" encoding="UTF-8"?>
                writers[i].writeStartDocument("UTF-8", "1.0");
                // Add a line break
                writers[i].writeCharacters("\n");
                // Open the <gwl> root element
                writers[i].writeStartElement("gwl");
                writers[i].writeCharacters("\n");
                // Write <version> with its value, then close it with </version>
                writers[i].writeStartElement("version");
                writers[i].writeCharacters("20230417084108");
                writers[i].writeEndElement();
                writers[i].writeCharacters("\n");
                // Open <entities>; it stays open until every entity has been written
                writers[i].writeStartElement("entities");
            }

            // Parse the input and distribute the entities across the files in turn
            int currentFileIndex = 0;
            int entityCount = 0;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "entity".equals(reader.getLocalName())) {
                    // Copy the entity element and all of its children
                    writeEntityElement(reader, writers[currentFileIndex]);
                    entityCount++;
                    // Switch to the next file
                    currentFileIndex = (currentFileIndex + 1) % numFiles;
                }
            }
            reader.close();

            // Close the open elements, then the writers and output streams
            for (int i = 0; i < numFiles; i++) {
                writers[i].writeCharacters("\n");
                writers[i].writeEndElement(); // </entities>
                writers[i].writeCharacters("\n");
                writers[i].writeEndElement(); // </gwl>
                writers[i].writeCharacters("\n");
                writers[i].writeEndDocument();
                writers[i].flush();
                writers[i].close();
                outputStreams[i].close();
            }

            System.out.println("Total number of entities: " + entityCount);
            System.out.println("Entities per file (rounded down): " + (entityCount / numFiles));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Copies the current <entity> element, with its attributes, children and text, to the writer.
    // Namespaces are not copied; the sample file does not use any.
    private static void writeEntityElement(XMLStreamReader reader, XMLStreamWriter writer)
            throws XMLStreamException {
        writer.writeCharacters("\n");
        // Start the <entity> element and copy its attributes (id and version)
        writer.writeStartElement("entity");
        copyAttributes(reader, writer);
        // Walk through the children of <entity>
        while (reader.hasNext()) {
            int event = reader.next();
            switch (event) {
                case XMLStreamConstants.START_ELEMENT:
                    // Write the start tag under the same name
                    writer.writeStartElement(reader.getLocalName());
                    // Children such as <dob Y="..."> and <sdf name="..."> carry attributes too
                    copyAttributes(reader, writer);
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    // Write the matching end tag
                    writer.writeEndElement();
                    if ("entity".equals(reader.getLocalName())) {
                        // The entity element is complete; stop copying
                        return;
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    writer.writeCharacters(reader.getText());
                    break;
            }
        }
    }

    // Copies every attribute of the element the reader is currently positioned on.
    private static void copyAttributes(XMLStreamReader reader, XMLStreamWriter writer)
            throws XMLStreamException {
        for (int i = 0; i < reader.getAttributeCount(); i++) {
            writer.writeAttribute(reader.getAttributeLocalName(i), reader.getAttributeValue(i));
        }
    }
}

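The XML excerpt above contains 8 entity elements in total. After parsing, each of the 7 XML files receives one entity, and the 1 left over is assigned in turn, so the first XML file holds 2 entities and the other 6 hold one each.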
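The distribution follows from the round-robin index update `(currentFileIndex + 1) % numFiles`. The short sketch below replays that arithmetic for 8 entities and 7 files without touching any XML; the class name `RoundRobinSketch` and the method `distribute` are made up for this illustration.

```java
import java.util.Arrays;

// Hypothetical example class: replays the round-robin file assignment.
public class RoundRobinSketch {

    /** Returns how many entities each file receives under round-robin assignment. */
    public static int[] distribute(int entityCount, int numFiles) {
        int[] perFile = new int[numFiles];
        int currentFileIndex = 0;
        for (int i = 0; i < entityCount; i++) {
            perFile[currentFileIndex]++;
            // Same update as in the parsing loop: wrap around after the last file.
            currentFileIndex = (currentFileIndex + 1) % numFiles;
        }
        return perFile;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(distribute(8, 7))); // prints [2, 1, 1, 1, 1, 1, 1]
    }
}
```

In general the file sizes differ by at most one entity, which is why the output can be described as evenly distributed even when the entity count is not a multiple of 7.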

I parsed the complete 4 GB Entity.xml file this way without any out-of-memory errors, and parsing was fast too!
