
15 Practical Python Text-Processing Examples Worth Bookmarking


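Text processing is one of the most common things Python is used for. This article collects a range of carefully chosen examples covering text extraction and NLP.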

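The article is long, so bear with it. If you can't finish it in one sitting, bookmark it, because you will need it sooner or later.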

1. Extracting PDF content

    # pip install PyPDF2
    # Uses the PyPDF2 3.x API; PdfFileReader, numPages, getPage and
    # extractText were removed in PyPDF2 3.0.
    from PyPDF2 import PdfReader

    # Open the PDF in binary mode; the with-block closes the file automatically.
    with open("test.pdf", "rb") as pdf:
        # Create the PDF reader object.
        pdf_reader = PdfReader(pdf)
        # Check the total number of pages in the file.
        total_pages = len(pdf_reader.pages)
        print("Total number of pages:", total_pages)
        # Page indices are zero-based; check that the page exists before reading it.
        page_index = 200
        if page_index < total_pages:
            # Extract the text of that specific page.
            print(pdf_reader.pages[page_index].extract_text())
        else:
            print(f"Page index {page_index} is out of range.")
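To pull the text of every page rather than a single one, the same reader object can be looped over. The following is a minimal sketch, assuming the PyPDF2 3.x interface (`PdfReader`, `.pages`, `.extract_text()`); the helper names `extract_all_text` and `read_pdf_text` are hypothetical and are not part of the example above.

```python
def extract_all_text(reader):
    """Join the text of every page of a PdfReader-like object.

    Works with any object exposing `.pages`, where each page has an
    `.extract_text()` method (the PyPDF2 3.x interface).
    """
    parts = []
    for page in reader.pages:
        # Fall back to an empty string if a page yields no text.
        parts.append(page.extract_text() or "")
    return "\n".join(parts)


def read_pdf_text(path):
    """Open the PDF at `path` and return the text of all its pages."""
    # Imported lazily so extract_all_text stays usable without PyPDF2.
    from PyPDF2 import PdfReader

    with open(path, "rb") as pdf:
        return extract_all_text(PdfReader(pdf))
```

Calling `read_pdf_text("test.pdf")` would return one string with the pages separated by newlines. Scanned, image-only pages typically produce little or no text with this approach.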

2. Extracting Word content

    # pip install python-docx
    import docx
    from docx.opc.exceptions import PackageNotFoundError

    def main():
        try:
            # Create the Word document reader object.
            doc = docx.Document('test.docx')
            # Collect the text of every paragraph, then join with newlines.
            full_text = [para.text for para in doc.paragraphs]
            data = '\n'.join(full_text)
            print(data)
        except (IOError, PackageNotFoundError):
            # python-docx raises PackageNotFoundError when the path is
            # missing or is not a valid .docx package.
            print('There was an error opening the file!')
            return

    if __name__ == '__main__':
        main()
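`doc.paragraphs` only covers body paragraphs; text that sits inside tables is reached through `doc.tables`. The following is a rough sketch of collecting it, assuming python-docx's `tables`, `rows`, `cells` and `text` attributes; the helper names `table_rows` and `read_docx_tables` are hypothetical.

```python
def table_rows(doc):
    """Return every table row in a Document-like object as a list of cell strings."""
    rows = []
    for table in doc.tables:
        for row in table.rows:
            # cell.text joins the paragraphs inside the cell.
            rows.append([cell.text.strip() for cell in row.cells])
    return rows


def read_docx_tables(path):
    """Open the .docx file at `path` and return its table rows."""
    # Imported lazily so table_rows stays usable without python-docx.
    import docx

    return table_rows(docx.Document(path))
```

`read_docx_tables('test.docx')` would return a flat list of rows across all tables in the document; keeping one list per table is a small change if the table boundaries matter.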

3. Extracting web page content

    # pip install beautifulsoup4
    from urllib.request import Request, urlopen
    from bs4 import BeautifulSoup

    # Some sites reject the default urllib User-Agent, so send a browser-like one.
    req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
                  headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    # Parse the HTML.
    soup = BeautifulSoup(webpage, 'html.parser')
    # Format the parsed HTML.
    strhtm = soup.prettify()
    # Print the first 500 characters.
    print(strhtm[:500])
    # Extract the title and a meta tag value.
    print(soup.title.string)
    print(soup.find('meta', attrs={'property': 'og:description'}))
    # Extract the text of every anchor tag.
    for x in soup.find_all('a'):
        print(x.string)
    # Extract the text of every paragraph tag.
    for x in soup.find_all('p'):
        print(x.text)
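The anchor loop above prints only link text. When the link targets are needed as well, the `href` attribute has to be read too. The sketch below does this with the standard library's `html.parser` rather than BeautifulSoup, so it runs without extra packages; the class name `LinkCollector` and the inline `SAMPLE_HTML` are hypothetical and stand in for a downloaded page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []          # finished (href, text) pairs
        self._in_anchor = False  # True while inside an open <a> tag
        self._href = None        # href of the currently open <a>, if any
        self._text = []          # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.links.append((self._href, "".join(self._text).strip()))
            self._in_anchor = False


# Hypothetical inline HTML so the sketch runs without a network request.
SAMPLE_HTML = """
<html><head><title>Demo</title></head>
<body>
  <p>See the <a href="/docs">documentation</a> or the
  <a href="https://example.com/faq">FAQ</a>.</p>
</body></html>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
links = collector.links
for href, text in links:
    print(href, text)
```

With BeautifulSoup the equivalent information is available from each tag returned by `soup.find_all('a')`, via `x.get('href')`.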

4. Reading JSON data

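    # pip install requests
    import requests
    import json

    # Download the sample file and decode the JSON body into Python objects.
    r = requests.get("https://support.oneskyapp.com/hc/en-us/article_attachments/202761727/example_2.json")
    r.raise_for_status()  # Stop early on an HTTP error status.
    res = r.json()
    # Extract spec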
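Once `r.json()` has returned, the result is made of ordinary Python dicts and lists, so values are reached by key and index. The following is a minimal sketch of that navigation using the standard `json` module and a hypothetical inline payload (`SAMPLE` is made up for illustration and is not the content of the URL above); the helper `walk` is also hypothetical.

```python
import json

# Hypothetical payload for illustration; a real response will differ.
SAMPLE = '{"name": "demo", "tags": ["text", "nlp"], "owner": {"id": 7, "email": null}}'

# json.loads yields the same kind of structure that r.json() returns.
res = json.loads(SAMPLE)

name = res["name"]                    # top-level key
first_tag = res["tags"][0]            # list element by index
owner_id = res["owner"]["id"]         # nested key
email = res["owner"]["email"]         # JSON null becomes None
# .get() avoids a KeyError when a key may be absent.
homepage = res.get("homepage", "n/a")


def walk(obj, prefix=""):
    """Yield (path, value) for every leaf value in nested dicts and lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from walk(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from walk(value, f"{prefix}[{index}]")
    else:
        yield prefix, obj


leaves = list(walk(res))
for path, value in leaves:
    print(path, "=", value)
```

Listing the leaves this way is a quick way to see the shape of an unfamiliar response before writing the key lookups for the fields that matter.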