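Automatic text summarization attracted attention as early as the 1950s. In the late 1950s, Hans Peter Luhn published a research paper titled "The Automatic Creation of Literature Abstracts", which used features such as word frequency and phrase frequency to extract important sentences from a text for summarization.
Another important study was carried out by Harold P. Edmundson in the late 1960s. It used cues such as the presence of cue words, words appearing in the title, and the position of a sentence to extract significant sentences for summarization. Since then, many important and exciting studies have been published to address the challenge of automatic text summarization.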
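Import the required libraries, read the data, and split the articles into sentences:

    import re

    import nltk
    import numpy as np
    import pandas as pd
    from nltk.tokenize import sent_tokenize

    nltk.download('punkt')  # one-time download of the sentence tokenizer data

    # Load the articles; the text is in the 'article_text' column
    df = pd.read_csv("tennis_articles_v4.csv")
    df.head()

    # Tokenize each article into sentences, then flatten into a single list
    sentences = []
    for s in df['article_text']:
        sentences.append(sent_tokenize(s))
    sentences = [y for x in sentences for y in x]  # flatten list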
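GloVe word embeddings are vector representations of words. These embeddings will be used to create vectors for our sentences. We could also use bag-of-words or TF-IDF to create features for the sentences, but these methods ignore the order of the words (and the number of features is usually quite large). We will use the pre-trained Wikipedia 2014 + Gigaword 5 GloVe vectors; the download is 822 MB.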
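To illustrate the point about bag-of-words features, here is a minimal sketch on two made-up sentences. The helper `bag_of_words` and the toy sentences are assumptions for illustration only and are not part of the tutorial's pipeline. It shows that reordering the words leaves the count vector unchanged, and that the vector has one entry per distinct word.

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Return a count vector for `sentence` over a fixed, ordered vocabulary."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

# Two toy sentences with the same words in a different order (made-up examples)
a = "federer beat nadal"
b = "nadal beat federer"

# One feature per distinct word, so the vector grows with the vocabulary
vocabulary = sorted(set(a.split()) | set(b.split()))

vec_a = bag_of_words(a, vocabulary)
vec_b = bag_of_words(b, vocabulary)

print(vocabulary)
print(vec_a, vec_b)
print(vec_a == vec_b)  # True: the two sentences are indistinguishable
```

On a real corpus the vocabulary runs to many thousands of words, whereas the GloVe vectors used here have a fixed 100 dimensions. Note that averaging word vectors into a sentence vector is also insensitive to word order, so the practical gain is mainly the compact, dense representation.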
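Download and unzip the vectors (the leading ! runs a shell command from a notebook cell), then load the 100-dimensional vectors into a dictionary:

    !wget http://nlp.stanford.edu/data/glove.6B.zip
    !unzip glove*.zip

    # Extract word vectors: each line holds a word followed by its coefficients
    word_embeddings = {}
    with open('glove.6B.100d.txt', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            word_embeddings[word] = coefs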
Clean the sentences before building vectors from them:

    # Replace punctuation, digits, and special characters with spaces
    # (regex=True is needed because recent pandas versions treat the pattern literally by default)
    clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)

    # Convert to lowercase
    clean_sentences = [s.lower() for s in clean_sentences]

    nltk.download('stopwords')
    from nltk.corpus import stopwords
    stop_words = stopwords.words('english')

    # Remove stopwords from a list of tokens and rejoin the rest into a string
    def remove_stopwords(sen):
        sen_new = " ".join([i for i in sen if i not in stop_words])
        return sen_new

    # Remove stopwords from the sentences
    clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
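Create a vector for each sentence by averaging the vectors of its words:

    sentence_vectors = []
    for i in clean_sentences:
        if len(i) != 0:
            # Sum the word vectors (zeros for unknown words) and divide by the word count;
            # the small constant in the denominator guards against division by zero
            v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()]) / (len(i.split()) + 0.001)
        else:
            # Nothing left after cleaning: fall back to a zero vector
            v = np.zeros((100,))
        sentence_vectors.append(v)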
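The averaging step can be checked by hand on a tiny case. The sketch below is an illustration under stated assumptions: `sentence_vector` is a hypothetical helper, the 3-dimensional `toy_embeddings` are made up, and it takes an exact mean with `np.mean` rather than dividing by the word count plus 0.001 as the code above does.

```python
import numpy as np

def sentence_vector(sentence, embeddings, dim):
    """Average the vectors of the words in `sentence`; unknown words count as zeros."""
    words = sentence.split()
    if not words:
        return np.zeros(dim)
    vectors = [embeddings.get(w, np.zeros(dim)) for w in words]
    return np.mean(vectors, axis=0)

# Hypothetical 3-dimensional embeddings, for illustration only
toy_embeddings = {
    "tennis": np.array([1.0, 0.0, 2.0]),
    "match": np.array([3.0, 2.0, 0.0]),
}

v = sentence_vector("tennis match", toy_embeddings, dim=3)
print(v)  # [2. 1. 1.]
```

As in the tutorial's code, a word missing from the embeddings still counts toward the denominator, so sentences with many unknown words are pulled toward the zero vector.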

Build the similarity matrix from the pairwise cosine similarities of the sentence vectors:

    from sklearn.metrics.pairwise import cosine_similarity

    # Similarity matrix: entry [i][j] is the cosine similarity of sentences i and j
    sim_mat = np.zeros([len(sentences), len(sentences)])
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1, 100), sentence_vectors[j].reshape(1, 100))[0, 0]
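The nested loop calls the similarity function once per sentence pair, which gets slow as the number of sentences grows. One possible alternative, sketched here with NumPy only, computes the whole matrix in a single matrix product. `cosine_similarity_matrix` is a hypothetical helper and the 2-dimensional `toy_vectors` are made up for illustration.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities between rows, with the diagonal set to 0."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero rows as zeros instead of dividing by 0
    unit = X / norms         # scale every row to unit length
    sim = unit @ unit.T      # dot products of unit vectors are cosine similarities
    np.fill_diagonal(sim, 0.0)  # a sentence is not compared with itself
    return sim

# Made-up vectors: two orthogonal, one in between, and one all-zero row
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
sim = cosine_similarity_matrix(toy_vectors)
print(np.round(sim, 3))
```

With the tutorial's data this would be called as `cosine_similarity_matrix(sentence_vectors)`. scikit-learn's `cosine_similarity` also accepts a 2-D array and returns the full pairwise matrix in one call, after which the diagonal can be zeroed in the same way.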
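Convert the similarity matrix into a graph and apply the PageRank algorithm to score the sentences:

    import networkx as nx

    # Nodes are sentences; edge weights are the similarity scores
    nx_graph = nx.from_numpy_array(sim_mat)
    scores = nx.pagerank(nx_graph)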
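To make the ranking step less of a black box, the following is a rough sketch of weighted PageRank by power iteration. It is not networkx's implementation: `pagerank_power_iteration` is a hypothetical helper, the tolerance and iteration limit are arbitrary choices, and the 3-node `toy_graph` is made up. The damping factor 0.85 matches the default `alpha` of `nx.pagerank`.

```python
import numpy as np

def pagerank_power_iteration(weights, damping=0.85, tol=1e-8, max_iter=200):
    """Score the nodes of a weighted graph by power iteration on its adjacency matrix."""
    W = np.asarray(weights, dtype=float)
    n = W.shape[0]
    out_weight = W.sum(axis=1, keepdims=True)
    safe = np.where(out_weight > 0, out_weight, 1.0)
    # Row-normalise into transition probabilities; nodes without edges jump uniformly
    P = np.where(out_weight > 0, W / safe, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # With probability `damping` follow an edge, otherwise jump to a random node
        new_scores = (1.0 - damping) / n + damping * (P.T @ scores)
        if np.abs(new_scores - scores).sum() < tol:
            return new_scores
        scores = new_scores
    return scores

# Made-up graph: node 0 is similar to nodes 1 and 2, which are unrelated to each other
toy_graph = [[0.0, 1.0, 1.0],
             [1.0, 0.0, 0.0],
             [1.0, 0.0, 0.0]]
toy_scores = pagerank_power_iteration(toy_graph)
print(np.round(toy_scores, 3))
```

The node connected to both others receives the highest score, which mirrors how a sentence similar to many other sentences rises to the top of the summary.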
Summary extraction: rank the sentences by score and print the top 10.

    # Rank sentences by PageRank score, highest first
    ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

    # Print the top 10 sentences as the summary
    for i in range(10):
        print(ranked_sentences[i][1])
- When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.
- Major players feel that a big event in late November combined with one in January before the Australian Open will mean too much tennis and too little rest.
- Speaking at the Swiss Indoors tournament where he will play in Sundays final against Romanian qualifier Marius Copil, the world number three said that given the impossibly short time frame to make a decision, he opted out of any commitment.
- "I felt like the best weeks that I had to get to know players when I was playing were the Fed Cup weeks or the Olympic weeks, not necessarily during the tournaments.
- Currently in ninth place, Nishikori with a win could move to within 125 points of the cut for the eight-man event in London next month.
- He used his first break point to close out the first set before going up 3-0 in the second and wrapping up the win on his first match point.
- The Spaniard broke Anderson twice in the second but didn't get another chance on the South African's serve in the final set.
- "We also had the impression that at this stage it might be better to play matches than to train.
- The competition is set to feature 18 countries in the November 18-24 finals in Madrid next year, and will replace the classic home-and-away ties played four times per year for decades.
- Federer said earlier this month in Shanghai in that his chances of playing the Davis Cup were all but non-existent.
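The summary above lists sentences by score, so they appear out of their original order, and a fixed loop over 10 items would raise an IndexError on an input with fewer than 10 sentences. A possible variant, sketched here with a hypothetical helper `top_sentences_in_order` and made-up toy data, picks the top n sentences and returns them in their original order.

```python
def top_sentences_in_order(scores, sentences, n):
    """Pick the n highest-scoring sentences, returned in their original order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:n])  # slicing tolerates n > len(sentences)
    return [sentences[i] for i in chosen]

# Toy scores keyed by sentence index, mirroring the dict returned by nx.pagerank
toy_sentences = ["First point.", "Second point.", "Third point.", "Fourth point."]
toy_scores = {0: 0.10, 1: 0.40, 2: 0.20, 3: 0.30}

for sentence in top_sentences_in_order(toy_scores, toy_sentences, n=2):
    print(sentence)
```

With the tutorial's variables this would be called as `top_sentences_in_order(scores, sentences, n=10)`. Whether original order reads better than score order depends on the use case; for a summary drawn from several articles, score order may be just as reasonable.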
