A 2018 article, "Using deep neural networks to hunt malicious TLS certificates" (https://techxplore.com/news/2018-10-deep-neural-networks-malicious-tls.html), describes using an LSTM to classify malicious certificates with roughly 94% accuracy. The introduction follows.
Moreover, encryption can give online users a false sense of security, as many web browsers display a green lock symbol when the connection to a website is encrypted, even when these websites are actually executing phishing attacks. To address these challenges, researchers are exploring new ways of detecting and responding to malicious online traffic.
"We are seeing an increase in the sophistication of phishing attacks over the last 12 months," Alejandro Correa Bahnsen, one of the researchers who carried out the study, told TechXplore. "In particular, attackers started using web certificates to make end users believe that they are entering a secure website."
As there is currently no reliable way to detect malicious TLS certificates in the wild, the researchers developed a new method to identify the malicious use of web certificates, using deep neural networks. Essentially, their system uses the content of TLS certificates to distinguish legitimate certificates from malicious ones.
"The use of web certificates by attackers is increasing the efficiency of their attacks, but at the same time, it leaves more traces of their actions," Bahnsen said. "With these additional data points, we created a deep neural network to find hidden malicious patterns in web certificates and use them to predict the legitimacy of a web site."
Bahnsen and his colleagues evaluated their new method and compared it to an existing model, namely Splunk's support vector machines (SVM) algorithm. Their deep neural network used the text information contained in the certificate more effectively than SVM, identifying malware certificates with an accuracy of 94.87 percent (7 percent more than SVM) and phishing certificates with an accuracy of 88.64 percent (5 percent more than SVM).
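To make the idea concrete, here is a minimal character-level LSTM sketch over certificate text (e.g. the subject/issuer strings). It is illustrative only, not the authors' architecture; the maximum length, vocabulary size, and layer sizes are assumptions.

# Minimal character-level LSTM sketch for classifying certificate text
# (illustrative only; not the exact architecture from the paper).
import numpy as np
from tensorflow import keras

MAX_LEN = 200   # assumed maximum length of the subject+issuer string
VOCAB = 128     # assume printable ASCII; other characters map to 0

def encode(text: str) -> np.ndarray:
    """Map each character to an integer id and pad/truncate to MAX_LEN."""
    ids = [ord(c) if ord(c) < VOCAB else 0 for c in text[:MAX_LEN]]
    return np.array(ids + [0] * (MAX_LEN - len(ids)))

model = keras.Sequential([
    keras.layers.Embedding(input_dim=VOCAB, output_dim=32),
    keras.layers.LSTM(64),
    keras.layers.Dense(1, activation="sigmoid"),   # 1 = malicious, 0 = legitimate
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# X: encoded certificate strings, y: labels from a blacklist / known-good feed.
# model.fit(X, y, epochs=5, validation_split=0.1)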
Paper: http://delivery.acm.org/10.1145/3280000/3270105/p64-torroledo.pdf?ip=103.218.216.118&id=3270105&acc=ACTIVE%20SERVICE&key=5A3314F2D74B117C%2E5A3314F2D74B117C%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&__acm__=1545643147_bc7498b8c52551013f7810ed29a5d731
So does that mean Splunk also uses SVM to detect malicious encrypted traffic?
Related work discussed in the paper:
Traditional malware detection has been done either by manual methods or by analyzing the traffic payload using expert rules [30]. Unfortunately, those traditional methods cannot work with encrypted content. Recent work has focused on detecting malicious encrypted traffic by analyzing network connections in real time. This is done by investigating the encrypted malware communication with C2 servers, identifying the destination of such communication and then a DNS sinkhole is created to redirect the malware communication away from the C2 servers [27]. This represents a reactive approach because it must allow the malware to infect, propagate and execute its harmful action before it can be stopped. Furthermore, this approach needs to decrypt the communication in order to perform analysis of the malware’s content [34]. Another approach is based on certificate and IP address pivoting to keep track of threat actor infrastructure. Classification strategy for this approach is done by the use of internet scanning and blacklisting of IP addresses and certificates, so when a new connection is coming from any blacklisted IP or uses a known malicious certificate, the connection is classified as malicious [3, 29]. As machine learning starts to become a more popular technique for encrypted traffic analysis, other work has shifted focus to connection metadata analysis. These approaches can predict when a connection is potentially harmful and keep track of threat actor infrastructure [3, 4]. Most recent work avoids the pivoting and starts with a focus only on certificates by looking at digital certificates data. For example, researchers from the security company Splunk were able to achieve a 91% accuracy by classifying certificates used in malware activities by using a support vector machines (SVM) algorithm [32].
[3] Blake Anderson and David McGrew. 2016. Identifying Encrypted Malware Traffic with Contextual Flow Data. In Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security (AISec ’16). ACM, New York, NY, USA, 35–46. https://doi.org/10.1145/2996758.2996768
[4] Blake Anderson and David McGrew. 2017. Machine Learning for Encrypted Malware Traffic Classification: Accounting for Noisy Labels and Non-Stationarity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’17). ACM, New York, NY, USA, 1723–1732. https://doi.org/10.1145/3097983.3098163
[32] Dave Herrald and Ryan Kovar. 2018. The "Hidden Empires" of Malware. Retrieved June 18, 2018 from https://www.sans.org/summit-archives/file/summit-archive-1517253771.pdf
The paper also discusses phishing-website detection; worth reviewing if we work on that later.
The paper also notes that attackers commonly use self-signed certificates or other freely issued certificates because they are quick and cheap to generate. By using such certificates, however, attackers expose their intent and make themselves easier to detect, track, and blacklist.
The paper highlights the telltale traits of malware certificates: a certificate carrying less information is more suspicious. Attackers will not spend the time or money to purchase and validate a certificate, since that would cut into their revenue and expose their intent. Malware and phishing certificates are almost always missing several information fields, and when a certificate is self-signed and obtained for free, the validity period is also worth checking. Some information fields are repeated across malware certificates as well. One table in the paper makes this clear: the common name (CN) alone carries plenty of interesting information (see Table 4 below).
Table 4: Most common CN values found in certificates.

Malware                        Phishing
Domain Name        %           Domain Name             %
No CN              30.8%       incapsula.com           1.8%
example.com        8.5%        localhost               1.4%
localhost          6.0%        No CN                   1.1%
domain.com         4.6%        Parallels Panel         0.7%
www.example.com    1.1%        localhost.localdomain   0.5%
Certificate validation level by category:
Validation level by certificate category:

Category     DV      OV     EV      No Validation
Legitimate   32.6%   4.0%   7.7%    55.7%
Phishing     9.0%    0.6%   0.01%   90.0%
Malware      9.7%    0.0%   0.0%    91.0%
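Most of the "missing field / self-signed / validity window" red flags discussed above can be read directly out of a certificate. A minimal sketch using Python's cryptography package (the file path in the usage comment is a placeholder):

# Inspect a PEM certificate for the red flags discussed above:
# missing subject fields, self-signed, and the validity window.
from cryptography import x509
from cryptography.x509.oid import NameOID

def describe(pem_bytes: bytes) -> dict:
    cert = x509.load_pem_x509_certificate(pem_bytes)

    def has(name, oid):
        return bool(name.get_attributes_for_oid(oid))

    return {
        "self_signed":      cert.subject == cert.issuer,
        "has_common_name":  has(cert.subject, NameOID.COMMON_NAME),
        "has_organization": has(cert.subject, NameOID.ORGANIZATION_NAME),
        "has_locality":     has(cert.subject, NameOID.LOCALITY_NAME),
        "has_state":        has(cert.subject, NameOID.STATE_OR_PROVINCE_NAME),
        "validity_days":    (cert.not_valid_after - cert.not_valid_before).days,
    }

# with open("example.pem", "rb") as f:    # hypothetical file path
#     print(describe(f.read()))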
Supplementary background:
Mainstream SSL certificates come in three main types: DV SSL, OV SSL, and EV SSL.
DV SSL
A DV SSL certificate is a basic (Class 1) certificate that only verifies ownership of the domain. It can be issued within about ten minutes and provides transport encryption, but it cannot prove the real-world identity of the website to users.
The free certificates currently on the market are all of this type: they encrypt the traffic but do not verify the identity of the person or organization behind the certificate.
OV SSL
OV SSL provides encryption plus a strict identity check of the applicant, giving a trusted proof of identity.
The difference from DV SSL is that OV SSL vets the individual or organization, so the other party's identity can be confirmed and the assurance level is higher.
For that reason, OV certificates are paid.
EV SSL
EV (Extended Validation) is the most secure and most strictly vetted type. EV SSL certificates follow a globally unified, strict identity-validation standard and are the highest-assurance (Class 4) SSL certificates in the industry.
They are used by finance and securities firms, banks, third-party payment providers, online stores, and other sites that emphasize security and a trustworthy corporate image and that handle payments, customer privacy data, and account credentials.
EV has the strictest validation requirements and the highest issuance fees.
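For reference, the DV/OV/EV level can often be read from the certificatePolicies extension. The sketch below uses the CA/Browser Forum baseline policy OIDs (2.23.140.1.2.1 = DV, 2.23.140.1.2.2 = OV, 2.23.140.1.1 = EV); many CAs also use their own EV policy OIDs, so treat this as an approximation.

# Approximate the validation level (DV/OV/EV) from certificatePolicies.
from cryptography import x509
from cryptography.x509.oid import ExtensionOID

CABF_OIDS = {
    "2.23.140.1.2.1": "DV",   # domain validated
    "2.23.140.1.2.2": "OV",   # organization validated
    "2.23.140.1.1":   "EV",   # extended validation
}

def validation_level(cert: x509.Certificate) -> str:
    try:
        policies = cert.extensions.get_extension_for_oid(
            ExtensionOID.CERTIFICATE_POLICIES).value
    except x509.ExtensionNotFound:
        return "No Validation"
    for policy in policies:
        level = CABF_OIDS.get(policy.policy_identifier.dotted_string)
        if level:
            return level
    return "No Validation"   # no CA/Browser Forum policy OID found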
Common certificate authorities:
Symantec is a leading provider of SSL/TLS certificates.
China Financial Certification Authority (CFCA) issues globally trusted SSL certificates.
GeoTrust is the world's second-largest digital certificate authority.
Features extracted by the paper's model:
Feature Name | Description | Category
SubjectCommonNameIp | Indicates if CN is an IP address instead of a domain | Boolean
Is_extended_validated | Indicates if certificate is extended validated | Boolean
Is_organization_validated | Indicates if certificate is organization validated | Boolean
Is_domian_validated | Indicates if certificate is domain validated | Boolean
SubjectHasOrganization | Indicates if subject principal has O field | Boolean
IssuerHasOrganization | Indicates if issuer principal has O field | Boolean
SubjectHasCompany | Indicates if subject principal has CO field | Boolean
IssuerHasCompany | Indicates if issuer principal has CO field | Boolean
SubjectHasState | Indicates if subject principal has ST field | Boolean
IssuerHasState | Indicates if issuer principal has ST field | Boolean
SubjectHasLocation | Indicates if subject principal has L field | Boolean
IssuerHasLocation | Indicates if issuer principal has L field | Boolean
Subject_onlyCN | Indicates if subject principal has only CN field | Boolean
Subject_is_com | Indicates if subject CN is a ".com" domain | Boolean
Issuer_is_com | Indicates if issuer CN is a ".com" domain | Boolean
HasSubjectCommonName | Indicates if CN is present in subject principal | Boolean
HasIssuerCommonName | Indicates if CN is present in issuer principal | Boolean
Subject_eq_Issuer | Indicates if Subject Principal = Issuer Principal | Boolean
SubjectElements | Number of details present in subject principal | Splunk
IssuerElements | Number of details present in issuer principal | Splunk
SubjectLength | Number of characters of whole subject principal string | Splunk
IssuerLength | Number of characters of whole issuer principal string | Splunk
ExtensionNumber | Number of extensions contained in the certificate | Splunk
Selfsigned | Indicates if certificate is self signed | SOC
Is_free | Indicates if the certificate is freely generated | SOC
DaysValidity | Calculated days between not-before and not-after dates | SOC
Ranking_C | Calculated ranking of domain based on domain ranking | SOC
SubjectCommonName | Calculated character entropy in the subject CN text
Euclidian_Subject_Subjects | Calculated euclidean distance of subject among all subjects | Text
Euclidian_Subject_English | Calculated euclidean distance of subject characters among English characters | Text
Euclidian_Issuer_Issuers | Calculated euclidean distance of issuer among all issuers | Text
Euclidian_Issuer_English | Calculated euclidean distance of issuer characters among English characters | Text
Ks_stats_Subject_Subjects | Kolmogorov-Smirnov statistic for subject in subjects | Text
Ks_stats_Subject_English | Kolmogorov-Smirnov statistic for subject in English characters | Text
Ks_stats_Issuer_Issuers | Kolmogorov-Smirnov statistic for issuers in issuers | Text
Ks_stats_Issuer_English | Kolmogorov-Smirnov statistic for issuer in English characters | Text
Kl_dist_Subject_Subjects | Kullback-Leibler divergence for subject in subjects | Text
Kl_dist_Subject_English | Kullback-Leibler divergence for subject in English characters | Text
Kl_dist_Issuer_Issuers | Kullback-Leibler divergence for issuer in issuers | Text
Kl_dist_Issuer_English | Kullback-Leibler divergence for issuer in English characters | Text
That is quite a lot of features.
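The "Text" category features are character-distribution statistics. A minimal sketch of two of them, Shannon entropy and a smoothed Kullback-Leibler divergence against a reference character distribution; the reference distribution below is only a placeholder assumption.

# Character-distribution statistics of the kind listed under the "Text" category:
# Shannon entropy of a string and KL divergence against a reference distribution.
import math
from collections import Counter

def char_distribution(text: str) -> dict:
    counts = Counter(text)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def shannon_entropy(text: str) -> float:
    return -sum(p * math.log2(p) for p in char_distribution(text).values())

def kl_divergence(text: str, reference: dict, eps: float = 1e-6) -> float:
    """KL(P || Q) with simple additive smoothing over the union alphabet."""
    p = char_distribution(text)
    alphabet = set(p) | set(reference)
    return sum(p.get(c, eps) * math.log2(p.get(c, eps) / reference.get(c, eps))
               for c in alphabet)

# Placeholder reference distribution (e.g. character frequencies of legitimate
# subject strings or of English text); this string is an assumption.
reference = char_distribution("example legitimate subject common name .com")
print(shannon_entropy("paypa1-secure-login.xyz"))
print(kl_divergence("paypa1-secure-login.xyz", reference))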
Samples and experimental data:
To train our classification models, a dataset of legitimate, phishing and malware certificates is created. The phishing certificates come from Vaderetro, an internal feed that gave us confirmed phishing certificates. We also extracted malware certificates from the abuse.ch project and censys.io, which gave us blacklisted certificates and PEM files. Finally, legitimate certificates came from the Alexa top one million ranking, which provided the corresponding website certificates. Our dataset has a total of 5,000 phishing certificates, 3,000 malware certificates and 1,000,000 legitimate certificates. This feels more complete than Splunk's SVM work described below, with a much larger sample set.
https://www.sans.org/cyber-security-summit/archives/file/summit-archive-1517253771.pdf describes malicious SSL detection with Splunk. It is essentially still detection based on the SSL certificate itself, as its feature list shows.
A site for looking up SSL certificate details:
https://censys.io/certificates?q=ee5efc7223434aee0547df8914873463038cb93d
SSL dataset: https://opendata.rapid7.com/sonar.ssl/ (Internet-wide SSL scan data)
October 30, 2013 – Present
Raw size:
  Entire data set: 315 GB compressed (as of 02JAN2017)
  Weekly: ~1.5 - 2.0 GB compressed
Entire data set indexed in Splunk: ~1.2 TB
Scans the entire Internet (TCP/443 only)
Comprised of:
  Observed certificates
  Observed IP address / certificate pairs
  Names (FQDNs)
  Endpoints
https://sslbl.abuse.ch/blacklist/sslblacklist.csv lists the SHA1 fingerprints of malicious SSL certificates detected so far. With this blacklist, labeling the data for classification becomes straightforward. Splunk extracts the following features:
Features
Number of certificate extensions
Number of Issuer elements
Number of Subject elements
Length of Extensions
Length of Issuer
Length of Subject
Shannon Entropy of Subject Common Name
The Splunk query used:
index=*blcertdetails
| spath
| eval sha1=coalesce(sha1, hash)
| lookup sslblacklist.csv sha1
| eval blacklist=case(isnull(reason), "False", true(), "True")
| spath input=_raw output=extlist path="extensions"
| eval extlist=replace(extlist,"[\{\}]", "")
| eval extlen=len(extlist)
| makemv delim="\", \"" extlist
| eval extcount=mvcount(extlist)
| spath input=_raw output=isslist path="issuer"
| eval isslist=replace(isslist,"[\{\}]", "")
| eval isslen=len(isslist)
| makemv delim="\", \"" isslist
| eval isscount=mvcount(isslist)
| spath input=_raw output=sublist path="subject"
| eval sublist=replace(sublist,"[\{\}]", "")
| eval sublen=len(sublist)
| makemv delim="\", \"" sublist
| eval subcount=mvcount(sublist)
| `ut_shannon(subject.CN)`
| fillnull value=0 ut_shannon
| eval subcnshannon=ut_shannon
| table sha1 blacklist reason extcount extlen isscount isslen subcount sublen subcnshannon
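The `lookup sslblacklist.csv sha1` step above is just a join on the certificate's SHA1 fingerprint. The same labeling can be sketched in Python; the abuse.ch CSV layout assumed here (comment lines starting with "#", columns Listingdate, SHA1, Listingreason) should be verified against the current file.

# Label certificates as malicious if their SHA1 fingerprint appears on the
# abuse.ch SSL blacklist (the Splunk lookup above performs the same join).
import csv
import urllib.request
from cryptography import x509
from cryptography.hazmat.primitives import hashes

SSLBL_URL = "https://sslbl.abuse.ch/blacklist/sslblacklist.csv"

def load_blacklist() -> dict:
    """Return {sha1_fingerprint: listing_reason}; assumes '#' comment lines
    and a Listingdate,SHA1,Listingreason column order."""
    blacklist = {}
    with urllib.request.urlopen(SSLBL_URL) as resp:
        lines = resp.read().decode("utf-8", "replace").splitlines()
    for row in csv.reader(l for l in lines if l and not l.startswith("#")):
        if len(row) >= 3:
            blacklist[row[1].lower()] = row[2]
    return blacklist

def label(cert: x509.Certificate, blacklist: dict) -> tuple:
    sha1 = cert.fingerprint(hashes.SHA1()).hex()
    reason = blacklist.get(sha1)
    return ("malicious", reason) if reason else ("unlabeled", None)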
Models:
Categorical prediction results:

Algorithm                           Accuracy   FP Rate
Logistic Regression                 0.75       24.90%
Support Vector Machine (SVM)        0.91       4.90%
Random Forest Classifier            0.91       8.10%
Gaussian Naive Bayes (GaussianNB)   0.71       18.40%
Decision Tree Classifier            0.91       9.80%
So SVM still comes out ahead: the same accuracy as Random Forest, but with the lowest false-positive rate.
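A comparison like the table above is easy to reproduce once a feature matrix exists. A hedged scikit-learn sketch, where X and y stand in for the certificate feature matrix and blacklist-derived labels (both assumptions here):

# Compare several classifiers on a certificate feature matrix, as in the
# Splunk experiment above (X and y are placeholders for features and labels).
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM":                 SVC(),
    "Random Forest":       RandomForestClassifier(n_estimators=100),
    "GaussianNB":          GaussianNB(),
    "Decision Tree":       DecisionTreeClassifier(),
}

def compare(X, y):
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
        print(f"{name:20s} accuracy = {scores.mean():.2f}")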
Now for the Cisco paper:
Machine Learning for Encrypted Malware Traffic Classification: Accounting for Noisy Labels and Non-Stationarity. Link: http://delivery.acm.org/10.1145/3100000/3098163/p1723-anderson.pdf?ip=103.218.216.118&id=3098163&acc=ACTIVE%20SERVICE&key=5A3314F2D74B117C%2E5A3314F2D74B117C%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&__acm__=1545706495_c9e1206db619ebde472f6a6238fd7c22
machine learning on the encrypted network session's metadata is a natural solution. While not applied directly to detecting threats in encrypted traffic, this basic formula of machine learning and network metadata has been well-researched [6, 24, 31]. Unfortunately, these solutions have been slow to materialize as viable methods for real-world threat detection, and some critics have rightfully called into question the applicability of machine learning for this problem domain [25, 37]. In other words, ML-based detection of encrypted traffic has struggled to reach industrial deployment, because:
Suitable false positive rates, while still maintaining high true positive rates on novel threats, has been difficult to achieve. In this paper, we highlight two primary reasons why this is the case: inaccurate ground truth and non-stationarity in network data. The most straightforward method to acquire labeled data for training is to use a sandbox environment to run malware and collect the sample's associated packet capture files for positively-labeled, malicious data, and to monitor a network and collect all connections for negatively-labeled, benign data. For the benign case, even after filtering the dataset using an IP blacklist [13], there will typically be a non-negligible percentage of network traffic that would be considered suspicious. For the malicious case, malware samples often perform connectivity checks, or other inherently benign activities. It is nearly impossible to identify all of these cases, and this must be taken into account when using supervised learning. In short, the ground truth is inherently noisy: you cannot enumerate every attack scenario, and even if you run malicious samples in a sandbox and collect benign samples from a live network before training the model, a sizeable amount of normal traffic will still end up flagged as malicious.
Malware typically performs connectivity checks by visiting a standard website, e.g., https://www.google.com. In other words, malware will often visit well-known sites.
Features: "But, if additional features about the connection are included, such as the TLS handshake metadata, it becomes possible to distinguish these two cases because the TLS features provide information about the originating client." The TLS handshake metadata is especially useful.
Data collection: samples were gathered over twelve months, with malicious TLS flows obtained from a commercial malware sandbox: "Our analysis is based on millions of TLS encrypted sessions collected over 12 months from a commercial malware sandbox and two geographically distinct, large enterprise networks."
Detecting malware even when it is encrypted, slides: https://2018.bsidesbud.com/wp-content/uploads/2018/03/seba_garcia_frantisek_strasak.pdf They use public datasets for malicious SSL detection, with XGBoost, Random Forest, SVM, and MLP as models.
Malware pcap datasets:
Dataset
  CTU-13 dataset (public)
    Malware and normal captures
    13 scenarios, 600 GB of pcap
    https://www.stratosphereips.org/datasets-ctu13/
  MCFP dataset (public)
    Malware Capture Facility Project (Maria Jose Erquiaga)
    340 malware pcap captures
    https://stratosphereips.org/category/dataset.html
  Own normal dataset (public)
    3 days of access to secure sites (Alexa 1000)
    Google, Facebook, Twitter accounts
    https://stratosphereips.org/category/dataset.html
  Normal CTU dataset (almost public)
    Normal captures
    22 known and trusted people from the department of FEE CTU
https://www.stratosphereips.org/datasets-malware/ And the datasets really are downloadable!
Features:
Top 7 most discriminant features:
1. Certificate length of validity
2. Inbound and outbound packets
3. Validity of certificate during the capture
4. Duration
5. Number of domains in certificate (SAN DNS)
6. SSL/TLS version
7. Periodicity
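Three of these seven features come straight from the certificate. A sketch of features 1, 3, and 5, assuming a capture timestamp is available (the timestamp handling is simplified and the example date is an assumption):

# Certificate-side features from the "top 7" list above: length of validity,
# validity at capture time, and number of SAN DNS names.
from datetime import datetime
from cryptography import x509
from cryptography.x509.oid import ExtensionOID

def cert_features(cert: x509.Certificate, capture_time: datetime) -> dict:
    try:
        san = cert.extensions.get_extension_for_oid(
            ExtensionOID.SUBJECT_ALTERNATIVE_NAME).value
        san_dns_count = len(san.get_values_for_type(x509.DNSName))
    except x509.ExtensionNotFound:
        san_dns_count = 0
    return {
        "validity_days": (cert.not_valid_after - cert.not_valid_before).days,
        "valid_during_capture":
            cert.not_valid_before <= capture_time <= cert.not_valid_after,
        "san_dns_count": san_dns_count,
    }

# cert_features(cert, datetime(2018, 3, 1))   # capture timestamp is an assumption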
Results:
XGBoost:
  Cross-validation accuracy: 92.45%
  Testing accuracy: 94.33%
  False positive rate: 5.54%
  False negative rate: 10.11%
  Sensitivity: 89.89%
  F1 score: 46.96%

Random Forest:
  Cross-validation accuracy: 91.21%
  Testing accuracy: 95.65%
  False positive rate: 4.05%
  False negative rate: 14.82%
  Sensitivity: 85.18%
  F1 score: 52.24%
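The metrics reported above (FPR, FNR, sensitivity, F1) all fall out of the confusion matrix. A sketch with XGBoost as the classifier; X_train/X_test/y_train/y_test are placeholders for the flow/certificate features and labels:

# Derive FPR, FNR, sensitivity and F1 from a confusion matrix, as reported in
# the slides above (X_train/X_test/y_train/y_test are placeholders).
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, f1_score

def evaluate(X_train, y_train, X_test, y_test):
    clf = XGBClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    print("FPR:        ", fp / (fp + tn))
    print("FNR:        ", fn / (fn + tp))
    print("Sensitivity:", tp / (tp + fn))
    print("F1 score:   ", f1_score(y_test, y_pred))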
Key finding:
Malware and certificates:
  Certificates used by malware that appear in the Alexa top 1000: ~50%
  Certificates used by normal traffic that appear in the Alexa top 1000: ~30%
"The certificates used by Malware are mostly from normal sites!" Surprisingly, the certificates used by malware are largely certificates that also show up on Alexa-ranked sites.
Detecting Malignant TLS Servers Using Machine Learning Techniques https://arxiv.org/ftp/arxiv/papers/1705/1705.09044.pdf
Abstract:
TLS uses X.509 certificates for server authentication. An X.509 certificate is a complex document, and various innocent errors can occur when creating or using one. Moreover, many certificates belong to malicious websites; clients should reject those certificates and refuse to visit those web servers. Normally, when a client finds a suspicious certificate through traditional checks, it asks for human intervention, but looking at a certificate, most people cannot tell a malicious site from a benign one. Therefore, once traditional certificate validation fails, the authors use machine learning to let the web browser decide whether the server the certificate belongs to is malignant, i.e., whether the website should be visited or not. Even after a certificate is accepted in this first phase, the website may still turn out to be malicious, so in a second phase they download part of the website in a sandbox without decrypting it and observe the TLS-encrypted traffic (encrypted malicious data captured in the sandbox cannot harm the system). Because the traffic is encrypted after the handshake completes, traditional pattern-matching techniques cannot be used; instead they use flow features of the traffic together with the features from the first phase, combine them with the unencrypted TLS header information exchanged during the handshake, and feed all of this to a machine-learning classifier to decide whether the traffic is malicious. In short: first a decision tree judges whether the certificate is suspicious, then a Bayesian network judges whether the TLS traffic is malicious.
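The two-phase idea (a certificate classifier first, a traffic classifier only when the certificate is accepted) can be expressed as a simple cascade. A sketch using scikit-learn stand-ins: a decision tree for phase 1 and Gaussian naive Bayes in place of the paper's Bayesian network, both of which are assumptions, not the authors' implementation.

# Two-phase cascade: phase 1 judges the certificate, phase 2 only runs on the
# TLS flow features when the certificate is accepted. GaussianNB is used here
# as a simple stand-in for the paper's Bayesian network.
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

phase1 = DecisionTreeClassifier()   # assumed trained on certificate features
phase2 = GaussianNB()               # assumed trained on flow + TLS-header features

def classify(cert_features, flow_features):
    # (both models are assumed to have been fit beforehand)
    if phase1.predict([cert_features])[0] == 1:      # 1 = malignant certificate
        return "reject: bad certificate"
    if phase2.predict([flow_features])[0] == 1:      # 1 = malicious traffic
        return "reject: malicious traffic"
    return "accept"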
The paper also mentions:
According to [9], a subordinate CA of ANSSI issued an intermediate certificate that they installed on a network monitoring device, which enabled the device to act as a MITM of domains or websites that the certificate holder did not own or control. In early 2011, a hacker hacked the DigiNotar CA and issued certificates for *.google.com, *.skype.com and *.*.com, as well as few intermediate CA certificates carrying the names of well-known roots. The *.google.com certificate was used to launch a MITM attack against Gmail users in Iran. That is, the attackers were able to create both CA and leaf certificates through an existing CA. [16] describes this attack.
So attackers have compromised CA servers and issued rogue certificates (such as the *.google.com certificate used for a MITM attack against Gmail users in Iran) for use in their attacks... is this the same kind of incident as described above?
Features the paper extracts for malicious-traffic classification (fed into a Bayesian network):
1. Features of Classifier of Phase 1: The above features used in Phase 1 are also used in Phase 2. They are the reasons for the certificate failing the traditional certificate validation and whether the server certificate is self-signed.
2. Flow Metadata: Traditional flow data are the first set of additional features for the classifier. They are the number of inbound bytes, outbound bytes, inbound packets, outbound packets; the source and destination ports; and the total duration of the flow in seconds.
3. Packet Lengths and Packet Inter – Arrival Times: Minimum, Maximum, Mean and Standard Deviation of Packet Lengths and Minimum, Maximum, Mean and Standard Deviation of Packet Inter – Arrival Times during the duration of flow are taken as the second set of additional features for the classifier.
4. Unencrypted TLS Header Information exchanged during TLS Handshake:
4a. Critical extensions: Malicious servers rarely select TLS extensions. Legitimate servers select different TLS extensions; 0xff01 (renegotiation_info) and 0x000b (ec_point_formats) are the most common. Usually, 21 unique extensions are observed, most of them in legitimate traffic. A binary vector of length 21 was created, with true (1) if an extension is present and false (0) if it is absent.
4b. Weak ciphersuite: Approximately 90% of the malicious servers use one of the following ciphersuites: 0x000a (TLS_RSA_WITH_3DES_EDE_CBC_SHA), 0x0004 (TLS_RSA_WITH_RC4_128_MD5), 0x006b (TLS_DHE_RSA_WITH_AES_256_CBC_SHA256) and 0x0005 (TLS_RSA_WITH_RC4_128_SHA). TLS_RSA_WITH_RC4_MD5 and TLS_RSA_WITH_RC4_128_SHA are considered weak. A numeric value is assigned to the ciphersuite selected by the server, which helps identify malicious traffic.
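A sketch of how the phase-2 feature vector described above could be assembled. The concrete representation (dictionary keys, the extension universe, and the numeric ciphersuite encoding) is an assumption, not the paper's exact encoding.

# Assemble phase-2 features: flow metadata, packet-length / inter-arrival-time
# statistics, a binary vector of observed TLS extensions, and the selected
# ciphersuite code. The representation here is an assumption.
import statistics

# Placeholder universe of extension code points; the 21 extensions actually
# observed in the paper's data would go here.
KNOWN_EXTENSIONS = [0xff01, 0x000b]

def stats(values):
    """min, max, mean, population std-dev (zeros for an empty sequence)."""
    return [min(values), max(values), statistics.mean(values),
            statistics.pstdev(values)] if values else [0, 0, 0.0, 0.0]

def phase2_features(flow, pkt_lengths, pkt_iats, server_extensions, ciphersuite):
    ext_vector = [1 if e in server_extensions else 0 for e in KNOWN_EXTENSIONS]
    return ([flow["in_bytes"], flow["out_bytes"], flow["in_pkts"],
             flow["out_pkts"], flow["src_port"], flow["dst_port"],
             flow["duration_s"]]
            + stats(pkt_lengths) + stats(pkt_iats)
            + ext_vector + [ciphersuite])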
Possibly useful references:
[3] Sheffer, Y., Holz, R., Saint-Andre, P.: Summarizing Known Attacks on Transport Layer Security (TLS) and Datagram TLS (DTLS) (2015), RFC 7457
[10] Anderson, B., Paul, S., McGrew, D.: Deciphering Malware's use of TLS (without Decryption). In: arXiv:1607.01639v1 (2016)