Guff_9hys

这个屌丝很懒，什么也没留下！

热门标签

谷歌云GCP_gcp pricing

作者：Guff_9hys | 2024-08-22 05:39:35

踩

gcp pricing

感谢公司赞助了Google Cloud Platform(GCP) Coursera课程：https://www.coursera.org/，包括云基础设施，应用开发，数据湖和数据仓库相关知识。

其中谷歌云的实验操作平台是：https://www.qwiklabs.com/，获得的谷歌云Coursera认证（该认证包括Qwiklabs平台的实验）如下：

2020/3/26-2020/4/1	Essential Google Cloud Infrastructure: Core Services	Certificated
2020/4/2-2020/4/5	Essential Google Cloud Infrastructure: Foundation	Certificated
2020/4/6-2020/4/11	Essential Google Cloud Infrastructure: Core Services	Certificated
2020/4/12-2020/4/16	Elastic Google Cloud Infrastructure: Scaling and Automation	Certificated
2020/4/17-2020/4/21	Reliable Google Cloud Infrastructure: Design and Process	Certificated
2020/4/22-2020/4/26	Getting Started With Application Development	Certificated
2020/4/24-2020/5/10	Modernizing Data Lakes and Data Warehouses with GCP	Certificated
	Building Batch Data Pipelines on GCP

也推荐 John J. Geewax 写的《Google Cloud Platform in Action》这本书作为参考阅读

Google Kubernetes Engine (GKE)

Cloud Composer & Apache Airflow

Cloud Run & Cloud Functions & App Engine

Management Tool

Pricing

谷歌云首页：https://cloud.google.com/

首先，GCP是Google Cloud Platform，谷歌云平台的缩写，GCP主要包括 Compute，Storage，Big Data ，Machine Learning (AI) 四大类服务，其他还有Networking，Pricing，SDK，Management Tool，IoT，Mobile 等分类。

什么是云计算

云计算特点

按需自助服务
无处不在的网络访问
与位置无关的资源池
快速弹性
按使用交费

云计算分类

按照云计算的服务模式，大体可以分为：IaaS、PaaS、SaaS三层

IaaS: Infrastructure as a Servic

基础设施即服务，通过网络向用户提供IT基础设施能力的服务（计算，存储，网络等）。

PaaS: Platform as a Service

平台即服务，指的是在云计算基础设施之上，为用户提供应用软件部署和运行环境的服务。

SaaS: Software as a Service

软件即服务，是指基于网络提供软件服务的软件应用模式。

用盖房子打个比方：IaaS就好比只提供一片土地，用户买下之后，所有的工作还得用户自己去做，PaaS就好比在这片土地上给用户建好了楼，用户入住之前只需要自己装修一下，而SaaS不仅帮用户把楼建好，还装修好，用户买下即可拎包入住。

按照云计算的目标用户，分为公有云、私有云、混合云和行业云（专有云）

公有云：一般由云计算服务商构建，面向公众、企业提供公共服务，由云计算服务商运营
私有云：由企业自身构建，为内部使用的云服务
混合云：当企业既有私有云又采用公有云服务时，这两种云之间形成内外数据和应用的互动
行业云：由利益相关、业务相近的组织掌控和使用，例如某省各级政府机关和事业单位共同利用政务专有云进行日常办公及服务大众。

Region and Zone

地域与分区。每个地域下有不同的分区，同一地域内的网络延迟通常在5毫秒以下。为了容灾，可以把我们的应用分布在多个地域。

IAM

Identity and Access Management，即身份识别和访问管理。

它包括三个部分：

Who:
可以通过google account, google group, service account定义。

Can do what: 可以通过 IAM role 定义，它是一个 permissions 的集合。

有三种类型的角色:

Primitive role
Predefined role
Custom role: can only be defined in organization or project, but not in folders

On which resource

GCP资源架构：

polices can define in organization, folder, project, they are inherited in the hierarchy.

Projects are the main way you organise your gcp resources.

每个 Project 有：

Project ID: 不可变的 (assigned by you)
Project Name: 可变的 (assigned by you)
Project number: 不可变的 (assigned by GCP)

Policies defined in organisation level can be inherited to all children.

GCP use least privilege in managing any kind of compute infrastructure.

The policies implemented at a higher level in this hierarchy can’t take away access that’s granted at a lower level
Eg: if you grant Editor role to Organisation and Viewer role to the folder, then the folder is granted the Editor role.

Projects can have different owners and users - they are built separately and managed separately.

When using GCP, it handles most of the lower security layer, the upper layers remain the customer’s responsibility

VPC Network

Virtual Private Cloud: it connects your GCP resources to each other and to the internet.

In the example below, us-east1-b and us-east1-c are on the same subnet but in different zones

VPCs have routing tables, you can define firewall rules in terms of tags on compute engine.

VPC Peering: establish a peering relationship between projects
Shared VPC: you can use IAM to control

GCP服务

1. GCP四大类服务如下：

2. 有四种方式与 GCP 交互：

GCP console

https://cloud.google.com/console

Cloud Shell and Cloud SDK

包括: gcloud, gsutil (Cloud Storage), bq (BigQuery) 等。

如上图所示，点击用户头像旁的激活 Cloud Shell 图标，会在 web 控制台下方出现 shell 命令行。

可以点击“打开编辑器”：

点击“打开终端”按钮即可回到命令行界面。

本地的话，在https://cloud.google.com/sdk/docs/install下载官方Google Cloud SDK程序，Windows需要配置bin路径到PATH，其他系统也需要配置环境变量。

初始化SDK：gloud init

gcloud config list

gcloud info

gcloud compute instances list

gcloud components list

gcloud components update

gcloud auth list

export GOOGLE_APPLICATION_CREDENTIALS等。

APIs Explorer is an interface tool that let you easily try GCP APIs using a browser

https://developers.google.com/apis-explorer

Use libraries within your code
- Cloud Client Libraries: https://cloud.google.com/apis/docs/cloud-client-libraries
- Google API Client Libraries: https://developers.google.com/api-client-library
Cloud Console Mobile App

3. Cloud MarketPlace (Cloud Launcher)

可以在 GCP 上很快部署软件包，比如LAMP (Linux+Apache+MySQL+PHP) 应用。

搭建了 LAMP (Linux + Apache + MySQL + PHP) 的博客案例，最终效果图如下:

Computing

谷歌提供的云计算服务中，归类如下：

Compute Engine属于IaaS，Kubernetes Engine属于Hybrid，App Engine属于PaaS，Cloud Functions属于Serverless。

Google Kubernetes Engine (GKE)

容器编排，可以管理和扩展应用等。Pod 是 Kubernetes 中最小的可部署单元。

In GCP, node is VM running in Compute Engine. The smallest deployable unit in Kubernetes. It has 1 container often, but it could have multiple containers, where the containers
will share the networking and have the same disk storage volume.

Demo及常用命令可查看官方文档 Deploying a containerized web application: https://cloud.google.com/kubernetes-engine/docs/tutorials/hello-app

构建 (build) 和标记 (tag) Docker 映像：

docker build -t gcr.io/${PROJECT_ID}/hello-app:v1 .

运行 docker images 命令以验证构建是否成功：

docker images

本地运行容器（可选）

使用本地 Docker 引擎测试容器映像：

docker run --rm -p 8080:8080 gcr.io/${PROJECT_ID}/hello-app:v1

将 Docker 映像推送到 Container Registry

必须将容器映像上传到 Registry，以便 GKE 集群可以下载并运行该容器映像。在 Google Cloud 中，Container Registry 默认处于启用状态。

为您正在使用的 Google Cloud 项目启用 Container Registry API：
```
gcloud services enable containerregistry.googleapis.com
```
配置 Docker 命令行工具以向 Container Registry 进行身份验证：
```
gcloud auth configure-docker
```
将刚刚构建的 Docker 映像推送到 Container Registry：
```
docker push gcr.io/${PROJECT_ID}/hello-app:v1
```

创建 GKE 集群

创建名为 hello-cluster 的集群：

标准集群：

gcloud container clusters create hello-cluster

Autopilot 集群：

gcloud container clusters create-auto hello-cluster

创建 GKE 集群并进行运行状况检查需要几分钟的时间。
该命令运行完后，请运行以下命令以查看集群的三个工作器虚拟机实例：
```
gcloud compute instances list
```

将应用部署到 GKE

可以将构建的 Docker 映像部署到 GKE 集群。

为 hello-app Docker 映像创建 Kubernetes 部署。
```
kubectl create deployment hello-app --image=gcr.io/${PROJECT_ID}/hello-app:v1
```
以前老版本是 kubectl run

将部署副本的基准数量设置为 3。

kubectl scale deployment hello-app --replicas=3

为您的部署创建一个 HorizontalPodAutoscaler 资源。

kubectl autoscale deployment hello-app --cpu-percent=80 --min=1 --max=5

如需查看已创建的 Pod，请运行以下命令：

kubectl get pods

输出：


NAME                         READY   STATUS    RESTARTS   AGE
hello-app-784d7569bc-hgmpx   1/1     Running   0          10s
hello-app-784d7569bc-jfkz5   1/1     Running   0          10s
hello-app-784d7569bc-mnrrl   1/1     Running   0          15s

部署应用

使用 kubectl expose 命令为 hello-app 部署生成 Kubernetes 服务。

kubectl expose deployment hello-app --name=hello-app-service --type=LoadBalancer --port 80 --target-port 8080

此处，--port 标志指定在负载平衡器上配置的端口号，--target-port 标志指定hello-app容器正在侦听的端口号。

运行以下命令以获取 hello-app-service 的服务详情。
```
kubectl get service
```
将 EXTERNAL_IP 地址复制到剪贴板（例如：203.0.113.0）。

注意：预配负载平衡器可能需要几分钟的时间。在预配负载平衡器之前，您可能会看到 <pending> IP 地址。

现在，hello-app Pod 已通过 Kubernetes 服务公开发布到互联网，您可以打开新的浏览器标签页，然后导航到先前复制到剪贴板中的服务 IP 地址。您会看到一条 Hello, World! 消息以及一个 Hostname 字段。Hostname 对应于向浏览器传送 HTTP 请求的三个 hello-app Pod 中的一个。

部署新版本应用

在本部分中，您将通过构建新的 Docker 映像并将其部署到 GKE 集群，来将 hello-app 升级到新版本。

GKE 的滚动更新功能让您可以在不停机的情况下更新部署。在滚动更新期间，GKE 集群将逐步将现有 hello-app Pod 替换为包含新版本的 Docker 映像的 Pod。在更新期间，负载平衡器服务仅将流量路由到可用的 Pod。

返回到 Cloud Shell，现在您已在其中克隆了 hello 应用源代码和 Dockerfile。更新项目里的文件为新版本 2.0.0。

构建并标记新的 hello-app Docker 映像。

docker build -t gcr.io/${PROJECT_ID}/hello-app:v2 .

将映像推送到 Container Registry。

docker push gcr.io/${PROJECT_ID}/hello-app:v2

现在，您可以更新 hello-app Kubernetes 部署来使用新的 Docker 映像。

通过更新映像，对现有部署进行滚动更新：

kubectl set image deployment/hello-app hello-app=gcr.io/${PROJECT_ID}/hello-app:v2

运行 v1 映像的 Pod 停止运行后，系统会启动运行 v2 映像的新 Pod

watch kubectl get pods

输出：


NAME                        READY   STATUS    RESTARTS   AGE
hello-app-89dc45f48-5bzqp   1/1     Running   0          2m42s
hello-app-89dc45f48-scm66   1/1     Running   0          2m40s

在单独的标签页中，再次导航到 hello-app-service 外部 IP。您现在应该看到 Version 被设置为 2.0.0.。

清理

为避免因本教程中使用的资源导致您的 Google Cloud 帐号产生费用，请删除包含这些资源的项目，或者保留项目但删除各个资源。

删除 Service：此步骤将取消并释放为 Service 创建的 Cloud Load Balancer：
```
kubectl delete service hello-app-service
```
删除集群：此步骤将删除构成集群的资源，如计算实例、磁盘和网络资源：
```
gcloud container clusters delete hello-cluster
```

删除容器映像：此操作会删除推送到 Container Registry 的 Docker 映像。


 gcloud container images delete gcr.io/${PROJECT_ID}/hello-app:v1  --force-delete-tags --quiet
 gcloud container images delete gcr.io/${PROJECT_ID}/hello-app:v2  --force-delete-tags --quiet

以下是我的一个小试验：

在 VM instances里可以看到：

结果如下:

Cloud Storage

对象存储，有 unique key 可以访问对应对象。在 Cloud Storage 中，每个对象都有一个 URL，并且该 URL 不可变。

Cloud Storage 保留修改历史，存储对象历史，我们可以查看版本列表，还原或者删除。

Cloud Storage 提供生命周期管理，比如你可以删除 5 天以前的对象。

用途:

serving website content
storing data for archival and disaster recovery
distributing large data objects to your end users via direct download

For most case, IAM is sufficient, but if you need finer control, you can create ACLs (access control lists).

每个访问控制列表包括：

a user or group
a permission

Cloud Storage 有不同的存储类型: Multi-Regional, Regional, Nearline, Coldline

3 Ways to bring data into Cloud Storage:

Online Transfer
Storage Transfer Service
Transfer Appliance

Data & Analytics

Cloud SQL

RDBMS，目前支持 MySQL，PostgreSQL 和 SQL Server 关系型数据库。数据大小最大是 10 TB，如果数据量大于10 TB，建议选择 Cloud Spanner

Cloud Spanner

horizontally scalable RDBMS

什么时候使用?

A relational database that need strong transactional consistency (ACID)
Wide scale
Higher workload than Cloud SQL

Spanner vs Cloud SQL

Spanner 对 MySQL/PostgreSQL/SQL Server 不兼容

Spanner architecture

Nodes handle computation, each node serves up to 2 TB of storage
Storage is replicated across zones, compute and storage are separated
Replication is automatic

DataStore

NoSQL
- Flexible structure/relationship
No Ops
- No provisioning of instances
- Compute layer is abstracted away
Scalable
- Multi-regions access
- Sharding/replication automatic
每个项目只能有 1 个 Datastore

什么时候使用 Datastore
应用需要扩展
ACID 事务，eg: transferring funds

用例：产品目录 - 实时库存；User profiles - 手机应用；游戏存储状态。

什么时候不使用 Datastore

需要分析 (full SQL semantics)，最好使用 Big Query/Cloud Spanner
需要读写能力 (每秒10M+ read/writes)，最好使用 Bigtable
不需要 ACID时，最好使用 Bigtable
需要迁移比如MySQL时，最好使用 Cloud SQL
要求延迟性比较小，最好使用内存数据库，比如 Redis

Relational Database vs Datastore

Entities can be hierarchical

查询和索引

查询

retrieve entity from datastore
query methods
- programmatic
- web console
- GQL (Google Query Language)

索引

queries get results from indexes
index type
- Built-in: Allows single property queries
- Composite: use index.yaml

注意事项：避免过度使用index

solutions:
- 使用 index.yaml 缩小 index 范围
- 不需要索引时，不使用 index properties

数据一致性

Performance vs Accuracy

Strongly Consistent
- Parallel processes with orders guaranteed
- Use case: financial transaction
Eventually Consistent
- Parallel processes not with orders guaranteed
- 用例：人口普查（顺序不重要）

以下是 Entity 详情示例：

可以看到程序返回的JSON结构是：

结果如下:

Bigtable

NoSQL. 读写都支持高吞吐性. 低延迟。Google Analytics, Gmail 等主要产品都使用了Bigtable。

Bigtable的层次结构，涉及实例，集群和节点，而每个实例的数据模型涉及表，行，列族和列限定符。

表的设计如图所示：

Row key is only indexed item.

It offers similar API as HBase，我们都知道 HBase 是在 Google Bigtable 2006年发表的论文里的设计后开源出来的

区别：

Bigtable can scale and manage fast and easily （Bigtable 能够更轻松地扩展到更大数量的节点，从而可以处理给定实例的更多整体吞吐量。HBase 的设计需要一个主节点来处理故障转移和其他管理操作，这意味着随着您添加越来越多的节点（成千上万个）来处理越来越多的请求，主节点将成为性能瓶颈）
Bigtable encrypts data in-flight and at rest
Bigtable can be controlled access with IAM

Bigtable infrastructure

Front-end server pool serves requests to nodes
Compute and Storage are separate, No data is stored on the node except for metadata to direct requests to the correct tablet
Tables are shards into tablets. They are stored on Colossus, google’s filesystem. as storage is separate from compute node,
replication and recovery of node data is very fast, as only metadata/pointers need to be updated
Tablets are a way of referencing chunks of data that live on a particular node. The cool thing about tablets is that they can be split, combined, and moved around to other nodes to keep access to data spread evenly across the available capacity.

首次开始写入数据时，Bigtable集群可能会将大多数数据放在单个节点上。

启动时，Bigtable可能会将数据放在单个节点上。

随着更多 Tablet 在单个节点上积累，集群可能会将其中一些 Tablet 重新放置到另一个节点上，以更平衡的方式重新分配数据：

随着时间的推移写入的数据越来越多，某些 Tablet 的访问频率可能会比其他平板电脑更高。如下图所示，三个 Tablet 负责整个系统中所有读取查询的35％。

在这样的场景中，几个 hot Tablet 位于一个节点上，Bigtable 通过将一些访问频率较低的 Tablet 转移到其他容量更大的节点来重新平衡集群，以确保三个节点中的每个节点都能看到三分之一的总流量：

它也可能是一个单一的 Tablet 变得 too hot（它被写入或过于频繁地读取）。将 Tablet 原样移动到另一个节点并不能解决问题。相反，Bigtable的可 split 分裂这个 Tablet ，然后重新平衡：

最重要的事情是谨慎选择行键 rowkey，这样它们就不会将流量集中在一个地方。

上手练习：

界面操作：

Cloud Console 控制台左侧导航栏导航到Bigtable，创建实例

填写 Instance ID 等相关信息后：

以 Node.js 方式时，在编写一些代码以与Cloud Bigtable进行交互之前，您需要通过运行 npm install @google-cloud/bigtable@0.9.1 来安装客户端。

客户端安装后，您可以通过列出实例和集群来对其进行测试，如下所示：


const bigtable = require('@google-cloud/bigtable')({
  projectId: 'your-project-id'
});
 
const instance = bigtable.instance('test-instance');       
 
instance.createTable('todo', {                             
  families: ['completed']                                  
}).then((data) => 
  const table = data[0];
  console.log('Created table', table.id);
});

命令行操作：

install cbt in Google Cloud SDK


gcloud components update
gcloud components install cbt

set env variable

`1`	`echo -e "project=[PROJECT_ID]\ninstance=[INSTANCE_ID]">~/.cbtrc`

create table

`1`	`cbt createtable my-table`

list table

cbt ls

add column family

`1`	`cbt createfamily my-table cf1`

list column family

`1`	`cbt ls my-table`

add value to row1, column family cf1, column qualifier c1

`1`	`cbt set my-table r1 cf1:c1=testvalue`

read table

`1`	`cbt read my-table`

delete table

`1`	`cbt deletetable my-table`

BigQuery

数据仓库，接近实时的 PB 级数据库的分析

How BigQuery works

列式存储
不更新现有记录
无事务性

Structure

Dataset: contains tables/views
Table: collections of columns
Job: long running action/query

IAM

can control by project, dataset, view
cannot control at table level

命令行模式:

BigQuery案例

Find correlation between rain and bicycle rentals

How about joining the bicycle rentals data against weather data to learn whether there are fewer bicycle rentals on rainy days?

采用GCP提供的数据集:

数据导入成功后，在SQL输入框中写以下SQL:

WITH bicycle_rentals AS (

SELECT

COUNT(starttime) as num_trips,

EXTRACT(DATE from starttime) as trip_date

FROM `bigquery-public-data.new_york_citibike.citibike_trips`

GROUP BY trip_date

rainy_days AS

(

SELECT

date,

(MAX(prcp) > 5) AS rainy

FROM (

SELECT

wx.date AS date,

IF (wx.element = 'PRCP', wx.value/10, NULL) AS prcp

FROM

`bigquery-public-data.ghcn_d.ghcnd_2015` AS wx

WHERE

wx.id = 'USW00094728'

)

GROUP BY

date

)

SELECT

ROUND(AVG(bk.num_trips)) AS num_trips,

wx.rainy

FROM bicycle_rentals AS bk

JOIN rainy_days AS wx

ON wx.date = bk.trip_date

GROUP BY wx.rainy

执行结果是：

Dataproc

Fully managed: managed way to run Hadoop, Spark/Hive/Pig on GCP
Fast and Scalable: Quickly scale clusters up and down even when jobs are running (90 seconds or less on average)
Open source ecosystem: Easily migrate on-premises Hadoop/Spark jobs to the cloud (it's possible to move existing projects or ETL pipelines without redeveloping any code)
Cost effective: Cloud Dataproc is priced at $0.01 per virtual CPU per cluster per hour on top of any other GCP resources you use. And save money with preemptible instances (short-lived if you don't need them)
Versioning: image versioning allows you to switch between different versions of Apache Spark, Apache Hadoop and other tools.
Integrated: It's integrated, it has built-in integration with Cloud Storage, BigQuery and Cloud Big Table to ensure data will never be lost.
This together with StackDriver Logging and StackDriver Monitoring provides a complete data platform,

Cloud Dataproc has two ways to customize clusters, optional components and initialization actions. Pre-configured optional components can be selected when deployed via the console or the command line and include Anaconda Jupyter notebook, Zeppelin notebook, Presto and Zookeeper.

Setup(Create a cluster):

console
gcloud command/YAML file
Deployment Manager template
Cloud SDK REST API

Configure:

For configuration the cluster can be set up as a single VM, which is usually to keep costs down for development and experimentation. Standard is with a single master node and high availability has three master nodes. You can choose between a region and a zone or select the global region and allow the service to choose the zone for you. The cluster defaults to a global endpoint but defining a regional endpoint may offer increased isolation and in certain cases lower latency. The master node is where the HDFS name node runs as well as the yarn node and job drivers. HDFS replication defaults to to in Cloud Dataproc. Optional components from the Hadoop ecosystem include Anaconda, which is your Python distribution in package manager, Web H CAD, Jupyter Notebook and Zeppelin Notebook as well. Cluster properties are runtime values that can be used by configuration files for more dynamic startup options. And user labels can be used to tag your cluster for your own solutions or your reporting purposes. The master node, worker nodes and preemptible worker nodes if enabled have separate VM options such as vCPU, memory and storage. Preemptible nodes include yarn node manager, but they don't run HDFS. There are a minimum number of worker nodes. The default is two, the maximum number of worker knows is determined by a quota and the number of SS Divs attached to each worker. You can also specify initialization actions such as an initialization script that we saw earlier. It can further customize your worker nodes on startup. And metadata can be defined, so the VM share state information between each other. This may be the first time you saw a preemptible nodes as an option for your cluster.

Optimize:

the main reason to use preemptible VMs or PVMs is to lower costs for fault-tolerant workloads. PVMs can be pulled from service at any time within 24 hours. But if your workload in your cluster architecture is a healthy mix of VMs and PVMs, you may be able to withstand the interruption and get a great discount in the cost of running your job. Custom machine types allow you to specify the balance of memory and CPU to tune the VM to the load, so you're not wasting resources. A custom image can be used to pre-install software. So it takes less time for the customized node become operational, then if you install the software boot time using an initialization script. You can get a persistent SSD boot disk for faster cluster startup.

Dataproc performance optimization

Keep your data close to your cluster
- Place Dataproc cluster in same region as storage bucket
Larger persistent disk = better performance
- Using SSD over HDD
Allocate more VMs
- Use preemptible VM to save on costs

Utilize: (how do you submit a job to Cloud Dataproc for processing? )

console
gcloud command
Orchestration services: Cloud Dataproc Workflow Templates; Cloud Composer
REST API

Monitoring:

Using StackDriver. Or you can also build a custom dashboard with graphs and set up monitoring of alert policies to send emails for example, where you can notify if incidents happen.

Any details from HDFS, YARN, metrics about a particular job or overall metrics for the cluster like CPU utilization, disk and network usage can all be monitored and alerted on with StackDriver.

Cloud Dataproc Initialization Actions

可参照：https://github.com/GoogleCloudDataproc/initialization-actions

There are a lot of pre-built startup scripts that you can leverage for common Hadoop cluster set of tasks like Flink, Jupyter and more.

use initializeion actions to add other software to cluster at startup

gcloud dataproc clusters create <CLUSTER_NAME> --initialication-actions gs://$MY_BUCKET/hbase/hbase.sh --num-masters 3 --num-workers 2

It's pretty easy to adapt existing Hadoop code to use GCS instead of HDFS. It's just a matter of changing the prefix for this storage from hdfs// to gs//.

Converting from HDFS to Google Cloud Storage

Copy data to GCS
- Install connector or copy manually
Update file prefix in scripts
- From hdfs:// to gs://
Use Dataproc and run against/output to GCS

创建Dataproc集群:

Cluster Name输入sparktodp，选择Image Type and Version，勾上Enable Gateway，Optional Components勾上Jupyter Notebook：

点击Notebook：

点击 "OPEN JUPYTERLAB" 打开Jupyter，运行01_spark.ipynb（Run All，或者一步步一个个Cell来），先把数据读到HDFS里，可以看到：

读数据：

Spark 分析：

一种就是调用DataFrame：

另一种就是使用Spark SQL：

执行结果：

最后可以通过matplotlib画图，把上面的attack_stats结果展示出来：

Replace HDFS by Google Cloud Storage

Load csv to BigQuery

bq mk sparktobq


BUCKET='cloud-training-demos-ml'  # CHANGE
bq --location=US load --autodetect --source_format=CSV sparktobq.kdd_cup_raw gs://$BUCKET/kddcup.data_10_percent.gz

Using Cloud Functions, launch analysis every time there is a new file in the bucket. (serverless)


%%bash
wget http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz
gunzip kddcup.data_10_percent.gz
BUCKET='cloud-training-demos-ml'  # CHANGE
gsutil cp kdd* gs://$BUCKET/
bq mk sparktobq


%%writefile main.py
 
from google.cloud import bigquery
import google.cloud.storage as gcs
import tempfile
import os
 
def create_report(BUCKET, gcsfilename, tmpdir):
    """
    Creates report in gs://BUCKET/ based on contents in gcsfilename (gs://bucket/some/dir/filename)
    """
    # connect to BigQuery
    client = bigquery.Client()
    destination_table = 'sparktobq.kdd_cup'
    
    # Specify table schema. Autodetect is not a good idea for production code
    job_config = bigquery.LoadJobConfig()
    schema = [
        bigquery.SchemaField("duration", "INT64"),
    ]
    for name in ['protocol_type', 'service', 'flag']:
        schema.append(bigquery.SchemaField(name, "STRING"))
    for name in 'src_bytes,dst_bytes,wrong_fragment,urgent,hot,num_failed_logins'.split(','):
        schema.append(bigquery.SchemaField(name, "INT64"))
    schema.append(bigquery.SchemaField("unused_10", "STRING"))
    schema.append(bigquery.SchemaField("num_compromised", "INT64"))
    schema.append(bigquery.SchemaField("unused_12", "STRING"))
    for name in 'su_attempted,num_root,num_file_creations'.split(','):
        schema.append(bigquery.SchemaField(name, "INT64")) 
    for fieldno in range(16, 41):
        schema.append(bigquery.SchemaField("unused_{}".format(fieldno), "STRING"))
    schema.append(bigquery.SchemaField("label", "STRING"))
    job_config.schema = schema
 
    # Load CSV data into BigQuery, replacing any rows that were there before
    job_config.create_disposition = bigquery.CreateDisposition.CREATE_IF_NEEDED
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
    job_config.skip_leading_rows = 0
    job_config.source_format = bigquery.SourceFormat.CSV
    load_job = client.load_table_from_uri(gcsfilename, destination_table, job_config=job_config)
    print("Starting LOAD job {} for {}".format(load_job.job_id, gcsfilename))
    load_job.result()  # Waits for table load to complete.
    print("Finished LOAD job {}".format(load_job.job_id))
    
    # connections by protocol
    sql = """
        SELECT COUNT(*) AS count
        FROM sparktobq.kdd_cup
        GROUP BY protocol_type
        ORDER by count ASC    
    """
    connections_by_protocol = client.query(sql).to_dataframe()
    connections_by_protocol.to_csv(os.path.join(tmpdir,"connections_by_protocol.csv"))
    print("Finished analyzing connections")
    
    # attacks plot
    sql = """
                            SELECT 
                             protocol_type, 
                             CASE label
                               WHEN 'normal.' THEN 'no attack'
                               ELSE 'attack'
                             END AS state,
                             COUNT(*) as total_freq,
                             ROUND(AVG(src_bytes), 2) as mean_src_bytes,
                             ROUND(AVG(dst_bytes), 2) as mean_dst_bytes,
                             ROUND(AVG(duration), 2) as mean_duration,
                             SUM(num_failed_logins) as total_failed_logins,
                             SUM(num_compromised) as total_compromised,
                             SUM(num_file_creations) as total_file_creations,
                             SUM(su_attempted) as total_root_attempts,
                             SUM(num_root) as total_root_acceses
                           FROM sparktobq.kdd_cup
                           GROUP BY protocol_type, state
                           ORDER BY 3 DESC
    """
    attack_stats = client.query(sql).to_dataframe()
    ax = attack_stats.plot.bar(x='protocol_type', subplots=True, figsize=(10,25))
    ax[0].get_figure().savefig(os.path.join(tmpdir,'report.png'));
    print("Finished analyzing attacks")
    
    bucket = gcs.Client().get_bucket(BUCKET)
    for blob in bucket.list_blobs(prefix='sparktobq/'):
        blob.delete()
    for fname in ['report.png', 'connections_by_protocol.csv']:
        bucket.blob('sparktobq/{}'.format(fname)).upload_from_filename(os.path.join(tmpdir,fname))
    print("Uploaded report based on {} to {}".format(gcsfilename, BUCKET))
 
 
def bigquery_analysis_cf(data, context):
    # check that trigger is for a file of interest
    bucket = data['bucket']
    name = data['name']
    if ('kddcup' in name) and not ('gz' in name):
        filename = 'gs://{}/{}'.format(bucket, data['name'])
        print(bucket, filename)
        with tempfile.TemporaryDirectory() as tmpdir:
            create_report(bucket, filename, tmpdir)


# test that the function works
import main as bq
 
BUCKET='cloud-training-demos-ml' # CHANGE
try:
    bq.create_report(BUCKET, 'gs://{}/kddcup.data_10_percent'.format(BUCKET), "/tmp")
except Exception as e:
    print(e.errors)

gcloud functions deploy bigquery_analysis_cf --runtime python37 --trigger-resource $BUCKET --trigger-event google.storage.object.finalize

Verify that the Cloud Function is being run. You can do this from the Cloud Functions part of the GCP Console.

Once the function is complete (in about 30 seconds), see if the output folder contains the report:

gsutil ls gs://$BUCKET/sparktobq

Dataflow

is managed data pipelines

Processes data using Compute Engine
- Clusters are sized for you
- Automated scaling
Write code for batch and streaming
Auto scaling, No-Ops, Stream and Batch Processing
Built on Apache Beam
Pipelines are regional-based

Why use Cloud Dataflow?

ETL
Data analytics: batch or streaming
Orchestration: create pipelines that coordinate services, including external services
Integrates with GCP services

Data Processing

Solution:
Apache Beam + Cloud Dataflow

Data Transformation

Cloud Dataproc vs Cloud Dataflow

Key Terms

Element : single entry of data (eg. table row)
PCollection: Distributed data set, input and output
Transform: Data processing in pipeline
ParDo: Type of Transform

Cloud Pub/Sub

is scalable, reliable messaging

Supports many-to-many asynchronous messaging
Push/pull to topics
Support for offline consumers
At least once delivery policy

Global scale messaging buffer/coupler
No-ops
Decouples senders and receivers
Equivalent to Kafka
At-least-once delivery

Pub/Sub overview

Topic: publisher sends messages to topic
Messages are stored in message store until they are delivered and acknowledged by subscribers
Pub/Sub forwards messages from a topic to subscribers. messages can be pushed by Pub/Sub to subscriber or pulled by subscribers from Pub/Sub
Subscriber receives pending messages from subscription and acknowledge to Pub/Sub
After message is acknowledged by the subscriber, it is removed from the subscription’s queue of messages.

Push and Pull

Push = lower latency, more real-time
Push subscribers must be Webhook endpoints that accept POST over HTTPS
Pull ideal for large volume of messages - batch delivery

Demo: how to publish and receive messages in PubSub with Java

create topic
1

gcloud pubsub topics create my-topic

create subscription to this topic

`1`	`gcloud pubsub subscriptions create my-sub --topic my-topic`

git clone project into cloud shell

`1`	`git clone https://github.com/googleapis/java-pubsub.git`

go into the sample
1

cd samples/snippets/
modify PublisherExample.java and SubscribeAsyncExample.java to put the right project id, topic id and subscription id
compile project
1

mvn clean install -DskipTests

run subscriber

`1`	`mvn exec:java -Dexec.mainClass="pubsub.SubscribeAsyncExample"`

run publisher in another screen and observe subscriber

`1`	`mvn exec:java -Dexec.mainClass="pubsub.PublisherExample"`

Datalab

interactive data exploration (Notebook)

Built on Jupyter (formerly IPython)

Easily deploy models to BigQuery. You can visualize data with Google Charts or map plot line

Comparing

Relational database: “Consistency and Reliability over Performance”

Non-Relational Database: “Performance over Consistency”

How to choose the right storage

Cloud Composer & Apache Airflow

Orchestrating work between GCP services with Cloud Composer

使用谷歌云上的Cloud Composer，就可以不用自己装Airflow，只需要关注workflow。

Cloud Composer用GCS(Google Cloud Storage)存储Apache Airflow DAGs，可以在我们的环境里新增，更新，删除DAGs。

The DAGs folder is simply a GCS bucket where you will load your pipeline code. a GCS bucket that is automatically created for when you launch your Cloud Composer Instance.

通过Cloud Functions去event trigger，或者通过schedule去周期性执行

Monitoring and Logging等都可以点击对应的Job详情查看Job的运行情况和细节。

Airflow官网：https://airflow.incubator.apache.org/

Airflow是开源的：https://github.com/apache/airflow

Airflow官方文档：https://airflow.incubator.apache.org/docs/apache-airflow/stable/index.html

Cloud Composer 是基于 Apache Airflow 构建的全代管式工作流编排服务。

端到端地集成多种 Google Cloud 产品，包括 BigQuery、Dataflow、Dataproc、Datastore、Cloud Storage、Pub/Sub 和 AI Platform，让用户可以灵活自由地全方位编排流水线（data pipeline），编写、安排（schedule）和监控（monitor）工作流（workflow）。

What is a Workflow?

a sequence of tasks
started on a schedule or triggered by an event
frequently used to handle big data processing pipelines

安装及使用 Airflow：

pip3 install apache-airflow

airflow db init

airflow webserver -p 8080

airflow users create --role Admin --username admin --email admin --firstname admin --lastname admin --password admin

访问 http://localhost:8080/，输入username和password均为admin即可登录成功：

Graph View:

example_bash_operator:


#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
 
"""Example DAG demonstrating the usage of the BashOperator."""
 
from datetime import timedelta
 
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.dummy import DummyOperator
from airflow.utils.dates import days_ago
 
args = {
    'owner': 'airflow',
}
 
dag = DAG(
    dag_id='example_bash_operator',
    default_args=args,
    schedule_interval='0 0 * * *',
    start_date=days_ago(2),
    dagrun_timeout=timedelta(minutes=60),
    tags=['example', 'example2'],
    params={"example_key": "example_value"},
)
 
run_this_last = DummyOperator(
    task_id='run_this_last',
    dag=dag,
)
 
# [START howto_operator_bash]
run_this = BashOperator(
    task_id='run_after_loop',
    bash_command='echo 1',
    dag=dag,
)
# [END howto_operator_bash]
 
run_this >> run_this_last
 
for i in range(3):
    task = BashOperator(
        task_id='runme_' + str(i),
        bash_command='echo "{{ task_instance_key_str }}" && sleep 1',
        dag=dag,
    )
    task >> run_this
 
# [START howto_operator_bash_template]
also_run_this = BashOperator(
    task_id='also_run_this',
    bash_command='echo "run_id={{ run_id }} | dag_run={{ dag_run }}"',
    dag=dag,
)
# [END howto_operator_bash_template]
also_run_this >> run_this_last
 
if __name__ == "__main__":
    dag.cli()

Trigger DAG 后可以 View Logs。

或者通过 docker 装 airflow：

docker-compose.yml


version: '3'
services:
  postgres:
    image: postgres:9.6
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow
    ports:
      - "5432:5432"
 
  webserver:
    image: puckel/docker-airflow:1.10.1
    build:
      context: https://github.com/puckel/docker-airflow.git#1.10.1
      dockerfile: Dockerfile
      args:
        AIRFLOW_DEPS: gcp_api,s3
        PYTHON_DEPS: sqlalchemy==1.2.0
    restart: always
    depends_on:
      - postgres
    environment:
      - LOAD_EX=n
      - EXECUTOR=Local
      - FERNET_KEY=jsDPRErfv8Z_eVTnGfF8ywd19j4pyqE3NpdUBA_oRTo=
    volumes:
      - ./examples/intro-example/dags:/usr/local/airflow/dags
      # Uncomment to include custom plugins
      # - ./plugins:/usr/local/airflow/plugins
    ports:
      - "8080:8080"
    command: webserver
    healthcheck:
      test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
      interval: 30s
      timeout: 30s
      retries: 3

docker-compose up

即可在http://localhost:8080/看到 airflow web ui

docker-compose logs

docker-compose down

或者通过下面这种Dockerfile:


# Base Image
FROM python:3.7-slim-buster
 
# Arguments that can be set with docker build
ARG AIRFLOW_VERSION=1.10.1
ARG AIRFLOW_HOME=/usr/local/airflow
 
# Export the environment variable AIRFLOW_HOME where airflow will be installed
ENV AIRFLOW_HOME=${AIRFLOW_HOME}
 
ENV AIRFLOW_GPL_UNIDECODE=1
 
# Install dependencies and tools
RUN apt-get update -yqq && \
    apt-get upgrade -yqq && \
    apt-get install -yqq --no-install-recommends \ 
    wget \
    libczmq-dev \
    curl \
    libssl-dev \
    git \
    inetutils-telnet \
    bind9utils freetds-dev \
    libkrb5-dev \
    libsasl2-dev \
    libffi-dev libpq-dev \
    freetds-bin build-essential \
    default-libmysqlclient-dev \
    apt-utils \
    rsync \
    zip \
    unzip \
    gcc \
    locales \
    procps \
    && apt-get clean
 
# Load custom configuration
COPY ./airflow.cfg ${AIRFLOW_HOME}/airflow.cfg
 
# Upgrade pip
# Create airflow user 
# Install apache airflow with subpackages
RUN pip install --upgrade pip && \
    useradd -ms /bin/bash -d ${AIRFLOW_HOME} airflow && \
    pip install apache-airflow==${AIRFLOW_VERSION} --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-1.10.1/constraints-3.7.txt"
 
# Copy the entrypoint.sh from host to container (at path AIRFLOW_HOME)
COPY ./entrypoint.sh /entrypoint.sh
 
# Set the entrypoint.sh file to be executable
RUN chmod +x ./entrypoint.sh
 
# Set the owner of the files in AIRFLOW_HOME to the user airflow
RUN chown -R airflow: ${AIRFLOW_HOME}
 
# Set the username to use
USER airflow
 
# Set workdir (it's like a cd inside the container)
WORKDIR ${AIRFLOW_HOME}
# Create the dags folder which will contain the DAGs
RUN mkdir dags
# Expose the webserver port
EXPOSE 8080
# Execute the entrypoint.sh
ENTRYPOINT [ "/entrypoint.sh" ]

entrypoint.sh:


#!/usr/bin/env bash
 
# Initiliaze the metadata database
airflow initdb
 
# Run the scheduler in background
airflow scheduler &> /dev/null &
 
# Run the web server in foreground (for docker logs)
exec airflow webserver

然后 Build the Airflow image

docker build --tag airflow .

Run the Airflow container

docker run --name my_airflow -it -d -p 8080:8080 airflow

Verify that your Airflow container is running and healthy:

docker ps

Check out the logs:

docker logs my_airflow

mount /xxx目录下python文件写成的DAG到AIRFLOW_HOME目录下的dags目录：

docker run --name my_airflow -it -d -p 8080:8080 --mount type=bind,source=/xxx/my_dag.py,target=/usr/local/airflow/dags/my_dag.py airflow

进入验证my_dag在dags目录下：

docker exec -it my_airflow ls /usr/local/airflow/dags

exec into the container to access the shell.

docker exec -it my_airflow bash

Next, make sure the DAG was parsed correctly:

python dags/my_dag.py

选择Airflow和Python版本，点击创建，即可成功创建env。

还可以安装Python依赖：

接下来，我们就可以参照上面的 example_bash_operator 写 DAG：

跟 BigQuery 集成可以用 bigquery_operator，并且在 Web UI 上设置 Connection，从而操作 BigQuery 里的 Dataset，在 task 里可以写 sql 或者指明 sql 文件。

Airflow 还有另一个比较常用的是 Variables，它就是 key-value 键值对。

推荐以下 Airflow 中文文档：

https://www.kancloud.cn/luponu/airflow-doc-zh/889656

及以下 Youtube视频：

Airflow tutorial 1: Introduction to Apache Airflow

Airflow tutorial 2: Set up airflow environment with docker

Airflow tutorial 3: Set up airflow environment using Google Cloud Composer

Airflow tutorial 4: Writing your first pipeline

Airflow tutorial 5: Airflow concept

Airflow tutorial 6: Build a data pipeline using Google Bigquery

Airflow tutorial 7: Airflow variables

Data Catalog

元数据管理

(1) System: BIGQUERY

Type: Dataset, Table

Resource URL: link to BigQuery URL

Google Data Studio

连接数据源，BI report可视化报表，可以share report，也可以查看shared with me/owned by me的report

Monitoring

Incident, Dashboards, Alerting等

Logging

Logs explorer, Logs Dashboard, Logs Storage retention period等

Machine Learning

TensorFlow
Cloud ML
Machine Learning APIs

Why use CLoud Machine Learning Platform?

For structured data
- Classification and regression
- Recommendation
- Anomaly detection
For unstructured data
- Image and video analytics
- Text analytics
Gain insight from images
Detect inappropriate content
Analyze sentiment
Extract text

Cloud Natural Language API

can return text in real time
Highly accurate, even in noisy environments
Access from any device

Cloud Translation API

Translate strings
Programmatically detect a document’s language
Support for dozen’s languages

Cloud Video Intelligence API

Annotate the contents of video
Detect scene changes
Flag inappropriate content
Support for a variety of video formats

Cloud Build

Run Infrastructure as a code. Let you orchestrate build steps that run as container images and automate Terraform workflow.

可参照https://github.com/agmsb/googlecloudbuild-terraform

Cloud Build 可以从各种代码库或云存储空间导入源代码，根据您的规范执行构建，并生成诸如 Docker 容器或 Java 归档的软件工件。

可以通过 Google Cloud Console、gcloud 命令行工具或 Cloud Build 的 REST API 使用 Cloud Build。

在 Cloud Console 中，您可以通过构建记录页面查看 Cloud Build 构建结果，并通过构建触发器进行自动构建。

您可以使用 gcloud 工具创建和管理构建，并可以运行命令来执行提交构建、列出构建和取消构建等任务。

您可以使用 Cloud Build REST API 请求构建。

与其他 Cloud Platform API 一样，您必须使用 OAuth2 授予访问权限。获得访问授权后，您可以使用 API 启动新构建、查看构建状态和详情、列出每个项目的构建并取消当前正在进行的构建。

构建配置和构建步骤

可以编写构建配置，向 Cloud Build 提供有关执行什么任务的说明。可以将构建配置为提取依赖项，运行单元测试、静态分析和集成测试，并使用 docker、gradle、maven、bazel 和 gulp 等构建工具创建软件工件。

Cloud Build 将构建作为一系列构建步骤执行，其中的每个构建步骤都在 Docker 容器中运行。执行构建步骤类似于在脚本中执行命令。

您可以使用 Cloud Build 和 Cloud Build 社区提供的构建步骤，也可以编写自己的自定义构建步骤：

Cloud Build 提供的构建步骤：Cloud Build 发布了一组适用于常用语言和任务的受支持开源构建步骤。
社区提供的构建步骤：Cloud Build 用户社区提供了开源构建步骤。
自定义构建步骤：您可以自行创建要在自己的构建中使用的构建步骤。

每个构建步骤都通过其连接到本地 Docker 网络（名为 cloudbuild）的容器运行。这使构建步骤可以相互通信并共享数据。

您可以在 Cloud Build 中使用标准 Docker Hub 映像，例如 Ubuntu 和 Gradle。

构建的工作原理

以下步骤描述了一般而言的 Cloud Build 构建生命周期：

准备应用代码及任何所需资源。
创建 YAML 或 JSON 格式的构建配置文件，其中包含 Cloud Build 的说明。
将构建提交到 Cloud Build。
Cloud Build 根据您提供的构建配置执行构建。
如果适用，构建的所有映像都将推送到 Container Registry。Container Registry 可在 Google Cloud 上提供安全、私密的 Docker 映像存储空间。

程序运行结果:

Cloud Run & Cloud Functions & App Engine

以下是Java code部署的场景例子，更多场景可以查看其他官方文档:

使用 App Engine：https://cloud.google.com/appengine/docs/flexible/java/quickstart

使用 Compute Engine：https://cloud.google.com/java/getting-started/getting-started-on-compute-engine

使用 Jib 构建 Java 容器：https://cloud.google.com/java/getting-started/jib

https://cloud.tencent.com/developer/news/612944

Management Tool

Deployment Manager

输入以下命令即可看到创建my-vm成功：

my-vm详情如下:

Pricing

Budget and Alerts

基于 GCP project上的 billing 账户，可以定义在 50%，90% 和 100% 时触发 alerts，可导出账单详情，在 Report 上可看出支出详情。Quotas 可用来预防过度消费资源，有速率分额限制和分配数量限制，比如 Kubernetes services 可设定分额为每 100 秒最多 1000 个调用，每个 project 最多 5 个 VPN。

声明：本文内容由网友自发贡献，不代表【wpsshop博客】立场，版权归原作者所有，本站不承担相应法律责任。如您发现有侵权的内容，请联系我们。转载请注明出处：https://www.wpsshop.cn/w/Guff_9hys/article/detail/1014947