当前位置:   article > 正文

时空数据库实践(含 纽约TAXI数据透视分析) - PostGIS + TimescaleDB => PostgreSQL

用pg存储时空类数据

标签

PostgreSQL , 时空数据库 , 时序数据库 , 空间数据库 , timescaleDB , PostGIS


背景

现实社会中,很多业务产生的数据具有时序数据属性(在时间维度上顺序写入,同时包括大量时间区间查询统计的需求)。

例如业务的FEED数据,物联网产生的时序数据(如气象传感器、车辆轨迹、等),金融行业的实时数据等等。

PostgreSQL的UDF和BRIN(块级索引)很适合时序数据的处理。具体有以下的两个例子

《PostgreSQL 按需切片的实现(TimescaleDB插件自动切片功能的plpgsql schemaless实现)》

《PostgreSQL 时序最佳实践 - 证券交易系统数据库设计 - 阿里云RDS PostgreSQL最佳实践》

pic

实际上PostgreSQL生态中,衍生了一个时序插件:timescaleDB。专门用于处理时序数据。(timescale的改进,包括SQL优化器的改进(支持merge append,时间片聚合非常高效),rotate接口,自动分片等)

同时timescaleDB也非常受投资者的关注,已获5000万美金的投资,也间接说明时序数据库在未来是非常受用户欢迎的。

timescaleDB的优势

首先,timescaleDB是自动切片的,对用户无感知,在数据量非常庞大的时候,写入性能不衰减。(主要指IOPS较低的磁盘,如果IOPS较好的磁盘PG在写入大量数据后性能也是OK的。)

pic

其次,timescale改进了SQL优化器,增加了merge append的执行节点,同时在对小时间片进行group by时,可以不用HASH或GROUP整个数据范围,而是分片计算,使得效率非常高。

最后,timescale增加了一些API,使得用户在时序数据的写入、维护、查询都非常的高效、同时易于维护。

API如下

http://docs.timescale.com/v0.8/api

部署timescaleDB

以CentOS 7.x x64为例。

1、首先要安装好PostgreSQL

参考 《PostgreSQL on Linux 最佳部署手册》

  1. export USE_NAMED_POSIX_SEMAPHORES=1
  2. LIBS=-lpthread CFLAGS="-O3" ./configure --prefix=/home/digoal/pgsql10 --with-segsize=8 --with-wal-segsize=256
  3. LIBS=-lpthread CFLAGS="-O3" make world -j 64
  4. LIBS=-lpthread CFLAGS="-O3" make install-world

2、其次需要安装cmake3

  1. epel
  2. yum install -y cmake3
  3. ln -s /usr/bin/cmake3 /usr/bin/cmake

3、编译timescaleDB

  1. git clone https://github.com/timescale/timescaledb/
  2. cd timescaledb
  3. git checkout release-0.8.0
  4. wget https://github.com/timescale/timescaledb/archive/0.8.0.tar.gz
  5. export PATH=/home/digoal/pgsql10/bin:$PATH
  6. export LD_LIBRARY_PATH=/home/digoal/pgsql10/lib:$LD_LIBRARY_PATH
  7. # Bootstrap the build system
  8. ./bootstrap
  9. cd ./build && make
  10. make install
  11. [ 2%] Built target sqlupdatefile
  12. [ 4%] Built target sqlfile
  13. [100%] Built target timescaledb
  14. Install the project...
  15. -- Install configuration: "Release"
  16. -- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb.control
  17. -- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.8.0.sql
  18. -- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.7.1--0.8.0.sql
  19. -- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.1.0--0.2.0.sql
  20. -- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.2.0--0.3.0.sql
  21. -- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.3.0--0.4.0.sql
  22. -- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.4.0--0.4.1.sql
  23. -- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.4.1--0.4.2.sql
  24. -- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.4.2--0.5.0.sql
  25. -- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.5.0--0.6.0.sql
  26. -- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.6.0--0.6.1.sql
  27. -- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.6.1--0.7.0.sql
  28. -- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.6.1--0.7.1.sql
  29. -- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.7.0--0.7.1.sql
  30. -- Installing: /home/dege.zzz/pgsql10/lib/timescaledb.so

4、配置postgresql.conf,在数据库启动时自动加载timescale lib库。

  1. vi $PGDATA/postgresql.conf
  2. shared_preload_libraries = 'timescaledb'
  3. pg_ctl restart -m fast

5、对需要使用timescaledb的数据库,创建插件.

  1. psql
  2. psql (10.1)
  3. Type "help" for help.
  4. postgres=# create extension timescaledb ;

6、timescaledb的相关参数

  1. timescaledb.constraint_aware_append
  2. timescaledb.disable_optimizations
  3. timescaledb.optimize_non_hypertables
  4. timescaledb.restoring
  5. postgres=# show timescaledb.constraint_aware_append ;
  6. timescaledb.constraint_aware_append
  7. -------------------------------------
  8. on
  9. (1 row)
  10. postgres=# show timescaledb.disable_optimizations ;
  11. timescaledb.disable_optimizations
  12. -----------------------------------
  13. off
  14. (1 row)
  15. postgres=# show timescaledb.optimize_non_hypertables ;
  16. timescaledb.optimize_non_hypertables
  17. --------------------------------------
  18. off
  19. (1 row)
  20. postgres=# show timescaledb.restoring ;
  21. timescaledb.restoring
  22. -----------------------
  23. off
  24. (1 row)

timescaleDB使用例子1 - 纽约TAXI数据透视分析

第一个例子是real-life New York City taxicab data ,

http://docs.timescale.com/v0.8/tutorials/tutorial-hello-nyc

数据为真实的数据,来自

http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

1、下载样本数据

wget https://timescaledata.blob.core.windows.net/datasets/nyc_data.tar.gz  

2、解压

tar -zxvf nyc_data.tar.gz   

3、建表,其中包括将普通表转换为时序存储表的API create_hypertable 的使用。

psql -f nyc_data.sql  

截取一些nyc_data.sql的内容如下:

  1. cat nyc_data.sql
  2. -- 打车数据: 包括时长、计费、路程、上车、下车经纬度、时间、人数等等。
  3. CREATE TABLE "rides"(
  4. vendor_id TEXT,
  5. pickup_datetime TIMESTAMP WITHOUT TIME ZONE NOT NULL,
  6. dropoff_datetime TIMESTAMP WITHOUT TIME ZONE NOT NULL,
  7. passenger_count NUMERIC,
  8. trip_distance NUMERIC,
  9. pickup_longitude NUMERIC,
  10. pickup_latitude NUMERIC,
  11. rate_code INTEGER,
  12. dropoff_longitude NUMERIC,
  13. dropoff_latitude NUMERIC,
  14. payment_type INTEGER,
  15. fare_amount NUMERIC,
  16. extra NUMERIC,
  17. mta_tax NUMERIC,
  18. tip_amount NUMERIC,
  19. tolls_amount NUMERIC,
  20. improvement_surcharge NUMERIC,
  21. total_amount NUMERIC
  22. );

这句话,将rides转换为时序表存储

SELECT create_hypertable('rides', 'pickup_datetime', 'payment_type', 2, create_default_indexes=>FALSE);  

创建索引

  1. CREATE INDEX ON rides (vendor_id, pickup_datetime desc);
  2. CREATE INDEX ON rides (pickup_datetime desc, vendor_id);
  3. CREATE INDEX ON rides (rate_code, pickup_datetime DESC);
  4. CREATE INDEX ON rides (passenger_count, pickup_datetime desc);

4、导入测试数据

  1. psql -c "\COPY rides FROM nyc_data_rides.csv CSV"
  2. COPY 10906858

5、对已转换为时序存储表的rides执行一些测试SQL,性能比PostgreSQL普通表要好。

每天同车超过2人的交易,平均计费多少?

  1. -- Average fare amount of rides with 2+ passengers by day
  2. SELECT date_trunc('day', pickup_datetime) as day, avg(fare_amount)
  3. FROM rides
  4. WHERE passenger_count > 1 AND pickup_datetime < '2016-01-08'
  5. GROUP BY day ORDER BY day;
  6. day | avg
  7. --------------------+---------------------
  8. 2016-01-01 00:00:00 | 13.3990821679715529
  9. 2016-01-02 00:00:00 | 13.0224687415181399
  10. 2016-01-03 00:00:00 | 13.5382068607068607
  11. 2016-01-04 00:00:00 | 12.9618895561740149
  12. 2016-01-05 00:00:00 | 12.6614611935518309
  13. 2016-01-06 00:00:00 | 12.5775245695086098
  14. 2016-01-07 00:00:00 | 12.5868802584437019
  15. (7 rows)

6、某些查询的性能甚至超过20倍

每天有多少笔交易。

  1. -- Total number of rides by day for first 5 days
  2. SELECT date_trunc('day', pickup_datetime) as day, COUNT(*) FROM rides
  3. GROUP BY day ORDER BY day
  4. LIMIT 5;
  5. day | count
  6. --------------------+--------
  7. 2016-01-01 00:00:00 | 345037
  8. 2016-01-02 00:00:00 | 312831
  9. 2016-01-03 00:00:00 | 302878
  10. 2016-01-04 00:00:00 | 316171
  11. 2016-01-05 00:00:00 | 343251
  12. (5 rows)

timescale增加了merge append的执行优化,因此在时间片上按小粒度聚合,效率非常高,数据量越大,性能提升的效果越明显。

For example, TimescaleDB introduces a time-based "merge append" optimization to minimize the number of
groups which must be processed to execute the following (given its knowledge that time is already ordered).

For our 100M row table, this results in query latency that is 396x faster than PostgreSQL (82ms vs. 32566ms).

  1. SELECT date_trunc('minute', time) AS minute, max(usage_user)
  2. FROM cpu
  3. WHERE time < '2017-01-01'
  4. GROUP BY minute
  5. ORDER BY minute DESC
  6. LIMIT 5;

7、执行一些timescaleDB特有的功能函数,例如time_bucket,这里同样会用到timescaleDB内置的一些加速算法。

每5分钟间隔为一个BUCKET,输出每个间隔产生了多少笔订单。

  1. -- Number of rides by 5 minute intervals
  2. -- (using the TimescaleDB "time_bucket" function)
  3. SELECT time_bucket('5 minute', pickup_datetime) as five_min, count(*)
  4. FROM rides
  5. WHERE pickup_datetime < '2016-01-01 02:00'
  6. GROUP BY five_min ORDER BY five_min;
  7. five_min | count
  8. ---------------------+-------
  9. 2016-01-01 00:00:00 | 703
  10. 2016-01-01 00:05:00 | 1482
  11. 2016-01-01 00:10:00 | 1959
  12. 2016-01-01 00:15:00 | 2200
  13. 2016-01-01 00:20:00 | 2285
  14. 2016-01-01 00:25:00 | 2291
  15. 2016-01-01 00:30:00 | 2349
  16. 2016-01-01 00:35:00 | 2328
  17. 2016-01-01 00:40:00 | 2440
  18. 2016-01-01 00:45:00 | 2372
  19. 2016-01-01 00:50:00 | 2388
  20. 2016-01-01 00:55:00 | 2473
  21. 2016-01-01 01:00:00 | 2395
  22. 2016-01-01 01:05:00 | 2510
  23. 2016-01-01 01:10:00 | 2412
  24. 2016-01-01 01:15:00 | 2482
  25. 2016-01-01 01:20:00 | 2428
  26. 2016-01-01 01:25:00 | 2433
  27. 2016-01-01 01:30:00 | 2337
  28. 2016-01-01 01:35:00 | 2366
  29. 2016-01-01 01:40:00 | 2325
  30. 2016-01-01 01:45:00 | 2257
  31. 2016-01-01 01:50:00 | 2316
  32. 2016-01-01 01:55:00 | 2250
  33. (24 rows)

8、执行一些统计分析SQL

每个城市的打车交易量。

  1. -- Join rides with rates to get more information on rate_code
  2. SELECT rates.description, COUNT(vendor_id) as num_trips FROM rides
  3. JOIN rates on rides.rate_code = rates.rate_code
  4. WHERE pickup_datetime < '2016-01-08'
  5. GROUP BY rates.description ORDER BY rates.description;
  6. description | num_trips
  7. -----------------------+-----------
  8. JFK | 54832
  9. Nassau or Westchester | 967
  10. Newark | 4126
  11. group ride | 17
  12. negotiated fare | 7193
  13. standard rate | 2266401
  14. (6 rows)

某些城市2016年1月的打车统计(最长、短距离、平均人数、时长等)

  1. -- Analysis of all JFK and EWR rides in Jan 2016
  2. SELECT rates.description, COUNT(vendor_id) as num_trips,
  3. AVG(dropoff_datetime - pickup_datetime) as avg_trip_duration, AVG(total_amount) as avg_total,
  4. AVG(tip_amount) as avg_tip, MIN(trip_distance) as min_distance, AVG(trip_distance) as avg_distance, MAX(trip_distance) as max_distance,
  5. AVG(passenger_count) as avg_passengers
  6. FROM rides
  7. JOIN rates on rides.rate_code = rates.rate_code
  8. WHERE rides.rate_code in (2,3) AND pickup_datetime < '2016-02-01'
  9. GROUP BY rates.description ORDER BY rates.description;
  10. description | num_trips | avg_trip_duration | avg_total | avg_tip | min_distance | avg_distance | max_distance | avg_passengers
  11. -------------+-----------+-------------------+---------------------+--------------------+--------------+---------------------+--------------+--------------------
  12. JFK | 225019 | 00:45:46.822517 | 64.3278115181384683 | 7.3334228220728027 | 0.00 | 17.2602816651038357 | 221.00 | 1.7333869584346211
  13. Newark | 16822 | 00:35:16.157472 | 86.4633688027582927 | 9.5461657353465700 | 0.00 | 16.2706122934252764 | 177.23 | 1.7435501129473309
  14. (2 rows)

9、数据自动分片与执行计划

  1. postgres=# \d+ rides
  2. Table "public.rides"
  3. Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
  4. -----------------------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
  5. vendor_id | text | | | | extended | |
  6. pickup_datetime | timestamp without time zone | | not null | | plain | |
  7. dropoff_datetime | timestamp without time zone | | not null | | plain | |
  8. passenger_count | numeric | | | | main | |
  9. trip_distance | numeric | | | | main | |
  10. pickup_longitude | numeric | | | | main | |
  11. pickup_latitude | numeric | | | | main | |
  12. rate_code | integer | | | | plain | |
  13. dropoff_longitude | numeric | | | | main | |
  14. dropoff_latitude | numeric | | | | main | |
  15. payment_type | integer | | | | plain | |
  16. fare_amount | numeric | | | | main | |
  17. extra | numeric | | | | main | |
  18. mta_tax | numeric | | | | main | |
  19. tip_amount | numeric | | | | main | |
  20. tolls_amount | numeric | | | | main | |
  21. improvement_surcharge | numeric | | | | main | |
  22. total_amount | numeric | | | | main | |
  23. Indexes:
  24. "rides_passenger_count_pickup_datetime_idx" btree (passenger_count, pickup_datetime DESC)
  25. "rides_pickup_datetime_vendor_id_idx" btree (pickup_datetime DESC, vendor_id)
  26. "rides_rate_code_pickup_datetime_idx" btree (rate_code, pickup_datetime DESC)
  27. "rides_vendor_id_pickup_datetime_idx" btree (vendor_id, pickup_datetime DESC)
  28. Child tables: _timescaledb_internal._hyper_1_1_chunk,
  29. _timescaledb_internal._hyper_1_2_chunk,
  30. _timescaledb_internal._hyper_1_3_chunk,
  31. _timescaledb_internal._hyper_1_4_chunk
  32. 其中一个分片的约束如下
  33. Check constraints:
  34. "constraint_1" CHECK (pickup_datetime >= '2015-12-31 00:00:00'::timestamp without time zone AND pickup_datetime < '2016-01-30 00:00:00'::timestamp without time zone)
  35. "constraint_2" CHECK (_timescaledb_internal.get_partition_hash(payment_type) >= 1073741823)
  36. Inherits: rides
  1. -- Peek behind the scenes
  2. postgres=# select count(*) from rides;
  3. count
  4. ----------
  5. 10906858
  6. (1 row)
  7. Time: 376.247 ms
  8. postgres=# explain select count(*) from rides;
  9. QUERY PLAN
  10. ------------------------------------------------------------------------------------------------------------
  11. Finalize Aggregate (cost=254662.23..254662.24 rows=1 width=8)
  12. -> Gather (cost=254661.71..254662.22 rows=5 width=8)
  13. Workers Planned: 5
  14. -> Partial Aggregate (cost=253661.71..253661.72 rows=1 width=8)
  15. -> Append (cost=0.00..247468.57 rows=2477258 width=0)
  16. -> Parallel Seq Scan on rides (cost=0.00..0.00 rows=1 width=0)
  17. -> Parallel Seq Scan on _hyper_1_1_chunk (cost=0.00..77989.57 rows=863657 width=0)
  18. -> Parallel Seq Scan on _hyper_1_2_chunk (cost=0.00..150399.01 rows=1331101 width=0)
  19. -> Parallel Seq Scan on _hyper_1_3_chunk (cost=0.00..6549.75 rows=112675 width=0)
  20. -> Parallel Seq Scan on _hyper_1_4_chunk (cost=0.00..12530.24 rows=169824 width=0)
  21. (10 rows)

10、也可以直接查分片

  1. postgres=# select count(*) from _timescaledb_internal._hyper_1_1_chunk;
  2. count
  3. ---------
  4. 3454961
  5. (1 row)

分片对用户完全透明

分片元数据:

  1. postgres=# \dn
  2. List of schemas
  3. Name | Owner
  4. -----------------------+----------
  5. _timescaledb_cache | postgres
  6. _timescaledb_catalog | postgres
  7. _timescaledb_internal | postgres
  8. public | postgres
  9. (4 rows)

timescaleDB + PostGIS 双剑合璧 - 时空数据库

结合时序数据库timescaleDB插件,空间数据库PostGIS插件。PostgreSQL可以很好的处理空间数据。

1、创建空间数据库PostGIS创建

create extension postgis;  

2、添加空间类型字段

http://postgis.net/docs/manual-2.4/AddGeometryColumn.html

  1. postgres=# SELECT AddGeometryColumn ('public','rides','pickup_geom',2163,'POINT',2);
  2. addgeometrycolumn
  3. --------------------------------------------------------
  4. public.rides.pickup_geom SRID:2163 TYPE:POINT DIMS:2
  5. (1 row)
  6. postgres=# SELECT AddGeometryColumn ('public','rides','dropoff_geom',2163,'POINT',2);
  7. addgeometrycolumn
  8. ---------------------------------------------------------
  9. public.rides.dropoff_geom SRID:2163 TYPE:POINT DIMS:2
  10. (1 row)
  11. postgres=#
  12. postgres=# \d+ rides
  13. Table "public.rides"
  14. Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
  15. -----------------------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
  16. vendor_id | text | | | | extended | |
  17. pickup_datetime | timestamp without time zone | | not null | | plain | |
  18. dropoff_datetime | timestamp without time zone | | not null | | plain | |
  19. passenger_count | numeric | | | | main | |
  20. trip_distance | numeric | | | | main | |
  21. pickup_longitude | numeric | | | | main | |
  22. pickup_latitude | numeric | | | | main | |
  23. rate_code | integer | | | | plain | |
  24. dropoff_longitude | numeric | | | | main | |
  25. dropoff_latitude | numeric | | | | main | |
  26. payment_type | integer | | | | plain | |
  27. fare_amount | numeric | | | | main | |
  28. extra | numeric | | | | main | |
  29. mta_tax | numeric | | | | main | |
  30. tip_amount | numeric | | | | main | |
  31. tolls_amount | numeric | | | | main | |
  32. improvement_surcharge | numeric | | | | main | |
  33. total_amount | numeric | | | | main | |
  34. pickup_geom | geometry(Point,2163) | | | | main | |
  35. dropoff_geom | geometry(Point,2163) | | | | main | |
  36. Indexes:
  37. "rides_passenger_count_pickup_datetime_idx" btree (passenger_count, pickup_datetime DESC)
  38. "rides_pickup_datetime_vendor_id_idx" btree (pickup_datetime DESC, vendor_id)
  39. "rides_rate_code_pickup_datetime_idx" btree (rate_code, pickup_datetime DESC)
  40. "rides_vendor_id_pickup_datetime_idx" btree (vendor_id, pickup_datetime DESC)
  41. Child tables: _timescaledb_internal._hyper_1_1_chunk,
  42. _timescaledb_internal._hyper_1_2_chunk,
  43. _timescaledb_internal._hyper_1_3_chunk,
  44. _timescaledb_internal._hyper_1_4_chunk

3、将数据更新到geometry字段(实际存储为两个自动,分别表示经度和纬度。实际上不更新也没关系,因为PG支持表达式索引,完全可以使用这两个字段,创建表达式空间索引)。

  1. -- Generate the geometry points and write to table
  2. -- (Note: These calculations might take a few mins)
  3. UPDATE rides SET pickup_geom = ST_Transform(ST_SetSRID(ST_MakePoint(pickup_longitude,pickup_latitude),4326),2163);
  4. UPDATE rides SET dropoff_geom = ST_Transform(ST_SetSRID(ST_MakePoint(dropoff_longitude,dropoff_latitude),4326),2163);
  5. vacuum full rides;

4、时空分析举例。

在(lat, long) (40.7589,-73.9851)附近400米范围内,每30分钟有多少辆车被叫(以上车位置来计算)。

  1. -- Number of rides on New Years Eve originating within
  2. -- 400m of Times Square, by 30 min buckets
  3. -- Note: Times Square is at (lat, long) (40.7589,-73.9851)
  4. SELECT time_bucket('30 minutes', pickup_datetime) AS thirty_min, COUNT(*) AS near_times_sq
  5. FROM rides
  6. WHERE ST_Distance(pickup_geom, ST_Transform(ST_SetSRID(ST_MakePoint(-73.9851,40.7589),4326),2163)) < 400
  7. AND pickup_datetime < '2016-01-01 14:00'
  8. GROUP BY thirty_min ORDER BY thirty_min;
  9. thirty_min | near_times_sq
  10. ---------------------+--------------
  11. 2016-01-01 00:00:00 | 74
  12. 2016-01-01 00:30:00 | 102
  13. 2016-01-01 01:00:00 | 120
  14. 2016-01-01 01:30:00 | 98
  15. 2016-01-01 02:00:00 | 112
  16. 2016-01-01 02:30:00 | 109
  17. 2016-01-01 03:00:00 | 163
  18. 2016-01-01 03:30:00 | 181
  19. 2016-01-01 04:00:00 | 214
  20. 2016-01-01 04:30:00 | 185
  21. 2016-01-01 05:00:00 | 158
  22. 2016-01-01 05:30:00 | 113
  23. 2016-01-01 06:00:00 | 102
  24. 2016-01-01 06:30:00 | 91
  25. 2016-01-01 07:00:00 | 88
  26. 2016-01-01 07:30:00 | 58
  27. 2016-01-01 08:00:00 | 72
  28. 2016-01-01 08:30:00 | 94
  29. 2016-01-01 09:00:00 | 115
  30. 2016-01-01 09:30:00 | 118
  31. 2016-01-01 10:00:00 | 135
  32. 2016-01-01 10:30:00 | 160
  33. 2016-01-01 11:00:00 | 212
  34. 2016-01-01 11:30:00 | 229
  35. 2016-01-01 12:00:00 | 244
  36. 2016-01-01 12:30:00 | 230
  37. 2016-01-01 13:00:00 | 235
  38. 2016-01-01 13:30:00 | 238

实例2 - 传感器数据、天气数据

http://docs.timescale.com/v0.8/tutorials/other-sample-datasets

不再赘述。

timescaleDB 常用API

http://docs.timescale.com/v0.8/api

1、创建时序表

create_hypertable()

Required Arguments

NameDescription
main_tableIdentifier of table to convert to hypertable
time_column_nameName of the column containing time values

Optional Arguments

NameDescription
partitioning_columnName of an additional column to partition by. If provided, number_partitions must be set.
number_partitionsNumber of hash partitions to use for partitioning_column when this optional argument is supplied. Must be > 0.
chunk_time_intervalInterval in event time that each chunk covers. Must be > 0. Default is 1 month.
create_default_indexesBoolean whether to create default indexes on time/partitioning columns. Default is TRUE.
if_not_existsBoolean whether to print warning if table already converted to hypertable or raise exception. Default is FALSE.
partitioning_funcThe function to use for calculating a value's partition.

2、添加多级分片字段

支持hash和interval分片

add_dimension()

Required Arguments

NameDescription
main_tableIdentifier of hypertable to add the dimension to.
column_nameName of the column to partition by.

Optional Arguments

NameDescription
number_partitionsNumber of hash partitions to use on column_name. Must be > 0.
interval_lengthInterval that each chunk covers. Must be > 0.
partitioning_funcThe function to use for calculating a value's partition (see create_hypertable instructions).

3、删除分片

删除指定 时间点、多久 之前的分片

drop_chunks()

Required Arguments

NameDescription
older_thanTimestamp of cut-off point for data to be dropped, i.e., anything older than this should be removed.

Optional Arguments

NameDescription
table_nameHypertable name from which to drop chunks. If not supplied, all hypertables are affected.
schema_nameSchema name of the hypertable from which to drop chunks. Defaults to public.
cascadeBoolean on whether to CASCADE the drop on chunks, therefore removing dependent objects on chunks to be removed. Defaults to FALSE.

4、设置分片时间区间

set_chunk_time_interval()

Required Arguments

NameDescription
main_tableIdentifier of hypertable to update interval for.
chunk_time_intervalInterval in event time that each new chunk covers. Must be > 0.

5、分析函数 - 第一条

first()

Required Arguments

NameDescription
valueThe value to return (anyelement)
timeThe timestamp to use for comparison (TIMESTAMP/TIMESTAMPTZ or integer type)

例如,查找所有传感器的最早上传的温度值。

  1. SELECT device_id, first(temp, time)
  2. FROM metrics
  3. GROUP BY device_id;

使用递归亦可实现:

《PostgrSQL 递归SQL的几个应用 - 极客与正常人的思维》

6、分析函数 - 最后一条

last()

Required Arguments

NameDescription
valueThe value to return (anyelement)
timeThe timestamp to use for comparison (TIMESTAMP/TIMESTAMPTZ or integer type)

例如,查找每5分钟时间区间内,每个传感器的最新温度值

  1. SELECT device_id, time_bucket('5 minutes', time) as interval,
  2. last(temp, time)
  3. FROM metrics
  4. WHERE time > now () - interval '1 day'
  5. GROUP BY device_id, interval
  6. ORDER BY interval DESC;

使用递归亦可实现:

《PostgrSQL 递归SQL的几个应用 - 极客与正常人的思维》

7、分析函数 - 柱状图

histogram()

Required Arguments

NameDescription
valueA set of values to partition into a histogram
minThe histogram’s lower bound used in bucketing
maxThe histogram’s upper bound used in bucketing
nbucketsThe integer value for the number of histogram buckets (partitions)

例如,

电池电量20到60,均分为5个BUCKET区间,返回5+2个值的数组(表示每个bucket区间的记录数),头尾分为别为边界外的记录数有多少。

  1. SELECT device_id, histogram(battery_level, 20, 60, 5)
  2. FROM readings
  3. GROUP BY device_id
  4. LIMIT 10;
  5. device_id | histogram
  6. ------------+------------------------------
  7. demo000000 | {0,0,0,7,215,206,572}
  8. demo000001 | {0,12,173,112,99,145,459}
  9. demo000002 | {0,0,187,167,68,229,349}
  10. demo000003 | {197,209,127,221,106,112,28}
  11. demo000004 | {0,0,0,0,0,39,961}
  12. demo000005 | {12,225,171,122,233,80,157}
  13. demo000006 | {0,78,176,170,8,40,528}
  14. demo000007 | {0,0,0,126,239,245,390}
  15. demo000008 | {0,0,311,345,116,228,0}
  16. demo000009 | {295,92,105,50,8,8,442}

8、分析函数 - 时间区间

类似date_trunc,但是更强大,可以用任意interval进行时间截断。方便用户使用。

time_bucket()

Required Arguments

NameDescription
bucket_widthA PostgreSQL time interval for how long each bucket is (interval)
timeThe timestamp to bucket (timestamp/timestamptz/date)

Optional Arguments

NameDescription
offsetThe time interval to offset all buckets by (interval)

9、数据概貌查看函数 - 时序表概貌

hypertable_relation_size_pretty()

  1. SELECT * FROM hypertable_relation_size_pretty('conditions');
  2. table_size | index_size | toast_size | total_size
  3. ------------+------------+------------+------------
  4. 1171 MB | 1608 MB | 176 kB | 2779 MB

10、数据概貌查看函数 - 分片大小

chunk_relation_size_pretty()

  1. SELECT * FROM chunk_relation_size_pretty('conditions');
  2. chunk_table | table_size | index_size | total_size
  3. ---------------------------------------------+------------+------------+------------
  4. "_timescaledb_internal"."_hyper_1_1_chunk" | 28 MB | 36 MB | 64 MB
  5. "_timescaledb_internal"."_hyper_1_2_chunk" | 57 MB | 78 MB | 134 MB
  6. ...

11、数据概貌查看函数 - 索引大小

indexes_relation_size_pretty()

  1. SELECT * FROM indexes_relation_size_pretty('conditions');
  2. index_name_ | total_size
  3. --------------------------------------+------------
  4. public.conditions_device_id_time_idx | 1143 MB
  5. public.conditions_time_idx | 465 MB

12、导出时序元数据

https://raw.githubusercontent.com/timescale/timescaledb/master/scripts/dump_meta_data.sql

psql [your connect flags] -d your_timescale_db < dump_meta_data.sql > dumpfile.txt  

小结

timescaleDB是一个非常好用的时序数据处理插件,隐藏了分片逻辑(对用户透明),同时提供了大量的API函数接口,以及性能优化。在时序场景使用很赞。

结合PostGIS插件,PostgreSQL在时空处理这块,如虎添翼。

参考

《PostgreSQL 按需切片的实现(TimescaleDB插件自动切片功能的plpgsql schemaless实现)》

《时序数据库有哪些特点? TimescaleDB时序数据库介绍》

《PostgreSQL on Linux 最佳部署手册》

http://docs.timescale.com/v0.8/tutorials/tutorial-hello-nyc

http://docs.timescale.com/v0.8/introduction

《PostgrSQL 递归SQL的几个应用 - 极客与正常人的思维》

本文内容由网友自发贡献,转载请注明出处:https://www.wpsshop.cn/w/正经夜光杯/article/detail/741896
推荐阅读
相关标签
  

闽ICP备14008679号