Handling Late Data in Flink Windows


Reference: the official Flink documentation.

Definition:

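When working with event-time windows, elements can arrive late: the watermark that Flink uses to track event-time progress has already passed the end timestamp of the window the element belongs to.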

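By default, late elements are dropped once the watermark has passed the end of the window. However, Flink lets you specify a maximum allowed lateness for window operators. Allowed lateness specifies by how much time an element can be late before it is dropped; its default value is 0. Elements that arrive after the watermark has passed the end of the window, but before it passes the end of the window plus the allowed lateness, are still added to the window. Depending on the trigger in use, a late but not dropped element may cause the window to fire again. This is the case for EventTimeTrigger.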

To make this work, Flink keeps the state of each window until its allowed lateness expires. Once that happens, Flink removes the window and deletes its state.
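
The rule described above can be sketched in plain Java, without any Flink dependency. The names LatenessCheck, classify and windowStart are hypothetical and exist only for illustration. The sketch assumes tumbling windows with offset 0 and non-negative millisecond timestamps; it follows the documented behavior and is not a copy of Flink's source code.

```java
/** A minimal sketch of the allowed-lateness rule for tumbling event-time windows. */
final class LatenessCheck {
    enum Outcome { ON_TIME, LATE_BUT_ACCEPTED, DROPPED }

    /** Start of the tumbling window containing ts (assumes offset 0 and ts >= 0). */
    static long windowStart(long ts, long windowSizeMs) {
        return ts - (ts % windowSizeMs);
    }

    /** Decides what happens to an element given the watermark at the moment it arrives. */
    static Outcome classify(long eventTs, long watermark, long windowSizeMs, long allowedLatenessMs) {
        long windowEnd = windowStart(eventTs, windowSizeMs) + windowSizeMs; // exclusive end
        long maxTimestamp = windowEnd - 1; // last timestamp that still belongs to the window
        if (watermark < maxTimestamp) {
            return Outcome.ON_TIME; // the window has not fired yet
        }
        if (watermark < maxTimestamp + allowedLatenessMs) {
            return Outcome.LATE_BUT_ACCEPTED; // window state is still kept, the element is added
        }
        return Outcome.DROPPED; // window state is gone; the element is dropped or side-output
    }

    public static void main(String[] args) {
        long size = 20_000L;     // 20-second window
        long lateness = 5_000L;  // 5 seconds of allowed lateness
        // The element at 37 s belongs to window [20 s, 40 s).
        System.out.println(classify(37_000L, 30_000L, size, lateness)); // ON_TIME
        System.out.println(classify(37_000L, 41_000L, size, lateness)); // LATE_BUT_ACCEPTED
        System.out.println(classify(37_000L, 45_999L, size, lateness)); // DROPPED
    }
}
```

With an allowed lateness of 0 the middle branch can never be taken, which matches the default behavior of dropping every late element.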

API

  • Setting the allowed lateness:
DataStream<T> input = ...;

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .allowedLateness(<time>)
    .<windowed transformation>(<window function>);
  • Using Flink's side output feature to capture late data and emit it to a side output stream:
final OutputTag<T> lateOutputTag = new OutputTag<T>("late-data"){};

DataStream<T> input = ...;

SingleOutputStreamOperator<T> result = input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .allowedLateness(<time>)
    .sideOutputLateData(lateOutputTag)
    .<windowed transformation>(<window function>);

DataStream<T> lateStream = result.getSideOutput(lateOutputTag);

Example:

package com.ali.flink.demo.driver;

import com.ali.flink.demo.utils.DataGeneratorImpl002;
import com.ali.flink.demo.utils.FlinkEnv;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.datagen.DataGeneratorSource;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.Duration;
import java.util.Date;
import java.util.HashSet;
import java.util.Random;

/**
 * Allows late data and writes data that is too late to a side output stream.
 */
public class FlinkAllowedLatenessDemo01 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = FlinkEnv.FlinkDataStreamRunEnv();

        env.setParallelism(1);

        DataGeneratorSource<String> dataGeneratorSource = new DataGeneratorSource<>(new DataGeneratorImpl002());

        DataStream<String> dataGeneratorStream = env.addSource(dataGeneratorSource).returns(String.class);
//        dataGeneratorStream.print("source");

        SingleOutputStreamOperator<Event> mapStream = dataGeneratorStream
                .map(new MapFunction<String, Event>() {
                    @Override
                    public Event map(String s) throws Exception {
                        JSONObject jsonObject = JSON.parseObject(s);
                        String username = jsonObject.getString("username");
                        String eventtime = jsonObject.getString("eventtime");
                        String click_url = jsonObject.getString("click_url");
                        SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                        Random random = new Random();
                        // Simulate late data: about 30% of events get their event time shifted back by 20 seconds
                        eventtime = random.nextInt(10) > 2 ? eventtime : simpleDateFormat.format(new Date(simpleDateFormat.parse(eventtime).getTime() - 20000));
                        return new Event(username, click_url, eventtime);
                    }
                });

        mapStream.print("map source");

        OutputTag<Event> late = new OutputTag<Event>("late"){};

        SingleOutputStreamOperator<String> outStream = mapStream.assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                .withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
                    @Override
                    public long extractTimestamp(Event s, long l) {
                        SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                        try {
                            return simpleDateFormat.parse(s.eventTime).getTime();
                        } catch (ParseException e) {
                            // Fail fast instead of hitting a NullPointerException on an unparsable timestamp
                            throw new IllegalArgumentException("Unparsable event time: " + s.eventTime, e);
                        }
                    }
                }))
                .keyBy(new KeySelector<Event, String>() {
                    @Override
                    public String getKey(Event event) throws Exception {
                        return event.clickUrl;
                    }
                })
                .window(TumblingEventTimeWindows.of(Time.seconds(20)))
                .allowedLateness(Time.seconds(5))  // accept elements that are up to 5 seconds late
                .sideOutputLateData(late)   // send elements that are later than that to the side output
                .aggregate(new AggregateFunction<Event, HashSet<String>, Long>() {
                    @Override
                    public HashSet<String> createAccumulator() {
                        return new HashSet<>();
                    }

                    @Override
                    public HashSet<String> add(Event event, HashSet<String> set) {
                        set.add(event.userName);
                        return set;
                    }

                    @Override
                    public Long getResult(HashSet<String> set) {
                        return Long.valueOf(set.size());
                    }

                    @Override
                    public HashSet<String> merge(HashSet<String> set, HashSet<String> acc1) {
                        set.addAll(acc1);
                        return set;
                    }
                }, new ProcessWindowFunction<Long, String, String, TimeWindow>() {
                    @Override
                    public void process(String value, Context context, Iterable<Long> iterable, Collector<String> collector) throws Exception {
                        SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                        String start_window = simpleDateFormat.format(context.window().getStart());
                        String end_window = simpleDateFormat.format(context.window().getEnd());
                        Long size = iterable.iterator().next();
                        collector.collect(start_window + "--" + end_window + "--" + value + "--" + size);
                    }
                });
        outStream.print("count");
        outStream.getSideOutput(late).print("late");

        env.execute("tumble window test");
    }

    public static class Event{
        private String userName;
        private String clickUrl;
        private String eventTime;

        public Event(String userName, String clickUrl, String eventTime) {
            this.userName = userName;
            this.clickUrl = clickUrl;
            this.eventTime = eventTime;
        }

        @Override
        public String toString() {
            return "Event{" +
                    "userName='" + userName + '\'' +
                    ", clickUrl='" + clickUrl + '\'' +
                    ", eventTime='" + eventTime + '\'' +
                    '}';
        }
    }
}

Output:

map source> Event{userName='bbb', clickUrl='url1', eventTime='2022-07-07 15:41:35'}
map source> Event{userName='ccc', clickUrl='url1', eventTime='2022-07-07 15:41:42'}
count> 2022-07-07 15:41:20--2022-07-07 15:41:40--url1--1
map source> Event{userName='aaa', clickUrl='url1', eventTime='2022-07-07 15:41:43'}
map source> Event{userName='bbb', clickUrl='url2', eventTime='2022-07-07 15:41:48'}
map source> Event{userName='ccc', clickUrl='url2', eventTime='2022-07-07 15:41:37'}
late> Event{userName='ccc', clickUrl='url2', eventTime='2022-07-07 15:41:37'}
map source> Event{userName='aaa', clickUrl='url1', eventTime='2022-07-07 15:42:06'}
map source> Event{userName='aaa', clickUrl='url2', eventTime='2022-07-07 15:41:46'}
count> 2022-07-07 15:41:40--2022-07-07 15:42:00--url2--1
count> 2022-07-07 15:41:40--2022-07-07 15:42:00--url1--2
count> 2022-07-07 15:41:40--2022-07-07 15:42:00--url2--2
map source> Event{userName='ccc', clickUrl='url1', eventTime='2022-07-07 15:41:49'}
count> 2022-07-07 15:41:40--2022-07-07 15:42:00--url1--2
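  • Summary
    When map source> Event{userName='ccc', clickUrl='url2', eventTime='2022-07-07 15:41:37'} arrives, the preceding record is map source> Event{userName='bbb', clickUrl='url2', eventTime='2022-07-07 15:41:48'}, so the watermark has already advanced to roughly 15:41:46 (the largest event time seen so far, 15:41:48, minus the 2 s out-of-orderness bound). The current window is 2022-07-07 15:41:40--2022-07-07 15:42:00, so the record has missed its own window, 2022-07-07 15:41:20--2022-07-07 15:41:40. The watermark is also past that window's end plus the 5 s allowed lateness (15:41:45), so the record is written to the side output.

    When map source> Event{userName='aaa', clickUrl='url2', eventTime='2022-07-07 15:41:46'} and map source> Event{userName='ccc', clickUrl='url1', eventTime='2022-07-07 15:41:49'} arrive, the preceding record is map source> Event{userName='aaa', clickUrl='url1', eventTime='2022-07-07 15:42:06'}, which puts the watermark at roughly 15:42:04 (15:42:06 minus the 2 s bound). That is past the end of window 2022-07-07 15:41:40--2022-07-07 15:42:00, so the window has already fired once, but it is still before the window end plus the 5 s allowed lateness (15:42:05). The window state is therefore still kept, both records are added to it, and each of them makes the window fire again, which is why the count lines for this window appear more than once.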
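
The reasoning in the summary can be replayed with a small plain-Java simulation. This is only a sketch under stated assumptions: the class LateDataReplay and its methods are hypothetical names, it has no Flink dependency, and it assumes the watermark is updated after every single event as "largest event time minus the out-of-orderness bound minus 1 ms", whereas a real job emits watermarks periodically. The event times are the ones from the output above, reduced to the time of day.

```java
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.List;

/** Replays event times against a 20 s tumbling window, 2 s out-of-orderness and 5 s allowed lateness. */
final class LateDataReplay {
    static final long WINDOW_MS = 20_000L;
    static final long OUT_OF_ORDERNESS_MS = 2_000L;
    static final long ALLOWED_LATENESS_MS = 5_000L;

    /** Converts "HH:mm:ss" into milliseconds since midnight. */
    static long millisOfDay(String hhmmss) {
        return LocalTime.parse(hhmmss).toSecondOfDay() * 1000L;
    }

    /** Returns one line per event describing how the window operator would treat it. */
    static List<String> replay(String... eventTimes) {
        List<String> outcomes = new ArrayList<>();
        long maxTs = Long.MIN_VALUE;
        long watermark = Long.MIN_VALUE; // no watermark before the first event
        for (String t : eventTimes) {
            long ts = millisOfDay(t);
            // Last timestamp of the window this event belongs to
            long windowMaxTs = ts - (ts % WINDOW_MS) + WINDOW_MS - 1;
            String outcome;
            if (watermark < windowMaxTs) {
                outcome = "on time";
            } else if (watermark < windowMaxTs + ALLOWED_LATENESS_MS) {
                outcome = "late, accepted";
            } else {
                outcome = "side output";
            }
            outcomes.add(t + " -> " + outcome);
            // Assumption: the watermark advances immediately after each event
            maxTs = Math.max(maxTs, ts);
            watermark = maxTs - OUT_OF_ORDERNESS_MS - 1;
        }
        return outcomes;
    }

    public static void main(String[] args) {
        replay("15:41:35", "15:41:42", "15:41:43", "15:41:48",
               "15:41:37", "15:42:06", "15:41:46", "15:41:49")
            .forEach(System.out::println);
    }
}
```

Under these assumptions the fifth event (15:41:37) is the only one routed to the side output, and the last two events (15:41:46 and 15:41:49) are late but still accepted, which agrees with the late and count lines in the job output.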
