Source: the official Flink documentation.
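When working with event-time windows, elements can arrive late, that is, the watermark Flink uses to track the progress of event time is already past the end timestamp of the window the element belongs to.
By default, late elements are dropped once the watermark is past the end of the window. However, Flink lets you specify a maximum allowed lateness for window operators. Allowed lateness specifies how late an element may be before it is dropped; its default value is 0. Elements that arrive after the watermark has passed the end of the window, but before it passes the end of the window plus the allowed lateness, are still added to the window. Depending on the trigger used, a late but not dropped element may cause the window to fire again. This is the case for the EventTimeTrigger.
To make this work, Flink keeps the state of a window until its allowed lateness expires. Once that happens, Flink removes the window and deletes its state.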
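The rule above can be sketched in plain Java, without any Flink dependency. This is only an illustrative model: the class `LatenessRule`, the enum `Outcome`, and the method `classify` are hypothetical names, and the comparison is written under the assumption that a window's last valid timestamp is its end minus 1 ms and that the window is cleaned up once the watermark reaches that timestamp plus the allowed lateness.

```java
/** Plain-Java model of the allowed-lateness rule; an illustrative sketch, not Flink code. */
public class LatenessRule {

    /** What happens to an element that arrives for a given window. */
    public enum Outcome { ON_TIME, LATE_BUT_ACCEPTED, DROPPED }

    /**
     * @param windowEnd             exclusive end of the element's window, in epoch millis
     * @param watermark             current watermark, in epoch millis
     * @param allowedLatenessMillis allowed lateness of the window operator (0 = default)
     */
    public static Outcome classify(long windowEnd, long watermark, long allowedLatenessMillis) {
        long maxTimestamp = windowEnd - 1;                        // last timestamp inside the window
        long cleanupTime = maxTimestamp + allowedLatenessMillis;  // assumed point where state is removed

        if (watermark < maxTimestamp) {
            return Outcome.ON_TIME;            // window has not fired yet
        }
        if (watermark < cleanupTime) {
            return Outcome.LATE_BUT_ACCEPTED;  // window already fired, state still kept
        }
        return Outcome.DROPPED;                // state is gone; element is dropped or sent to a side output
    }

    public static void main(String[] args) {
        long windowEnd = 40_000L; // a window [20s, 40s) on a toy timeline
        System.out.println(classify(windowEnd, 39_000L, 5_000L));
        System.out.println(classify(windowEnd, 42_000L, 5_000L));
        System.out.println(classify(windowEnd, 45_999L, 5_000L));
    }
}
```

With an allowed lateness of 0 the middle case disappears: as soon as the watermark reaches the last timestamp of the window, every further element for that window counts as dropped.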
Allowed lateness is specified like this:

DataStream<T> input = ...;

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .allowedLateness(<time>)
    .<windowed transformation>(<window function>);
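Using Flink's side output feature, you can get a stream of the data that was discarded as late. First declare that you want late data with sideOutputLateData(OutputTag) on the windowed stream, then read the side output from the result of the windowed operation:

final OutputTag<T> lateOutputTag = new OutputTag<T>("late-data"){};

DataStream<T> input = ...;

SingleOutputStreamOperator<T> result = input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .allowedLateness(<time>)
    .sideOutputLateData(lateOutputTag)
    .<windowed transformation>(<window function>);

DataStream<T> lateStream = result.getSideOutput(lateOutputTag);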
Complete example:

package com.ali.flink.demo.driver;

import com.ali.flink.demo.utils.DataGeneratorImpl002;
import com.ali.flink.demo.utils.FlinkEnv;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.datagen.DataGeneratorSource;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.Duration;
import java.util.Date;
import java.util.HashSet;
import java.util.Random;

/**
 * Allows data to arrive late and writes data that is too late to a side output.
 */
public class FlinkAllowedLatenessDemo01 {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = FlinkEnv.FlinkDataStreamRunEnv();
        env.setParallelism(1);

        DataGeneratorSource<String> dataGeneratorSource = new DataGeneratorSource<>(new DataGeneratorImpl002());
        DataStream<String> dataGeneratorStream = env.addSource(dataGeneratorSource).returns(String.class);
        // dataGeneratorStream.print("source");

        SingleOutputStreamOperator<Event> mapStream = dataGeneratorStream
                .map(new MapFunction<String, Event>() {
                    @Override
                    public Event map(String s) throws Exception {
                        JSONObject jsonObject = JSON.parseObject(s);
                        String username = jsonObject.getString("username");
                        String eventtime = jsonObject.getString("eventtime");
                        String click_url = jsonObject.getString("click_url");
                        SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                        Random random = new Random();
                        // About 30% of the events get their event time moved back by 20 seconds to simulate late data.
                        eventtime = random.nextInt(10) > 2
                                ? eventtime
                                : simpleDateFormat.format(new Date(simpleDateFormat.parse(eventtime).getTime() - 20000));
                        return new Event(username, click_url, eventtime);
                    }
                });
        mapStream.print("map source");

        OutputTag<Event> late = new OutputTag<Event>("late"){};

        SingleOutputStreamOperator<String> outStream = mapStream
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                        .withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
                            @Override
                            public long extractTimestamp(Event s, long l) {
                                SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                                try {
                                    return simpleDateFormat.parse(s.eventTime).getTime();
                                } catch (ParseException e) {
                                    // Fail fast instead of swallowing the error and hitting a NullPointerException.
                                    throw new IllegalArgumentException("Unparseable event time: " + s.eventTime, e);
                                }
                            }
                        }))
                .keyBy(new KeySelector<Event, String>() {
                    @Override
                    public String getKey(Event event) throws Exception {
                        return event.clickUrl;
                    }
                })
                .window(TumblingEventTimeWindows.of(Time.seconds(20)))
                .allowedLateness(Time.seconds(5)) // accept data that is up to 5 seconds late
                .sideOutputLateData(late)         // send data that is later than that to the side output
                .aggregate(new AggregateFunction<Event, HashSet<String>, Long>() {
                    @Override
                    public HashSet<String> createAccumulator() {
                        return new HashSet<>();
                    }

                    @Override
                    public HashSet<String> add(Event event, HashSet<String> set) {
                        set.add(event.userName);
                        return set;
                    }

                    @Override
                    public Long getResult(HashSet<String> set) {
                        // Number of distinct users per URL and window.
                        return Long.valueOf(set.size());
                    }

                    @Override
                    public HashSet<String> merge(HashSet<String> set, HashSet<String> acc1) {
                        set.addAll(acc1);
                        return set;
                    }
                }, new ProcessWindowFunction<Long, String, String, TimeWindow>() {
                    @Override
                    public void process(String value, Context context, Iterable<Long> iterable, Collector<String> collector) throws Exception {
                        SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                        String start_window = simpleDateFormat.format(context.window().getStart());
                        String end_window = simpleDateFormat.format(context.window().getEnd());
                        Long size = iterable.iterator().next();
                        collector.collect(start_window + "--" + end_window + "--" + value + "--" + size);
                    }
                });

        outStream.print("count");
        outStream.getSideOutput(late).print("late");

        env.execute("tumble window test");
    }

    public static class Event {
        private String userName;
        private String clickUrl;
        private String eventTime;

        public Event(String userName, String clickUrl, String eventTime) {
            this.userName = userName;
            this.clickUrl = clickUrl;
            this.eventTime = eventTime;
        }

        @Override
        public String toString() {
            return "Event{" +
                    "userName='" + userName + '\'' +
                    ", clickUrl='" + clickUrl + '\'' +
                    ", eventTime='" + eventTime + '\'' +
                    '}';
        }
    }
}

Sample output:
map source> Event{userName='bbb', clickUrl='url1', eventTime='2022-07-07 15:41:35'}
map source> Event{userName='ccc', clickUrl='url1', eventTime='2022-07-07 15:41:42'}
count> 2022-07-07 15:41:20--2022-07-07 15:41:40--url1--1
map source> Event{userName='aaa', clickUrl='url1', eventTime='2022-07-07 15:41:43'}
map source> Event{userName='bbb', clickUrl='url2', eventTime='2022-07-07 15:41:48'}
map source> Event{userName='ccc', clickUrl='url2', eventTime='2022-07-07 15:41:37'}
late> Event{userName='ccc', clickUrl='url2', eventTime='2022-07-07 15:41:37'}
map source> Event{userName='aaa', clickUrl='url1', eventTime='2022-07-07 15:42:06'}
map source> Event{userName='aaa', clickUrl='url2', eventTime='2022-07-07 15:41:46'}
count> 2022-07-07 15:41:40--2022-07-07 15:42:00--url2--1
count> 2022-07-07 15:41:40--2022-07-07 15:42:00--url1--2
count> 2022-07-07 15:41:40--2022-07-07 15:42:00--url2--2
map source> Event{userName='ccc', clickUrl='url1', eventTime='2022-07-07 15:41:49'}
count> 2022-07-07 15:41:40--2022-07-07 15:42:00--url1--2
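Summary
When map source> Event{userName='ccc', clickUrl='url2', eventTime='2022-07-07 15:41:37'} arrives, the preceding record is map source> Event{userName='bbb', clickUrl='url2', eventTime='2022-07-07 15:41:48'}, so the watermark is at about 15:41:46 (15:41:48 minus the 2s out-of-orderness bound). The current window is 2022-07-07 15:41:40--2022-07-07 15:42:00, but the event belongs to the window 2022-07-07 15:41:20--2022-07-07 15:41:40. The watermark is already past 15:41:45, the end of that window plus the 5s allowed lateness, so the event is written to the side output.
When map source> Event{userName='aaa', clickUrl='url2', eventTime='2022-07-07 15:41:46'} and map source> Event{userName='ccc', clickUrl='url1', eventTime='2022-07-07 15:41:49'} arrive, the preceding record is map source> Event{userName='aaa', clickUrl='url1', eventTime='2022-07-07 15:42:06'}, so the watermark is at about 15:42:04. That is past the end of the window 2022-07-07 15:41:40--2022-07-07 15:42:00, which has therefore already fired, but still before 15:42:05, the end of the window plus the 5s allowed lateness. Both events are still added to that window, and each one makes it fire again: the url2 count goes from 1 to 2, while the url1 count stays at 2 because user ccc was already counted for url1.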
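The arithmetic behind these three cases can be replayed with a small plain-Java sketch. It is not Flink code: the class `SampleReplay` and its methods are hypothetical helpers, timestamps are parsed as UTC purely for illustration (20-second windows align the same way in any whole-hour time zone), and it assumes the watermark has already advanced to reflect the preceding record, computed as the largest seen timestamp minus the 2s bound minus 1 ms.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

/** Replays the sample events against the window settings of the demo; an illustrative model only. */
public class SampleReplay {

    static final long WINDOW_SIZE = 20_000L;      // TumblingEventTimeWindows.of(Time.seconds(20))
    static final long OUT_OF_ORDERNESS = 2_000L;  // forBoundedOutOfOrderness(Duration.ofSeconds(2))
    static final long ALLOWED_LATENESS = 5_000L;  // allowedLateness(Time.seconds(5))

    /** Parses an ISO local date-time such as 2022-07-07T15:41:37 to epoch millis (UTC assumed). */
    public static long millis(String isoTime) {
        return LocalDateTime.parse(isoTime).toInstant(ZoneOffset.UTC).toEpochMilli();
    }

    /** Watermark after the largest timestamp seen so far (assumed already emitted). */
    public static long watermarkAfter(long maxSeenTimestamp) {
        return maxSeenTimestamp - OUT_OF_ORDERNESS - 1;
    }

    /** Start of the tumbling window that contains the given timestamp. */
    public static long windowStart(long timestamp) {
        return timestamp - (timestamp % WINDOW_SIZE);
    }

    /** True if the event's window has already been cleaned up, so the event goes to the side output. */
    public static boolean isDropped(long eventTimestamp, long watermark) {
        long maxTimestamp = windowStart(eventTimestamp) + WINDOW_SIZE - 1;
        long cleanupTime = maxTimestamp + ALLOWED_LATENESS;
        return cleanupTime <= watermark;
    }

    public static void main(String[] args) {
        long wmAfter4148 = watermarkAfter(millis("2022-07-07T15:41:48"));
        long wmAfter4206 = watermarkAfter(millis("2022-07-07T15:42:06"));

        System.out.println("15:41:37 dropped: " + isDropped(millis("2022-07-07T15:41:37"), wmAfter4148));
        System.out.println("15:41:46 dropped: " + isDropped(millis("2022-07-07T15:41:46"), wmAfter4206));
        System.out.println("15:41:49 dropped: " + isDropped(millis("2022-07-07T15:41:49"), wmAfter4206));
    }
}
```

In a real job, watermarks are emitted periodically rather than after every record, so whether a borderline event is still accepted can depend on when the next watermark is generated.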