TensorFlow Object Detection for Videos on Windows 10

cv2.resize and the loss of sharpness

The previous article, “TensorFlow Object Detection in Windows (under 30 lines)”, covers about 95% of the code shown below with an explanation of each line, so here we will only look at the amendments made to the previous code that now enable it to run inference on videos instead of images.

Line 10–33

MODEL_NAME = 'ssd_mobilenet_v2_coco'
VIDEO_NAME = 'time_sq.mp4'
CWD_PATH = os.getcwd()
PATH_TO_CKPT = os.path.join(CWD_PATH, MODEL_NAME, 'frozen_inference_graph.pb')
PATH_TO_LABELS = os.path.join(CWD_PATH, 'data', 'mscoco_complete_label_map.pbtxt')
NUM_CLASSES = 90

label_map = label_map_util.load_labelmap(PATH_TO_LABELS)
categories = label_map_util.convert_label_map_to_categories(label_map, max_num_classes=NUM_CLASSES, use_display_name=True)
category_index = label_map_util.create_category_index(categories)

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 1

detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_CKPT, 'rb') as fid:
        serialized_graph = fid.read()
        od_graph_def.ParseFromString(serialized_graph)
        tf.import_graph_def(od_graph_def, name='')
    sess = tf.Session(graph=detection_graph, config=config)

image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
detection_boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
detection_scores = detection_graph.get_tensor_by_name('detection_scores:0')
detection_classes = detection_graph.get_tensor_by_name('detection_classes:0')
num_detections = detection_graph.get_tensor_by_name('num_detections:0')

In summary, the lines above are responsible for the following tasks; an in-depth analysis can be found here.

  • Model and video names are stored in variables.

  • The number of classes is defined, i.e. 90 for the COCO dataset.

  • The label map is used to create the category_index dict, which stores the relation between integer class ids and their respective class names (a small illustration follows this list).

  • The TensorFlow graph is loaded into a Session defined as sess.

  • Python variables are declared that are responsible for feeding data to and extracting data from the loaded TensorFlow session.
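
To make the category_index structure concrete, here is a minimal sketch of what a couple of its entries look like for the COCO label map (the values are shown for illustration only):

# Rough shape of category_index built from the COCO label map (illustration only)
category_index = {
    1: {'id': 1, 'name': 'person'},
    3: {'id': 3, 'name': 'car'},
    # ... entries continue up to the 90 COCO classes
}

# A detected integer class id can then be turned into a readable label:
print(category_index[3]['name'])    # -> car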

Line 34–38:

video = cv2.VideoCapture(VIDEO_NAME)

while True:
    stime = time.time()
    ret, frame = video.read()
    if ret == True:

We are using OpenCV to read our .mp4/.mov video file. Since a video is just a series of images, we need a continuous loop to go through all the frames OpenCV hands us through ret, frame = video.read(), so we start a while True: loop that keeps running until we break out of it. OpenCV's .read() method returns two values: the read state, which is True or False, and the array of pixel values representing the image (e.g. 1280x720x3). A False value of ret means there is nothing left to read and the accompanying frame variable will come up empty. Therefore ret acts as our condition for staying in the loop or breaking out of it. We use this to our advantage and only begin the detection part if image data is available, otherwise we break out of the while loop, as seen in line 38 (if ret == True:).

Line 39–40

frame = cv2.resize(frame, (300, 300))
frame_expanded = np.expand_dims(frame, axis=0)

Our frame from the video may be of any size, and that can be a problem for the architecture we have loaded into memory (particularly if the frame is high resolution, e.g. a 1280x720 HD video). A high-resolution frame takes more time to process, which reduces the frames-per-second smoothness we expect at the end of our program. To improve the FPS of the output video, we reduce the frame size to 300x300, which helps our cause. Then we add one extra dimension to the frame so that it can be fed to our graph, as we will see in the next couple of lines.

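As a quick illustration of what these two lines do to the array shape, here is a minimal sketch assuming a 1280x720 BGR frame (the zero array simply stands in for a real frame):

import numpy as np
import cv2

frame = np.zeros((720, 1280, 3), dtype=np.uint8)   # stand-in for one 1280x720 HD frame
frame = cv2.resize(frame, (300, 300))               # shape becomes (300, 300, 3)
frame_expanded = np.expand_dims(frame, axis=0)       # shape becomes (1, 300, 300, 3)
print(frame_expanded.shape)                          # the leading 1 is the batch dimension image_tensor expects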

Line 41–43

(boxes, scores, classes, num) = sess.run(
    [detection_boxes, detection_scores, detection_classes, num_detections],
    feed_dict={image_tensor: frame_expanded})

vis_util.visualize_boxes_and_labels_on_image_array(
    frame,
    np.squeeze(boxes),
    np.squeeze(classes).astype(np.int32),
    np.squeeze(scores),
    category_index,
    use_normalized_coordinates=True,
    line_thickness=1,
    min_score_thresh=0.75)

cv2.imshow('output', frame)

In short, we send the frame into our graph; the results obtained are then processed and bounding boxes are drawn using the vis_util library. A longer walkthrough of the same can be found here.

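If you also want to work with the raw outputs directly rather than only drawing them, a hypothetical helper like the one below (filter_detections is not part of the original code) shows how the sess.run results can be filtered with the same 0.75 threshold that vis_util uses above:

import numpy as np

def filter_detections(boxes, scores, classes, category_index, min_score=0.75):
    boxes = np.squeeze(boxes)                        # normalized [ymin, xmin, ymax, xmax] per detection
    scores = np.squeeze(scores)                      # confidence per detection
    classes = np.squeeze(classes).astype(np.int32)   # integer class ids
    kept = scores >= min_score
    labels = [category_index[c]['name'] for c in classes[kept]]
    return boxes[kept], scores[kept], labels

# usage, right after sess.run:
# kept_boxes, kept_scores, labels = filter_detections(boxes, scores, classes, category_index)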

Line 44–51

    print('FPS {:.1f}'.format(1 / (time.time() - stime)))
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
    if ret == False:
        print('vid not Present')
        break

video.release()
cv2.destroyAllWindows()

Almost all of the work has already been done; here we just set up a termination condition: pressing 'q' on the keyboard terminates the loop and ends the program. Also, when our ret variable is NOT TRUE, meaning we have no more data to feed into the graph, we break the while loop and terminate the program. video.release() and cv2.destroyAllWindows() are responsible for releasing the video capture object and closing all output windows, respectively.

Source video: https://www.videvo.net/video/mcdonalds-in-times-square-/1772/

Let's run our program on the above video; the download link is provided here. What is the logical expectation for the output?

This is what it looks like:

Shocked? Wondering what's wrong? What did we miss?

NOTE: Your mileage may vary depending on the configuration of your machine; it may be close to 30 FPS or worse than 5 FPS. I ran this code on an RTX 2060 with 6 GB GDDR6 memory coupled with an i5-8600 CPU @ 3.10 GHz.

Let's analyze the code again. Line 39: frame = cv2.resize(frame,(300,300)). This explains the loss of quality; it is why we see pixelated images and a loss of sharpness. But there is also a considerable amount of jitter/lag in the video; it does not seem as smooth as it was originally. What do you think the reason for that might be?

In BIG words, the answer is FPS, i.e. the Frames Per Second of the video, which is causing this jitter/lag effect. Why does FPS become an issue? To learn in detail, click here. In short, FPS is the output rate of our graph: the speed at which the graph can process an image. For the jittery video above it was an average of 14 FPS.

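The per-frame print above only gives instantaneous values; a minimal sketch of how an average figure like the 14 FPS could be obtained is shown below (fps_per_frame is a hypothetical list, not part of the original code):

import time

fps_per_frame = []          # hypothetical list, filled once per processed frame

# inside the while loop, alongside the existing print:
#     fps_per_frame.append(1 / (time.time() - stime))

# after the loop has finished:
if fps_per_frame:
    print('average FPS: {:.1f}'.format(sum(fps_per_frame) / len(fps_per_frame)))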

For a video to look smooth, it should run above 25 FPS (the standard for playback in the UK is 25 frames per second and in the US it is 30 fps), whereas YouTube supports 24, 25, 30, 48, 50 and 60 FPS and the majority of videos run at 29.97 FPS. This is what makes video look as effortless as it does.

So now we understand why our results are bad: the output is at 14 FPS. But why did we get 14 FPS when our input is at 30 FPS? There are many factors behind this; the major ones are as follows:

  • Model-architecture limit: every architecture comes with an upper limit, as it is nothing but a pre-arranged set of numbers that runs a series of calculations on an image. More numbers (a heavier model) mean more calculation and therefore more time spent on each image, reducing the number of images processed per second. You can learn more about the SSD-MobileNet architecture and its internal workings here.

  • Image resolution: image resolution plays a considerable role, as it directly represents the amount of data fed to our graph per image; the higher the resolution, the longer each frame takes to process, and the lower the FPS. A rough way to measure this effect is sketched after this list.

  • Hardware: the bigger the better; if you happen to be running this on TF-CPU you will have a hard time getting even 4-5 FPS. Make sure you have suitable hardware.
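
A crude way to see the resolution effect on your own machine is sketched below. It reuses frame, sess and the detection tensors defined earlier; the list of sizes is only an example, and a single-frame timing is noisy, so treat the numbers as indicative:

import time
import cv2
import numpy as np

for width, height in [(300, 300), (640, 360), (1280, 720), (1920, 1080)]:
    resized = cv2.resize(frame, (width, height))
    batch = np.expand_dims(resized, axis=0)
    start = time.time()
    sess.run([detection_boxes, detection_scores, detection_classes, num_detections],
             feed_dict={image_tensor: batch})
    print('{}x{}: {:.1f} FPS'.format(width, height, 1 / (time.time() - start)))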

[Figure: Full-RES 1920x1080 NODs-FPS]

For the given video, we created a DataFrame compiling the NODs (number of detections) and FPS for the full-resolution video and plotted the graph above. NODs and FPS share the same y-axis, with NODs ranging from 0-8 objects detected per frame and FPS ranging from 4-14 frames per second.

The x-axis is the frame number; there are a total of 391 frames in the 14-second video, making the FPS of the input video about 28. The orange line represents the number of detections made by our architecture per image and the blue line is the FPS value at that instant. It can clearly be seen that whenever we have a high NOD count, the FPS value at that instant drops. Remember that our model is looking for 90 types of objects in every frame. During the stretches where there is no object to detect we get a fairly high FPS output. To make the subject clearer, I will amalgamate data from 3 versions of the same video:

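A minimal sketch of how such a plot could be assembled with pandas and matplotlib, assuming two hypothetical per-frame lists (nods_per_frame and fps_per_frame, neither part of the original code) were collected inside the main loop:

import pandas as pd
import matplotlib.pyplot as plt

# In the real loop these would be appended once per frame, e.g.
#     nods_per_frame.append(int((np.squeeze(scores) >= 0.75).sum()))
#     fps_per_frame.append(1 / (time.time() - stime))
nods_per_frame = [0, 2, 5, 8, 3, 1]                 # example values for illustration only
fps_per_frame = [14.2, 11.0, 7.5, 4.8, 9.3, 13.6]

df = pd.DataFrame({'NOD': nods_per_frame, 'FPS': fps_per_frame})
ax = df.plot()                                       # both series share one y-axis, x-axis is the frame index
ax.set_xlabel('frame number')
ax.set_ylabel('detections per frame / frames per second')
plt.show()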

[Figure: average FPS and total NODs for the three resolution versions]

The image on the left shows the average FPS and total number of detections; it is quite clear that the highest FPS was obtained on the lowest-resolution video, and as discussed earlier we also see a huge drop in NODs, meaning most of the data is lost as noise. The highest NODs were seen on the full-resolution video, but it also came with the lowest FPS. To deploy the model, or at least make it more useful for our needs, we need to find the right balance between acceptable results (an accurate and precise bounding box for each detected object) and FPS.

[Figure: NODs and FPS overlaid for the three runs at different resolutions]

The graph above shows the overlapping data obtained from 3 runs at different resolutions.

Running inference at Full resolution 1920x1080

Now we know how to implement TensorFlow object detection on video using a pre-trained architecture such as SSD-MobileNet, as seen here. The results are in the acceptable range and you can easily implement the same. But what if your need is very specific and this SSD-MobileNet architecture, trained on the COCO dataset, is unable to detect the class you are interested in? Perhaps you have a completely different set of demands and the list of 90 object categories does not cater to your need. You may have your own dataset with totally different object categories: it may be MRI scans searched for anomalies, or images of an apple orchard with marked apples as annotations.

This deviation from the original dataset, and the resulting inability of our model, can be addressed by TRAINING the model on our own custom-created dataset. This dataset can be any set of properly annotated images; its size is also up to the user. We then follow certain steps and TRAIN the existing MODEL to recognize our own object category. After training, we use this newly trained model for our specific need, which is now bound to produce better results.

As stated in the article: Practical aspects to select a Model for Object Detection.

We want to obtain the result shown on the left side of the above video; the right side is the same model we just ran on our example video. Because our example video was practically home ground for the model, we saw acceptable performance out of it. The moment we change the conditions of the environment, the results are drastically affected. It is quite clear how poor the performance becomes; the steps we need to take in order to CREATE and TRAIN on a custom dataset are discussed in the next article.

Code available here.

Other related articles are:

Translated from: https://medium.com/@deep12vish/tensorflow-object-detection-for-videos-onwindows-10-1c1a9ffd6cac
