Embedded video analytics for object detection and tracking using multiscale features
Please note that this article was automatically translated from Russian into English, so the translation may not be accurate.

Dr Nikolai Ptitsyn, Synesis
GraphiCon 2010

Novel video analytics algorithms are presented that enable embedded motion detection and object tracking for CCTV systems. The motion detection algorithm is based on the neurobiological mechanism of the primary visual cortex (V1). It uses a sequence of simple pixel operations: a linear operator (weighted sum) and nonlinear operators (max, saturation). The object tracking algorithm is a hybrid of two approaches: (1) time series analysis of motion detector regions and (2) spatial correlation between the features of the current frame and the features of the object model. Unique advantages of this analytics pipeline include its efficiency on high-definition (HD) video streams and its ability to track low-contrast, overlapping objects against a dynamic background. The embedded video analytics has been implemented and deployed on several platforms, including Texas Instruments DSPs. A comprehensive testing environment was set up to estimate the overall performance of the video analytics implementations. A fully embedded DSP implementation has been i-LIDS approved both as a primary detection system for operational alert use and as an event-based recording system in sterile zone monitoring applications.

1. Introduction

Automating the processing of video streams in CCTV systems is an important scientific and engineering challenge. Video analytics is software based on machine vision algorithms that detects, tracks, classifies and/or identifies moving objects in the camera's field of view without an operator [1], pp. 287-312. A promising direction is to embed video analytics algorithms directly into the camera or IP device [2].
Compared with a server implementation, embedded analytics processes video that is free of the distortions introduced by an analog or digital communication channel. Processing inside the camera at a higher resolution and frame rate can potentially provide higher accuracy. On the other hand, known video analytics algorithms are difficult to adapt for embedding in mass-produced cameras because of their computational complexity. The hardware resources of a single-chip camera platform (processor instruction set, clock speed and memory size) are limited by constraints on heat dissipation and cost. Fundamentally new algorithms with better computational efficiency are therefore needed, especially when the camera uses a high-definition sensor. The main task of embedded analytics is to provide the initial detection and tracking of targets in the field of view. Its output is the coordinates, trajectories and characteristics of objects. Other tasks, such as refined classification, identification and inter-camera tracking, can be performed effectively on the server side.

2. The classical approach to object detection

The general algorithmic approach to detecting moving objects is to analyze the differences between the current frame and a background model. In simplified form this approach is called background subtraction. Whether a pixel belongs to an object or to the background is decided from the deviation of its value (brightness) in the current frame from a statistical estimate of the background (see video). There are many methods for modelling the background image [3]. The most common are the running Gaussian average and the mixture of Gaussians.
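As an illustration of background subtraction with a running Gaussian average, the following minimal Python sketch keeps one mean and one variance per pixel. It follows one common textbook formulation and is not the implementation described in this article; the function name `update_background`, the learning rate `alpha`, the threshold factor `k` and the choice to freeze the model under foreground pixels are all assumptions made for the example.

```python
def update_background(mean, var, frame, alpha=0.05, k=2.5):
    """Classify pixels and update a per-pixel running Gaussian background model.

    mean, var and frame are equally sized 2-D lists of grey levels.
    Returns (mask, new_mean, new_var); mask[y][x] is True for foreground.
    """
    mask, new_mean, new_var = [], [], []
    for m_row, v_row, f_row in zip(mean, var, frame):
        mask_row, mean_row, var_row = [], [], []
        for m, v, f in zip(m_row, v_row, f_row):
            d = f - m
            foreground = d * d > k * k * v  # same test as |f - mean| > k * sigma
            if foreground:
                # Selective update (an assumption): keep the model unchanged
                # under pixels that are classified as foreground.
                mean_row.append(m)
                var_row.append(v)
            else:
                mean_row.append(m + alpha * d)  # equals (1 - alpha) * m + alpha * f
                var_row.append((1 - alpha) * v + alpha * d * d)
            mask_row.append(foreground)
        mask.append(mask_row)
        new_mean.append(mean_row)
        new_var.append(var_row)
    return mask, new_mean, new_var


if __name__ == "__main__":
    mask, mean, var = update_background([[10.0, 10.0]], [[4.0, 4.0]], [[11.0, 30.0]])
    print(mask)  # the second pixel deviates by far more than k * sigma
```

A mixture of Gaussians extends the same idea by keeping several such (mean, variance, weight) triples per pixel, at a correspondingly higher cost in memory and arithmetic.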
At the segmentation stage, the individual pixels separated from the background are merged into regions by means of morphological operations [4], pp. 481-495. Regions that match the expected size and shape can be considered targets. As a rule, the morphological analysis stage has the highest computational cost. The complexity of the analysis grows nonlinearly with the area of the regions and with their number. The main problems of embedded analytics based on the classical approach are as follows:
3. The new algorithm

3.1 Neurobiological mechanism

The idea of the proposed algorithm is borrowed from nature, where the evolution of the nervous system of living beings has achieved outstanding results in visual analysis [5]. Consider the functional diagram of the primary visual cortex V1 (Fig. 1), which is well developed in primates, including humans. The neural network is composed of cells of two types, simple cells (S) and complex cells (C).
Figure 1: Image processing in the primary visual cortex: dashes - features corresponding to directional filters; S1, S2 - layers of simple cells; C1, C2 - layers of complex cells; solid blue line - weighted sum; dashed green line - selection of the maximum.

At the input of the neural network, the original image from the retina is processed by the simple cells S1. Simple cells implement directional filtering, which picks out edges of particular orientations. The purpose of a directional filter is to extract image features that are invariant to illumination. Fig. 1 shows four directions: horizontal, vertical and two diagonal. Similar gradient-based edge detectors are widely used in machine vision [4], pp. 315-338.

At the level of the complex cells C1, the simple cells S1 are grouped by direction and the maximum value is selected. A complex cell is thus selective to a feature while remaining invariant to a shift of that feature within the neighbourhood of its input neurons.

At the level of the simple cells S2, a weighted sum of the outputs of the complex cells C1 is computed. Summing the signals for different features at the S2 level produces composite features that combine local data for several directions. They are similar to Haar features but, thanks to the preceding layer of complex cells, generalize better to deformable shapes.

At the level of the complex cells C2 the nonlinear max operation is applied again, grouping not only the outputs of the preceding layer S2 (position invariance) but also the outputs of the lower layer C1 (scale invariance). Thus, at the C2 level simple and composite features are combined to achieve invariance to shift and scale simultaneously.

In addition, two important properties of visual cortex cells are (1) nonlinear behaviour over time and (2) nonlinear contrast transfer [6].
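The alternation of a weighted sum (S1) and a local maximum (C1) described above can be sketched in a few lines of Python. This is a simplified illustration and not the filters used in the article: the two-pixel difference kernel, the rectification by absolute value, the neighbourhood `radius` and the names `s1_layer` and `c1_layer` are assumptions.

```python
# Pixel offsets (dy, dx) along which the difference is taken.
DIRECTIONS = {
    "along_x": (0, 1),
    "along_y": (1, 0),
    "along_diag_down": (1, 1),
    "along_diag_up": (-1, 1),
}


def s1_layer(image):
    """S1: weighted sum (+1, -1) of two neighbouring pixels per direction, rectified."""
    h, w = len(image), len(image[0])
    out = {}
    for name, (dy, dx) in DIRECTIONS.items():
        resp = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                y2, x2 = y + dy, x + dx
                if 0 <= y2 < h and 0 <= x2 < w:
                    resp[y][x] = abs(image[y2][x2] - image[y][x])
        out[name] = resp
    return out


def c1_layer(s1, radius=1):
    """C1: maximum of each S1 map over a (2 * radius + 1)-wide square neighbourhood."""
    out = {}
    for name, resp in s1.items():
        h, w = len(resp), len(resp[0])
        out[name] = [
            [
                max(
                    resp[yy][xx]
                    for yy in range(max(0, y - radius), min(h, y + radius + 1))
                    for xx in range(max(0, x - radius), min(w, x + radius + 1))
                )
                for x in range(w)
            ]
            for y in range(h)
        ]
    return out


if __name__ == "__main__":
    edge_a = [[0, 0, 5, 5]] * 3  # step edge between columns 1 and 2
    edge_b = [[0, 0, 0, 5]] * 3  # same edge shifted by one pixel
    print(c1_layer(s1_layer(edge_a))["along_x"][1])
    print(c1_layer(s1_layer(edge_b))["along_x"][1])
```

In this toy case the C1 response at the centre of the image is the same for both edge positions, which is the shift tolerance that the max operation is meant to provide.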
Consider property (2), the nonlinear contrast transfer. It manifests itself as saturation (the saturate operation) of a feature's output value at a certain level, which normalizes the feature under non-uniform contrast. Saturation occurs in both simple and complex cells. The Naka-Rushton equation approximates the saturating transfer characteristic (Fig. 2).
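For reference, the Naka-Rushton function is commonly written as R(c) = R_max * c^n / (c^n + c50^n), where c is the input contrast, R_max the maximum response, c50 the contrast that gives half of the maximum response and n the exponent that sets the steepness. The sketch below evaluates this standard form; the symbols and the default parameter values are conventional choices and may differ from the notation and values used by the author.

```python
def naka_rushton(c, r_max=1.0, c50=0.2, n=2.0):
    """Saturating contrast response R(c) = r_max * c**n / (c**n + c50**n)."""
    if c <= 0.0:
        return 0.0  # no response to zero (or invalid negative) contrast
    cn = c ** n
    return r_max * cn / (cn + c50 ** n)


if __name__ == "__main__":
    for contrast in (0.05, 0.2, 0.8, 5.0):
        print(contrast, round(naka_rushton(contrast), 4))
```

The response grows steeply around c50 and then flattens towards R_max, so large contrast differences in the input produce only small differences in the saturated feature.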
Figure 2: Transfer characteristic of visual cortex cells

Note the following properties of the visual cortex V1 for their subsequent adaptation to machine video analytics:
The max and sum operations are applied alternately and iteratively. The composition of just two simple operations (one linear, one nonlinear) yields a video analysis system that is highly complex as a whole. A similar technique is used in block ciphers to maximize the diffusion of data within a block with a minimum number of arithmetic operations.

3.2 Multiscale representation

The multiscale approach [4], pp. 125-142, has already been used successfully for motion detection [7] and for the segmentation of complex scenes [8]. However, these algorithms are currently unsuitable for mass deployment in video surveillance cameras because of their computational complexity. This paper considers approaches that reduce the resource consumption of the algorithms by several orders of magnitude and make them applicable in embedded video analysis systems. Consider a multiscale representation of a single feature (simple or composite) in the form of a pyramid (Fig. 3). There may be several such pyramids, one for each feature, as well as pyramids for segmentation masks and other auxiliary data.
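A feature pyramid of this kind can be sketched as follows. The example halves the resolution at each level by combining every 2 x 2 block, either with an average (a sum-type operation) or with a maximum. The 2 x 2 block is a simplification chosen for brevity: a true Gaussian pyramid uses a wider smoothing kernel, and the function names here are assumptions.

```python
def mean4(values):
    """Sum-type reduction: average of the values in one block."""
    return sum(values) / len(values)


def reduce_layer(layer, op):
    """Halve the resolution by combining every 2 x 2 block with op."""
    h, w = len(layer) // 2, len(layer[0]) // 2
    return [
        [
            op([
                layer[2 * y][2 * x], layer[2 * y][2 * x + 1],
                layer[2 * y + 1][2 * x], layer[2 * y + 1][2 * x + 1],
            ])
            for x in range(w)
        ]
        for y in range(h)
    ]


def build_pyramid(feature, levels, op=mean4):
    """Return [finest, ..., coarsest]; stops early if a layer cannot be halved."""
    pyramid = [feature]
    for _ in range(levels - 1):
        if len(pyramid[-1]) < 2 or len(pyramid[-1][0]) < 2:
            break
        pyramid.append(reduce_layer(pyramid[-1], op))
    return pyramid


if __name__ == "__main__":
    feature = [[16, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
    print(build_pyramid(feature, 3, op=mean4)[-1])  # averaged response fades
    print(build_pyramid(feature, 3, op=max)[-1])    # max keeps the peak value
```

The two reductions behave differently on an isolated strong response: averaging dilutes it at every level, while the maximum carries it to the top of the pyramid unchanged.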
Figure 3: Two phases of multiscale video analysis: x, y - spatial coordinates of the image; s - scale.

An optimal algorithm for embedded video analytics includes two phases of multiscale data processing:
During back-propagation, each subsequent layer of the pyramid is computed using the simple linear and nonlinear operations discussed above. The sum operation can be analogous to the reduce operation used to construct a Gaussian pyramid [4], p. 137. The max operation strengthens features and prevents them from being "smeared" across the pyramid and, as discussed above, provides shift and scale invariance. The saturate operation is important for stable operation of the detectors under non-uniform illumination and noise. The optimal set of features and the sequence of sum, max and saturate operations depend on the specific function performed by the embedded analytics:
Fig. 4 shows the result of applying machine video analytics with three features (brightness and two saturated gradients). The algorithm reliably detects ducks against a changing background (water ripples with high-contrast reflections). It uses a unimodal probabilistic background model, the multiscale segmenter described below, and a primitive tracking algorithm (linking regions into a trajectory without building a statistical model of the object).
Figure 4: Tracking ducks against a changing, high-contrast background. The full video can be viewed at http://www.youtube.com/watch?v=PmJTnClUjYw

3.3 Object segmentation

The object mask computed by the segmenter is useful for calculating the object characteristics needed for tracking and for more accurate background modelling. As noted above, the major shortcoming of the classical approach is the high resource cost of the morphological operations used to determine region masks. The multiscale approach significantly improves the computational efficiency of the analytics by limiting the segmentation depth and/or by using a model of the object's shape. As a result, the size of the detected objects, which can vary considerably across the field of view, does not significantly affect the computational cost of the algorithm.

Consider the proposed algorithmic approach in detail. During back-propagation of the signal through the pyramid (Fig. 3), a multiscale mask of the region is formed. Segmentation proceeds from a coarse mask to a detailed one, and the process can be stopped once the desired level of detail is reached or the quota of computing resources is exhausted. The input data are the region mask from the previous layer, the difference between the features of the current frame and the background model on the current layer, and an optional model of the detected object. The output is the region mask on the current layer. At each pixel the mask is refined using the pyramid linking method [4], pp. 433-436.

Fig. 5 shows the result of the multiscale segmenter based on pyramid linking over several features: at the top, the original frame processed by the video analytics, with the trajectory of the tracked person; below, four layers of the segmentation mask. The masks contain minor errors in the form of isolated points and imprecise boundaries, which result from computational optimization of the morphological operators.
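The coarse-to-fine refinement described above can be illustrated with a deliberately simplified Python sketch. It is not the pyramid linking method of [4]: each mask pixel is simply passed down to those of its four children whose frame-to-background difference exceeds a threshold. The function name, the fixed `threshold` and the `stop_level` parameter are assumptions, and the optional object model is omitted.

```python
def segment_coarse_to_fine(diff_pyramid, threshold, stop_level=0):
    """Refine a region mask from the coarsest pyramid layer downwards.

    diff_pyramid[0] is the finest layer of frame/background differences and
    each following layer has half the resolution. Returns one mask per
    visited layer, coarsest first, as sets of (y, x) coordinates.
    """
    top = len(diff_pyramid) - 1
    mask = {
        (y, x)
        for y, row in enumerate(diff_pyramid[top])
        for x, value in enumerate(row)
        if value > threshold
    }
    masks = [mask]
    for level in range(top - 1, stop_level - 1, -1):
        layer = diff_pyramid[level]
        h, w = len(layer), len(layer[0])
        refined = set()
        # Only children of pixels already in the mask are examined, so the
        # cost depends on the mask size, not on the full frame area.
        for py, px in mask:
            for cy in (2 * py, 2 * py + 1):
                for cx in (2 * px, 2 * px + 1):
                    if cy < h and cx < w and layer[cy][cx] > threshold:
                        refined.add((cy, cx))
        mask = refined
        masks.append(mask)
    return masks


if __name__ == "__main__":
    pyramid = [
        [[0, 0, 0, 0], [0, 9, 9, 0], [0, 9, 0, 0], [0, 0, 0, 0]],
        [[9, 9], [9, 0]],
        [[9]],
    ]
    for m in segment_coarse_to_fine(pyramid, threshold=5):
        print(sorted(m))
```

Raising `stop_level` ends the refinement early and returns a coarser mask, which corresponds to limiting the segmentation depth when the computing budget is tight.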
Figure 5: Multiscale segmentation: the original image and masks with increasing detail

3.4 Hybrid tracking system

Tracking algorithms build object trajectories, which allow more accurate recognition and the estimation of dynamic features. In classical implementations of embedded analytics the following algorithmic approaches are popular:
Figure 6: Tracking a shape-changing object using the region-linking algorithm

Methods for tracking objects in a video stream are described in more detail in [4], pp. 375-412. This paper proposes a hybrid method based on approaches (1) and (2). On the one hand, time series analysis of the detected regions can effectively track isolated objects, including objects that change shape significantly (Fig. 6). On the other hand, the correlation method makes it possible to track objects in a group (Fig. 7) or when the detector is not sensitive enough to find the regions. The results of the algorithms based on approaches (1) and (2) are combined by selecting the most likely estimate of the object. For approach (1) the estimate is scored by the contrast of the region against the background, and for approach (2) by the correlation between the object's features and the image area. Approach (1) is disabled when tracked objects overlap one another. The multiscale representation of the object's mask and features significantly increases the computational efficiency of the correlation algorithm and enlarges the search radius, that is, it solves the aperture problem [4], p. 379.
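The combination of the two approaches can be sketched as follows. The example implements a plain normalized cross-correlation search for approach (2) and a selection step that picks the higher-scoring estimate and ignores the region-based estimate when objects overlap. It is an illustration under stated assumptions: the scores of both approaches are assumed to be already scaled to a comparable range, the search runs on a single scale rather than on a pyramid, and all function names are hypothetical.

```python
import math


def ncc(frame, template, top, left):
    """Normalized cross-correlation between template and the frame patch at (top, left)."""
    th, tw = len(template), len(template[0])
    patch = [frame[top + y][left + x] for y in range(th) for x in range(tw)]
    tmpl = [v for row in template for v in row]
    mp, mt = sum(patch) / len(patch), sum(tmpl) / len(tmpl)
    num = sum((p - mp) * (t - mt) for p, t in zip(patch, tmpl))
    den = math.sqrt(sum((p - mp) ** 2 for p in patch) * sum((t - mt) ** 2 for t in tmpl))
    return num / den if den > 0 else 0.0


def correlation_estimate(frame, template, last_pos, radius):
    """Approach (2): best match within radius of last_pos; returns ((top, left), score)."""
    th, tw = len(template), len(template[0])
    h, w = len(frame), len(frame[0])
    best = (last_pos, -1.0)
    for top in range(max(0, last_pos[0] - radius), min(h - th, last_pos[0] + radius) + 1):
        for left in range(max(0, last_pos[1] - radius), min(w - tw, last_pos[1] + radius) + 1):
            score = ncc(frame, template, top, left)
            if score > best[1]:
                best = ((top, left), score)
    return best


def fuse(region_est, corr_est, objects_overlap):
    """Pick the more likely (position, score) estimate; either input may be None."""
    candidates = []
    if region_est is not None and not objects_overlap:
        candidates.append(region_est)  # approach (1) is not used under overlap
    if corr_est is not None:
        candidates.append(corr_est)
    return max(candidates, key=lambda e: e[1]) if candidates else None


if __name__ == "__main__":
    frame = [[0] * 5, [0] * 5, [0, 1, 2, 0, 0], [0, 3, 4, 0, 0], [0] * 5]
    template = [[1, 2], [3, 4]]
    corr = correlation_estimate(frame, template, last_pos=(1, 1), radius=2)
    print(fuse(((0, 0), 0.4), corr, objects_overlap=False))
```

The cost of the exhaustive search grows with the square of `radius`, which is why running it on coarse pyramid layers first, as the text suggests, allows a larger effective search radius.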
Figure 7: Individual tracking of objects at the moment they meet (top) and after they meet (bottom) using the correlation algorithm.

3.5 Pseudocode of the analytics pipeline

Below is pseudocode for the developed analytics pipeline. It describes the processing of one frame of the video sequence.

Pseudocode for the processing loop of a single frame

Get the original image I
The accuracy and resource consumption of the algorithm are controlled by the frame rate (skipping some frames is allowed), the resolution of the input frame, the depth of segmentation, region detection, region tracking, the search radius of the correlation method and other settings.

4. Implementation

The video analytics software for detecting and tracking objects, based on the algorithm given in Section 3.5, has been implemented on two hardware platforms: (1) x86, using SSE2 instructions, for testing and (2) a digital signal processor for embedding directly into a camera or video encoder (Fig. 8). Serial production of equipment with the developed analytics has been set up. The algorithms run in real time on all platforms at resolutions from 240 lines (standard definition) to 1080 lines (high definition).
Figure 8: Single-chip implementation of video analytics on the Texas Instruments DaVinci TMS320DM6467 platform.

5. Testing

The analytics was tested in-house on a dedicated test bench. Hardware and software were developed for automated testing of cameras and video servers with embedded analytics. The source material for testing was video recorded by a real street surveillance system. The set of video clips corresponds to the "sterile zone" scenario [10] and contains:
The video set consists of fragments recorded in different seasons and at different times of day, as well as under different weather conditions. The total duration of the video is about 38 hours. The source is a standard camera with a CCD sensor and an analog PAL output (720 x 576 at 25 fps). The digital storage format is MJPEG with a deliberately high bit rate of 40 Mbit/s, which keeps the quality of the recorded signal as close as possible to that of the live signal.

Table 1: Accuracy of the analytics in the "sterile zone" scenario
The video clips were annotated by an independent team of experts in security and video surveillance. The experts marked the moments at which the intruder appears in and disappears from the video. Situations that could potentially cause false alarms were marked in the same way, which made it possible to classify errors effectively during debugging. The expert markup (metadata) was recorded for each video clip in XML format so that tests could be programmed flexibly in scripting languages. The algorithm settings, except for depth calibration and the zone of interest, were identical for all video clips. Fitting algorithm parameters, such as sensitivity, to specific videos was not allowed.

7. Literature

[1] Fredrik Nilsson, Intelligent network video: understanding modern video surveillance systems, CRC Press, 2009
[2] N. V. Ptitsyn, Embedded video analytics: near-term prospects (in Russian), Sistemy Bezopasnosti (Системы безопасности), No. 2, 2010, pp. 80-83, http://www.secuteck.ru/imag/ss-2-2010/
[3] Massimo Piccardi, Background subtraction techniques: a review, IEEE International Conference on Systems, Man and Cybernetics, 2004, pp. 3099-3104, http://www.utsydney.cn/www-staffit/~massimo/BackgroundSubtractionReview-Piccardi.pdf
[4] Bernd Jähne, Digital image processing, 5th revised and extended edition, Springer, 2002, http://books.google.com/books?id=qUeecNvfn0oC&lpg=PP1&dq=Bernd%20J%C3%A4hne.%20Digital%20image%20processing&pg=PP1#v=onepage&q&f=false
[5] Maximilian Riesenhuber and Tomaso Poggio, Neural mechanisms of object recognition, Current Opinion in Neurobiology, 12, 2002, pp. 162-168, http://cbcl.mit.edu/projects/cbcl/publications/ps/nb120204.pdf
[6] Duane G. Albrecht, Wilson S. Geisler, Robert A. Frazor and Alison M. Crane, Visual cortex neurons of monkeys and cats: temporal dynamics of the contrast response function, Journal of Neurophysiology, 88, 2002, pp. 888-913, http://jn.physiology.org/cgi/content/abstract/88/2/888
[7] Parisa Darvish Zadeh Varcheie, Michael Sills-Lavoie and Guillaume-Alexandre Bilodeau, A Multiscale region-based motion detection and background, Sensors, No. 10, 2010, ISSN 1424-8220, http://www.mdpi.com/1424-8220/10/2/1041/pdf
[8] Eitan Sharon, Meirav Galun, Dahlia Sharon, Ronen Basri and Achi Brandt, Hierarchy and adaptivity in segmenting visual scenes, Nature, Vol. 442, August 2006, pp. 810-813, http://www.wisdom.weizmann.ac.il/~meirav/nature04977.pdf
[9] PETS: Performance evaluation of tracking and surveillance, http://www.hitech-projects.com/euprojects/cantata/datasets_cantata/dataset.html
[10] i-LIDS User guide: imagery library for intelligent detection systems, Publication No. 28/08 v2.0, Home Office Scientific Development Branch, pp. 25-34, http://scienceandresearch.homeoffice.gov.uk/hosdb/publications/cctv-publications/28-08_-_i-LIDS_User_Guide.pdf












![Индивидуальное сопровождение объектов в момент встречи при помощи корреляционного алгоритма. Фрагмент видео из PETS [9]](/images/stories/Articles/multiscale-va/img7_1.png)
![Индивидуальное сопровождение объектов после встречи при помощи корреляционного алгоритма. Фрагмент видео из PETS [9]](/images/stories/Articles/multiscale-va/img7_2.jpg)



