
Embedded video analytics for object detection and tracking using multiscale features

Please note that this article was automatically translated from Russian into English, so the translation may not be accurate.

Dr Nikolai Ptitsyn, Synesis

GraphiCon 2010

Novel video analytics algorithms are presented enabling embedded motion detection and object tracking for CCTV systems.

The motion detection algorithm is based on the neurobiological mechanism of the primary visual cortex V1. It uses a sequence of simple pixel operations, including a linear operator (weighted sum) and nonlinear operators (max, saturation).

The object tracking algorithm is a hybrid of two approaches: (1) time series analysis of motion detector regions and (2) spatial correlation between the current frame features and the object model features.

Unique advantages of the presented analytics pipeline include efficiency on high-definition (HD) video streams and the ability to track low-contrast overlapping objects against a dynamic background.

The embedded video analytics has been implemented and deployed on several platforms, including Texas Instruments DSPs.

A comprehensive testing environment was set up to estimate the overall performance of the video analytics implementations. A fully embedded implementation on a DSP has been i-LIDS approved both as a primary detection system for operational alert use and as an event-based recording system in sterile zone monitoring applications.

1. Introduction

Automating the processing of video streams in CCTV systems is an important scientific and engineering challenge. Video analytics is software based on machine vision algorithms that detects, tracks, classifies and/or identifies moving objects in the camera's field of view without an operator [1], pp. 287-312. A promising direction is embedding video analytics algorithms directly into the camera or IP device [2]. Compared with a server implementation, embedded analytics processes video free of the distortions introduced by an analog or digital communication channel. Processing video inside the device at a higher resolution and frame rate potentially provides higher accuracy.

On the other hand, known video analytics algorithms are difficult to adapt for embedding into mass-produced cameras because of their computational complexity. The hardware resources of a single-chip camera platform (processor instruction set, clock speed and memory size) are limited by constraints on heat dissipation and cost. Hence there is a need for fundamentally new algorithms with better computational efficiency, especially if the camera uses a high-definition sensor.

The main task of embedded analytics is to provide initial detection and tracking of targets in the field of view. The result of such analytics is the coordinates, trajectories and features of objects. Other tasks, such as refined classification, identification and inter-camera tracking, can be executed effectively on the server side.

2. The classical approach to the detection of objects

The general algorithmic approach to detecting moving objects is to analyze the differences between the current frame and a background model. In simplified terms, this approach is called background subtraction. Whether a pixel belongs to an object or to the background is decided from the deviation of the pixel value (brightness) in the current frame from the estimate given by a statistical background model (see video). There are many methods for modeling the background [3]. The most common are the running Gaussian average and the mixture of Gaussians:

  1. The running Gaussian average works well in a sterile environment where the background remains stationary. When the background changes globally, for example with moving trees, bushes or water, a unimodal model cannot effectively detect foreign objects.
  2. The mixture of Gaussians is multimodal and can describe the statistics of a changing background more accurately. However, because it models individual pixels rather than their patterns, it does not give a noticeable gain in detection accuracy. Moreover, the mixture of Gaussians is substantially more resource-intensive than the running average and usually does not fit within the computing power of an embedded processor.
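
The running Gaussian average can be illustrated with a minimal per-pixel sketch. The learning rate `alpha`, the threshold `k` (in standard deviations), the variance floor `min_var` and the selective update rule are illustrative assumptions, not values or choices taken from the paper.

```python
def update_pixel(mean, var, value, alpha=0.05, k=2.5, min_var=1.0):
    """Classify one pixel and update its running Gaussian background model.

    Returns (is_foreground, new_mean, new_var).
    alpha, k and min_var are hypothetical tuning parameters.
    """
    diff = value - mean
    # Foreground if the deviation exceeds k standard deviations.
    is_foreground = diff * diff > (k * k) * var
    if is_foreground:
        # Selective update: foreground pixels do not alter the background model.
        return True, mean, var
    new_mean = mean + alpha * diff
    new_var = max(min_var, (1.0 - alpha) * var + alpha * diff * diff)
    return False, new_mean, new_var


def detect(frame, means, variances, alpha=0.05, k=2.5):
    """Apply the per-pixel model to a frame given as a list of rows.

    means and variances are updated in place; the foreground mask is returned.
    """
    mask = []
    for y, row in enumerate(frame):
        mask_row = []
        for x, value in enumerate(row):
            fg, means[y][x], variances[y][x] = update_pixel(
                means[y][x], variances[y][x], value, alpha, k)
            mask_row.append(fg)
        mask.append(mask_row)
    return mask
```

Because the model keeps a single mean and variance per pixel, a pixel that alternates between two brightness levels (for example, rippling water) keeps exceeding the threshold, which is the unimodal limitation described in item 1.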

At the segmentation stage, the individual pixels separated from the background are merged into regions by means of morphological operations [4], pp. 481-495. Regions of suitable size and shape can be considered targets. As a rule, the morphological analysis stage has the highest computational cost. Its complexity increases nonlinearly with the area of the regions and with their number.

The main problems of embedded analytics based on the classical approach are as follows:

  1. Nonlinear growth of algorithmic complexity as the pixel size of the frame and/or of the targets increases. For example, most embedded algorithms work at resolutions from 160 x 120 to 320 x 240 pixels and cannot practically be used at the high-definition resolution of 1920 x 1080 pixels. The nonlinear growth in complexity is caused by the morphological operations that the detector uses to merge large regions. This restriction prevents the analytics from exploiting the potential of megapixel cameras and from extending its operating range.
  2. Insufficient accuracy when detecting low-contrast objects against a changing background.
  3. A high rate of false alarms caused by natural phenomena (clouds, wind, rain, snow, birds and insects).
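
To put the resolution gap in item 1 into numbers, the short sketch below compares pixel counts. The cost exponent is a purely hypothetical parameter used to show how a superlinear cost amplifies the gap; the paper does not state the actual growth law.

```python
def relative_cost(src, dst, exponent=1.0):
    """Cost ratio between two frame sizes if cost grows as pixels ** exponent.

    src and dst are (width, height) tuples; exponent is a hypothetical
    parameter (1.0 means cost proportional to the number of pixels).
    """
    n_src = src[0] * src[1]
    n_dst = dst[0] * dst[1]
    return (n_dst / n_src) ** exponent


qvga = (320, 240)
full_hd = (1920, 1080)

# Full HD has 27 times more pixels than 320 x 240.
print(relative_cost(qvga, full_hd))
# With an assumed superlinear exponent of 1.5 the gap grows to about 140 times.
print(relative_cost(qvga, full_hd, exponent=1.5))
```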

3. The new algorithm

3.1 Neurobiological mechanism

The idea of the proposed algorithm is borrowed from nature, where the evolution of the nervous system of living beings has achieved outstanding results in visual analysis [5]. Consider the functional diagram of the primary visual cortex V1 (Fig. 1), which is well developed in primates, including humans. The neural network is composed of cells of two types:

  1. Simple cells, denoted by the letter S, perform a linear operation of weighted summation (sum), i.e. a two-dimensional convolution:

$$S(x, y) = \sum_{(i, j) \in N} w(i, j)\, I(x + i, y + j),$$

where $S(x, y)$ is the output signal at point $(x, y)$, $w(i, j)$ is the weighting factor (convolution kernel) in the neighborhood $N$, and $I(x + i, y + j)$ is the input signal at point $(x + i, y + j)$.

  2. Complex cells, denoted by the letter C, perform a nonlinear operation of selecting the maximum value (max):

$$C(x, y) = \max_{(i, j) \in N} S(x + i, y + j).$$
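
The two cell types can be sketched in a few lines of plain Python. The function names, the "valid" border handling and the non-overlapping 2 x 2 pooling window are assumptions made for illustration; the paper does not specify them.

```python
def simple_layer(image, kernel):
    """Simple cells: weighted sum over a neighbourhood (valid-mode 2D correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - kh + 1):
        row = []
        for x in range(w - kw + 1):
            acc = 0.0
            for j in range(kh):
                for i in range(kw):
                    acc += kernel[j][i] * image[y + j][x + i]
            row.append(acc)
        out.append(row)
    return out


def complex_layer(responses, size=2):
    """Complex cells: maximum over non-overlapping size x size neighbourhoods."""
    h, w = len(responses), len(responses[0])
    return [
        [max(responses[y + j][x + i] for j in range(size) for i in range(size))
         for x in range(0, w - size + 1, size)]
        for y in range(0, h - size + 1, size)
    ]


# A vertical edge and the same edge shifted by one pixel.
kernel = [[-1, 1]]  # horizontal gradient, responds to vertical edges
edge_a = [[0, 0, 10, 10, 10] for _ in range(4)]
edge_b = [[0, 10, 10, 10, 10] for _ in range(4)]
print(complex_layer(simple_layer(edge_a, kernel)))
print(complex_layer(simple_layer(edge_b, kernel)))
```

Both calls produce the same pooled response because the one-pixel shift stays inside a single pooling window, which is the shift invariance attributed to the complex cells.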

img1

Figure 1: Image processing in the primary visual cortex: dashes - features corresponding to directional filters; S1, S2 - layers of simple cells; C1, C2 - layers of complex cells; solid blue line - weighted sum; dashed green line - selection of the maximum.

At the input of the neural network, the original image from the retina is processed by the simple cells S1. Simple cells implement directional filtering, which extracts edges of particular orientations. The purpose of a directional filter is to extract image features that are invariant to illumination. Fig. 1 shows four directions: horizontal, vertical and two diagonal. Similar gradient-based edge detectors are widely used in machine vision [4], pp. 315-338.

At the level of complex cells C1, the outputs of simple cells S1 are grouped for each direction and the maximum value is selected. A complex cell is selective to a feature and provides invariance to a shift of the input within the neighborhood of the grouped neurons.

At the level of simple cells S2, a weighted sum of the outputs of complex cells C1 is computed. Summing signals of different features produces, at the S2 level, composite features that combine local data from several directions. They resemble Haar features but, thanks to the preceding layer of complex cells, generalize better to deformable shapes.

At the level of complex cells C2, the nonlinear max operation is applied again. It groups not only the outputs of the preceding layer S2 (position invariance) but also the outputs of the lower layer C1 (scale invariance). Thus, at the C2 level simple and composite features are combined to achieve invariance to shift and scale simultaneously.

Other important properties of visual cortex cells are (1) nonlinear behavior in time and (2) nonlinear contrast transfer [6]. Consider property (2): the nonlinear transformation of contrast shows up as saturation (the saturate operation) of the feature output at a certain level, which normalizes the feature under non-uniform contrast. Saturation occurs in both simple and complex cells.

The Naka-Rushton equation approximates the saturating transfer characteristic (Fig. 2):

$$y = y_{\max} \frac{x}{x + x_{50}},$$

where $x$ is the feature value at the cell input, $x_{50}$ is the half-saturation point (see Fig. 2), $y$ is the cell output value and $y_{\max}$ is the maximum output value. The transfer characteristic can also be viewed as the activation function of a neuron.
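
A minimal sketch of the saturate operation, assuming the basic Naka-Rushton form with output `y_max * x / (x + x_half)`. The optional exponent `n` is the common generalization of the equation and is an assumption here; with the default `n = 1` it has no effect.

```python
def naka_rushton(x, x_half, y_max=1.0, n=1.0):
    """Saturating transfer characteristic.

    x      - feature value at the cell input (non-negative)
    x_half - half-saturation point: the output is y_max / 2 when x == x_half
    y_max  - maximum output value
    n      - optional steepness exponent (assumption, 1.0 gives the basic form)
    """
    if x <= 0:
        return 0.0
    xn = x ** n
    return y_max * xn / (xn + x_half ** n)


# The response is roughly linear for weak inputs and saturates for strong ones.
for contrast in (1, 10, 100, 1000):
    print(contrast, round(naka_rushton(contrast, x_half=10), 3))
```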

 


Figure 2: Transfer characteristic of visual cortex cells

Note the following properties of the visual cortex V1 for their subsequent adaptation to machine analytics:

  1. The neural network generalizes the data and reduces their dimensionality by eliminating variations in position and scale. As data propagate from the retina through the visual cortex, the spatial detail of the original image decreases while the dimensionality of the features increases.
  2. The primary feature is not the absolute brightness at each point in space, as registered by a light-sensitive cell, but derived features obtained by applying directional filters to the image.
  3. Simple and complex cells normalize the output signal by saturating the feature value at a certain constant level (the saturate operation).
  4. Composite features are obtained by the linear weighted summation (sum) of different features. Gaussian summation of signals of the same feature is also used to generalize them.
  5. Invariance to the geometric transformations of translation and scaling is achieved by the nonlinear max operation. Horizontal aggregation of the outputs of one layer of cells provides shift invariance, and vertical aggregation of the outputs of cells from one or more preceding layers provides scale invariance.

The max and sum operations are applied alternately and iteratively. The composition of two simple operations (one linear, one nonlinear) yields a highly complex video analysis system as a whole. A similar technique is used in block ciphers to maximize the diffusion of data within a block with a minimum number of arithmetic operations.

3.2 Multiscale representation

The multiscale approach [4], pp. 125-142, has already been used successfully for motion detection [7] and for segmentation of complex scenes [8]. However, these algorithms are currently unsuitable for mass use in video surveillance cameras because of their computational complexity. This paper considers approaches that reduce the resource consumption of the algorithms by several orders of magnitude and make them applicable in embedded video analysis systems.

Consider the multiscale representation of one feature (simple or composite) as a pyramid in Fig. 3. There may be several such pyramids, one for each feature, as well as pyramids for segmentation masks and other auxiliary data.

img3

Figure 3: Two phases of multiscale video analysis: x, y - spatial coordinates of the image, s - scale.

An optimal algorithm for embedded video analytics includes two phases of multiscale data processing:

  1. Forward propagation from the detailed to the coarse representation (generalization).
  2. Back propagation from the coarse to the detailed representation (refinement). In the back phase, processing can be confined to the areas of detection and tracking.

During back propagation, each subsequent layer of the pyramid is calculated using the simple linear and nonlinear operations discussed above. The sum operation can be analogous to the reduce operation used to construct a Gaussian pyramid [4], p. 137. The max operation strengthens the features and prevents them from being "smeared" across the pyramid and, as discussed above, provides shift and scale invariance. The saturate operation is important for stable operation of the detectors under non-uniform illumination and noise.
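
The difference between building a pyramid with sum and with max can be seen in a small sketch. A 2 x 2 box average stands in for the Gaussian reduce kernel here; that substitution, the helper names and the factor-of-two decimation are assumptions made to keep the example short.

```python
def _reduce(layer, op):
    """Halve a layer by applying op to each non-overlapping 2 x 2 block."""
    h, w = len(layer) // 2, len(layer[0]) // 2
    return [
        [op([layer[2 * y][2 * x], layer[2 * y][2 * x + 1],
             layer[2 * y + 1][2 * x], layer[2 * y + 1][2 * x + 1]])
         for x in range(w)]
        for y in range(h)
    ]


def reduce_sum(layer):
    """Linear reduce: box average (a crude stand-in for the Gaussian kernel)."""
    return _reduce(layer, lambda values: sum(values) / 4.0)


def reduce_max(layer):
    """Nonlinear reduce: keep the strongest response in each block."""
    return _reduce(layer, max)


def build_pyramid(base, reduce, levels):
    """Return [base, reduce(base), reduce(reduce(base)), ...] with `levels` layers."""
    pyramid = [base]
    for _ in range(levels - 1):
        pyramid.append(reduce(pyramid[-1]))
    return pyramid


# A single strong feature response in a 4 x 4 layer.
base = [[0] * 4 for _ in range(4)]
base[1][1] = 8
print(build_pyramid(base, reduce_sum, 3)[-1])  # the response is diluted
print(build_pyramid(base, reduce_max, 3)[-1])  # the response is preserved
```

With averaging, the isolated response fades at every level, while max carries it to the top of the pyramid unchanged, which matches the remark that max keeps features from being "smeared".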

The optimal set of features and the sequence of sum, max and saturate operations depend on the specific functions performed by the embedded analytics:

  1. Detecting and tracking objects requires a relatively small set of features (1-4 features). Composite features are usually not required. On the other hand, increasing the number of features makes it possible to simplify the statistical background model and to increase the sensitivity of the detector against a changing background.
  2. Recognizing object types and identifying objects requires a rich feature representation. In practice, a video analytics algorithm can use 8-64 composite features.

Fig. 4 shows the result of applying video analytics with three features (brightness and two saturated gradients). The algorithm reliably detects ducks against a changing background (water ripples with high-contrast reflections). It uses a unimodal probabilistic background model, the multiscale segmenter described below, and a primitive tracking algorithm (linking regions into a trajectory without building a statistical model of the object).

Figure 4: Tracking ducks against a changing, high-contrast background. The full video can be viewed at http://www.youtube.com/watch?v=PmJTnClUjYw

3.3 Object segmentation

The object mask computed by the segmenter is useful for calculating the object features used in tracking and for modeling the background more accurately.

As noted above, the major shortcoming of the classical approach is the high resource consumption of the morphological operations at the stage of determining the region mask. The multiscale approach significantly improves the computational efficiency of the analytics by limiting the segmentation depth and/or by using a model of the object's shape. As a result, the size of the detected objects, which can vary significantly across the field of view, does not significantly affect the computational cost of the algorithm.

Consider the proposed algorithmic approach in detail. During back propagation of the signal through the pyramid (Fig. 3), a multiscale mask of the region is formed. Segmentation proceeds from the coarse mask to the detailed one, and the process can be stopped once the desired level of detail is reached or the quota of computing resources is exhausted. The input data are the region mask from the previous layer, the difference between the features of the current frame and the background model on the current layer, and an optional model of the detected object. The output is the region mask on the current layer. The mask is refined at each pixel using the pyramid linking method [4], pp. 433-436.
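
The coarse-to-fine idea can be sketched as follows. This is a deliberately simplified stand-in for pyramid linking: each fine pixel is tested only if its parent on the coarser layer is already in the mask, and a plain threshold replaces the linking rule. The function names, the threshold and the `max_depth` budget are illustrative assumptions.

```python
def refine_mask(coarse_mask, fine_diff, threshold):
    """Project a coarse mask one layer down and refine it on the finer layer.

    Only the 2 x 2 children of coarse foreground pixels are examined, so the
    work is proportional to the object area rather than to the frame area.
    """
    h, w = len(fine_diff), len(fine_diff[0])
    fine_mask = [[False] * w for _ in range(h)]
    for cy, row in enumerate(coarse_mask):
        for cx, is_foreground in enumerate(row):
            if not is_foreground:
                continue
            for y in range(2 * cy, min(2 * cy + 2, h)):
                for x in range(2 * cx, min(2 * cx + 2, w)):
                    fine_mask[y][x] = fine_diff[y][x] > threshold
    return fine_mask


def segment(diff_pyramid, threshold, max_depth):
    """Segment from the coarsest layer down, stopping after max_depth refinements.

    diff_pyramid[0] is the most detailed layer, diff_pyramid[-1] the coarsest.
    Returns (level reached, mask on that level).
    """
    level = len(diff_pyramid) - 1
    mask = [[d > threshold for d in row] for row in diff_pyramid[level]]
    steps = 0
    while level > 0 and steps < max_depth:
        level -= 1
        mask = refine_mask(mask, diff_pyramid[level], threshold)
        steps += 1
    return level, mask
```

Setting `max_depth` below the number of layers models stopping early when the desired level of detail is reached or the computing quota is exhausted.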

Fig. 5 shows the result of the multiscale segmenter based on pyramid linking over several features: at the top is the original frame processed by the video analytics, with the trajectory of the tracked person; below are four layers of the segmentation mask. The masks contain minor errors in the form of isolated points and imprecise boundaries, caused by the computational optimization of the morphological operators.

Figure 5: Multiscale segmentation: the original image and masks of increasing detail

3.4 Hybrid tracking system

Tracking algorithms provide object trajectories for more accurate recognition and for the assessment of dynamic features.

The following algorithmic approaches are popular in classical implementations of embedded analytics:

  1. Linking the regions found by the detector over a set of consecutive frames to compute the trajectory of the object. This is the easiest method to implement. Its main drawback is that detector errors in the regions lead to undesirable breaks in the trajectory. The method also cannot track individual objects within a group and loses track when false regions appear. Objects that move slowly or stop "grow into" the background, and the tracking algorithm loses the target.
  2. Correlation methods build a statistical model not only of the background but also of the object. The degree of similarity to the object model, evaluated at different points in the vicinity of the object, determines its most probable position [4], p. 407. The advantage over the first approach is the ability to track partially overlapping objects in a group, as well as more stable operation with low-contrast or slow objects. The main disadvantage is significantly higher resource consumption. Correlation methods are ineffective at low frame rates and when the tracked objects vary strongly.
  3. Optical flow is based on the assumption that the illumination is constant and that the shape and texture of the background and of the tracked object do not change [4], p. 385. Optical flow is computationally more efficient than the correlation method but is less robust to noise and to variability of the object.
Figure 6: Tracking a deformable object using the region linking algorithm

Tracking methods for video streams are described in more detail in [4], pp. 375-412.

This paper proposes a hybrid method based on approaches (1) and (2). On the one hand, time series analysis of the detected regions tracks isolated objects effectively, including objects that change shape significantly (Fig. 6). On the other hand, the correlation method makes it possible to track objects in a group (Fig. 7) or when the detector is not sensitive enough to find the regions.

The results of the algorithms based on approaches (1) and (2) are combined by selecting the most likely estimate of the object. For approach (1) the estimate is based on the contrast of the region against the background, and for approach (2) on the correlation of the features within the object area. Approach (1) is disabled when tracked objects overlap each other.
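
The selection rule can be sketched as below. The `Estimate` type, the `fuse` function and the assumption that both scores are already normalized to a comparable range are hypothetical; the paper does not say how region contrast and feature correlation are put on a common scale.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Estimate:
    position: Tuple[float, float]  # estimated object position (x, y)
    score: float                   # likelihood-like value, assumed comparable across methods


def fuse(region: Optional[Estimate],
         correlation: Optional[Estimate],
         overlapping: bool) -> Optional[Estimate]:
    """Pick the most likely estimate among the two tracking approaches.

    region      - estimate from region linking (score from region contrast)
    correlation - estimate from feature correlation with the object model
    overlapping - True if the object overlaps another tracked object, in which
                  case the region-linking estimate is not used
    """
    candidates = []
    if region is not None and not overlapping:
        candidates.append(region)
    if correlation is not None:
        candidates.append(correlation)
    if not candidates:
        return None  # the object is lost on this frame
    return max(candidates, key=lambda e: e.score)
```

For an isolated object with a high-contrast region the region-linking estimate usually wins; when objects meet, only the correlation estimate remains available.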

The multiscale representation of the object mask and features significantly increases the computational efficiency of the correlation algorithm and enlarges the search radius, that is, it solves the aperture problem [4], p. 379.

Figure 7: Individual tracking of objects at the moment they meet (top) and after they meet (bottom) using the correlation algorithm. Video fragment from PETS [9].

3.5 Pseudocode of the analytics pipeline

Below is pseudocode for the developed analytics pipeline. It describes the processing of one frame of the video sequence.

Pseudocode for the processing loop of a single frame


Get the original image I

1. Construct a Gaussian pyramid P_I from I
2. Calculate the gradient and composite feature pyramids P_F1, P_F2, ... from P_I (Section 3.2)
3. Calculate the difference pyramid P_D between the feature pyramids of the current frame P_F1, P_F2, ... and the background pyramids P_B1, P_B2, ...
4. Obtain the region mask pyramid P_M from P_D using the segmentation algorithm (Section 3.3)
5. Calculate the trajectories of moving objects with the hybrid tracking algorithm (Section 3.4) from P_M (region linking method) and from P_I, P_B1, P_B2 and the object models (feature correlation method)
6. Update the background model P_B1, P_B2, ... by calculating the Gaussian average of P_F1, P_F2, ..., with the moving objects masked out by P_M
7. Update the object models by calculating the Gaussian average of P_F1, P_F2, ... under the mask P_M
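
A heavily reduced sketch of the per-frame loop is given below. It covers only steps 1, 3, 4 and 6 with a single brightness feature: a box average replaces the Gaussian kernel, a per-layer threshold replaces the segmentation algorithm, and tracking and object models (steps 5 and 7) are omitted. All names and parameter values are illustrative assumptions.

```python
def reduce_mean(layer):
    """Halve a layer by averaging each non-overlapping 2 x 2 block."""
    h, w = len(layer) // 2, len(layer[0]) // 2
    return [
        [(layer[2 * y][2 * x] + layer[2 * y][2 * x + 1]
          + layer[2 * y + 1][2 * x] + layer[2 * y + 1][2 * x + 1]) / 4.0
         for x in range(w)]
        for y in range(h)
    ]


def build_pyramid(image, levels):
    """Step 1: pyramid of the frame (finest layer first), stored as floats."""
    layers = [[[float(v) for v in row] for row in image]]
    for _ in range(levels - 1):
        layers.append(reduce_mean(layers[-1]))
    return layers


def process_frame(image, background, levels=2, threshold=10.0, alpha=0.1):
    """Process one frame; background is a pyramid that is updated in place.

    Returns the mask pyramid (True marks a moving pixel on that layer).
    """
    features = build_pyramid(image, levels)            # steps 1-2 (brightness only)
    masks = []
    for f_layer, b_layer in zip(features, background):
        mask_layer = []
        for y, row in enumerate(f_layer):
            mask_row = []
            for x, f in enumerate(row):
                diff = f - b_layer[y][x]                # step 3: difference pyramid
                moving = abs(diff) > threshold          # step 4: threshold instead of segmentation
                if not moving:
                    b_layer[y][x] += alpha * diff       # step 6: masked running average
                mask_row.append(moving)
            mask_layer.append(mask_row)
        masks.append(mask_layer)
    return masks
```

A typical use would initialize `background` with `build_pyramid(first_frame, levels)` and then call `process_frame` for every subsequent frame.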

The accuracy and resource consumption of the algorithm are controlled by the frame rate (frames may be partially skipped), the input frame resolution, the segmentation depth, the region detection and region tracking settings, the search radius of the correlation method and other settings.

4. Implementation

The video analytics software for detecting and tracking objects, based on the algorithm given in Section 3.5, has been implemented on two hardware platforms: (1) x86 with SSE2 instructions, for testing, and (2) a digital signal processor, for embedding directly into a camera or video encoder (Fig. 8). Batch production of equipment with the developed analytics has been set up.

The algorithms run in real time on all platforms at resolutions from 240 lines (standard definition) to 1080 lines (high definition).

Figure 8: Single-chip implementation of video analytics on the Texas Instruments DaVinci TMS320DM6467 platform. The board measures 80 x 55 x 14 mm, which corresponds to the dimensions of a bank card.

5. Testing

Internal tests of the analytics were conducted on a dedicated test bench. Hardware and software were developed for automated testing of video cameras and video servers with embedded analytics.

The source material for testing was video recorded by a real outdoor surveillance system. The set of video clips corresponds to the "sterile zone" scenario [10] and contains:

• 432 cases of perimeter violation (moving at different speeds, walking, running, rolling, crawling, wearing camouflage, carrying a ladder, in a group and along an unusual path);
• about 500 cases of potential false positives (sharp changes in illumination, moving shadows, camera shake, small mammals, birds, insects on the lens, heavy snow, rain, fog).

The video set consists of fragments recorded in different seasons, at different times of day and under different weather conditions. The total duration of the video is about 38 hours. The source is a standard camera with a CCD sensor and an analog PAL output (720 x 576 at 25 fps). The digital storage format is MJPEG with a deliberately high data rate of 40 Mbit/s, which brought the quality of the recorded signal close to that of the live signal.

Table 1: Accuracy of the analytics in the "sterile zone" scenario

Role               | Weight parameter | Sensitivity | Specificity | Weighted average accuracy
Formula            | form16           | form17      | form18      | form19
Operational alert  | 0.65             | 1.00        | 1.00        | 1.00
Event recording    | 75.00            | 1.00        | 1.00        | 1.00

The video clips were annotated by an independent team of experts in security and surveillance. The experts marked the moments at which the intruder appears in and disappears from the video. Situations that could cause false positives were marked in the same way, which made it possible to classify errors effectively during debugging. The expert annotations, or metadata, were recorded for each video clip in XML format so that they could be processed flexibly with scripting languages.

The algorithm settings were identical for all video clips, except for the depth calibration and the area of interest. "Fitting" algorithm parameters, such as sensitivity, to specific videos was not allowed.

The video analytics had to register a violation within 10 seconds. A later detection was counted as a missed violation, i.e. a false negative (counter c).

A repeated detection after a break in the trajectory was counted as a false positive (counter b). Thus, the test assesses the quality not only of the detector but also of the tracking system.
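
The counting rules can be sketched as a small scoring function. The data layout (event intervals and alarm timestamps in seconds), the function name and two details are assumptions: every extra alarm inside a detected event is treated as a repeated detection, and late alarms inside a missed event are simply ignored, since the text does not say how they are counted.

```python
def score(events, alarms, max_delay=10.0):
    """Count true positives (a), false positives (b) and false negatives (c).

    events - list of (start, end) intervals of annotated violations, in seconds,
             assumed not to overlap
    alarms - list of alarm timestamps raised by the analytics, in seconds
    """
    a = b = c = 0
    used = set()
    for start, end in events:
        in_event = [i for i, t in enumerate(alarms) if start <= t <= end]
        on_time = [i for i in in_event if alarms[i] - start <= max_delay]
        if on_time:
            a += 1
            b += len(in_event) - 1   # repeated detections of the same violation
        else:
            c += 1                   # missed, or detected later than max_delay
        used.update(in_event)
    b += len(alarms) - len(used)     # alarms outside any annotated violation
    return a, b, c


# Three violations: one detected twice, one detected too late, one missed,
# plus one alarm with no violation present.
print(score([(0, 30), (100, 130), (200, 220)], [5, 20, 112, 150]))
```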

Internal testing showed similar results on the x86 platform and on the signal processor:

Number of true positives (no error): a = 432
Number of false positives (type I error): b = 0
Number of false negatives (type II error): c = 0

The accuracy in the "sterile zone" scenario was calculated using the i-LIDS method [10]. The calculation is shown in Table 1. The weighted average accuracy values for the operational alert and event recording roles coincided and were ideal: F1 = 1.000.

The video analytics also underwent external independent testing on another set of video clips that was unknown to the developers. The weighted average accuracy for the operational alert and event recording roles was F1 = 0.997.

The accuracy averaged over the results of internal and external testing is F1 = 0.999.
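
The weighted accuracy can be reproduced with the sketch below. It assumes the i-LIDS definitions as recalled here, namely recall R = a / (a + c), precision P = a / (a + b) and F1 = (α + 1) R P / (R + α P) with the role weights 0.65 and 75 from Table 1; these should be checked against the user guide [10], since the formula images of the original table are not available.

```python
def f1_score(a, b, c, alpha):
    """Weighted F1 measure.

    a - true positives, b - false positives, c - false negatives
    alpha - recall bias: values above 1 favour recall, values below 1 favour precision
    """
    if a == 0:
        return 0.0
    recall = a / (a + c)
    precision = a / (a + b)
    return (alpha + 1) * recall * precision / (recall + alpha * precision)


ROLES = {"operational alert": 0.65, "event recording": 75.0}

# Internal test counters reported above: a = 432, b = 0, c = 0.
for role, alpha in ROLES.items():
    print(role, f1_score(432, 0, 0, alpha))
```

With no false positives and no false negatives both recall and precision equal 1, so F1 equals 1 for any weight, which is why the two roles coincide in Table 1.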

6. Conclusion

This article gives an example of the successful adaptation of a neurobiological mechanism of living organisms to intelligent CCTV devices. New video analysis algorithms for motion detection, segmentation and hybrid tracking have been developed and implemented.

A distinctive feature of the developed algorithms is the use of multiscale features in the form of a pyramid. Using several feature pyramids made it possible both to dispense with resource-intensive multimodal probabilistic background modeling and to increase the accuracy of the detector.

The accuracy of automatic situation recognition by the production equipment in the "sterile zone" scenario, measured by the i-LIDS method, is F1 = 1.000 in internal testing and F1 = 0.997 in independent testing.

Promising directions for future work are the study of composite features and their use for more accurate object classification.

7. References

[1] Fredrik Nilsson. Intelligent network video: understanding modern video surveillance systems, CRC Press, 2009

[2] N. V. Ptitsyn. Embedded video analytics: near-term prospects (in Russian), Sistemy Bezopasnosti, No. 2, 2010, pp. 80-83, http://www.secuteck.ru/imag/ss-2-2010/

[3] Massimo Piccardi. Background subtraction techniques: a review, IEEE International Conference on Systems, Man and Cybernetics, 2004, pp. 3099-3104, http://www.utsydney.cn/www-staffit/~massimo/BackgroundSubtractionReview-Piccardi.pdf

[4] Bernd Jähne. Digital image processing, 5th revised and extended edition, Springer, 2002, http://books.google.com/books?id=qUeecNvfn0oC&lpg=PP1&dq=Bernd%20J%C3%A4hne.%20Digital%20image%20processing&pg=PP1#v=onepage&q&f=false

[5] Maximilian Riesenhuber and Tomaso Poggio. Neural mechanisms of object recognition, Current Opinion in Neurobiology, 12, 2002, pp. 162-168, http://cbcl.mit.edu/projects/cbcl/publications/ps/nb120204.pdf

[6] Duane G. Albrecht, Wilson S. Geisler, Robert A. Frazor and Alison M. Crane. Visual cortex neurons of monkeys and cats: temporal dynamics of the contrast response function, Journal of Neurophysiology, 88, 2002, pp. 888-913, http://jn.physiology.org/cgi/content/abstract/88/2/888

[7] Parisa Darvish Zadeh Varcheie, Michael Sills-Lavoie and Guillaume-Alexandre Bilodeau. A multiscale region-based motion detection and background subtraction algorithm, Sensors, Vol. 10, 2010, ISSN 1424-8220, http://www.mdpi.com/1424-8220/10/2/1041/pdf

[8] Eitan Sharon, Meirav Galun, Dahlia Sharon, Ronen Basri and Achi Brandt. Hierarchy and adaptivity in segmenting visual scenes, Nature, Vol. 442, August 2006, pp. 810-813, http://www.wisdom.weizmann.ac.il/~meirav/nature04977.pdf

[9] PETS: Performance evaluation of tracking and surveillance, http://www.hitech-projects.com/euprojects/cantata/datasets_cantata/dataset.html

[10] i-LIDS User guide: imagery library for intelligent detection systems, Publication No. 28/08 v2.0, Home Office Scientific Development Branch, pp. 25-34, http://scienceandresearch.homeoffice.gov.uk/hosdb/publications/cctv-publications/28-08_-_i-LIDS_User_Guide.pdf