Face Positioned Driver Drowsiness Detection Using Multistage Adaptive 3D Convolutional Neural Network

Accidents due to driver drowsiness are increasing at an alarming rate across all countries, making it necessary to identify driver drowsiness in order to reduce accident rates. Researchers have applied many machine learning and deep learning techniques, especially CNN variants created for drowsiness detection.


Introduction
Road accidents are dangerous to the human community. According to a National Highway Traffic Safety Administration report (USA), 24.5% of accidents, and according to a Ministry of Road Transport and Highways report (India), 27% of accidents are caused by driver fatigue. The risk of an accident increases by four to five times in almost all countries [11]. Frequent accidents, especially in a densely populated country like India, heavily affect public safety. As a necessary consequence, research towards detecting driver drowsiness is significant. To prevent drowsiness, behavioural strategies including drinking tea or coffee, stopping for a short nap, and riding with a passenger are typically advised. These precautions, however, might not work if a motorist is not aware that he or she is tired [12]. Drowsiness is usually indicated by excessive yawning, bowing or sliding of the head, and persistent blinking, and a variety of methods exist to measure the driver drowsiness level; however, under Indian conditions, most existing work does not produce good results. For example, while yawning, Indians habitually cover the mouth with a hand so as not to disturb others, and this habit carries over to most Indian drivers while driving; conventional models do not consider this feature specific to Indian conditions. In this work, most drowsiness features are considered and a global driver drowsiness detection model is built using a 3D deep convolutional neural network. Driver drowsiness is linked to psychological and physiological changes of the driver such as blink rate, pulse rate, anxiety, and so on. Generally, detection methods fall into four categories: image-based measures, vehicle-based measures, biological-based measures, and hybrid-based measures [2]. In image-based measures, drowsiness symptoms are observed and recorded using cameras or visual sensors; these measures are further divided by lip motion, head movements, and frequency of eye closures [25]. Although computer-vision-based sleepiness detection techniques are the most useful, their effectiveness is influenced by changes in lighting, facial expressions, and pose. With the advancement of deep learning, however, sleepiness detection methods based on convolutional neural networks (CNNs) are now state of the art [23].
A deep belief network [26] was introduced to identify facial landmarks on the authors' own dataset, collected across different ages, genders, and illumination conditions, with 68 facial landmarks identified. Using deep learning architectures such as ResNet50, VGG16, Inception V3, and VGG19, various DDD systems [8, 21, 22] have been introduced, but when methods are combined to work sequentially to improve performance, a feature fusion issue arises, resulting in the loss of important facial elements [4]. To avoid feature fusion losses, a deep cascaded convolutional neural network that identifies exact features of the facial regions is trained offline and blended for online monitoring [4]; the spatial and temporal space of the facial regions is explored, after which bilinear feature fusion [3] passes frame-level annotations to an LSTM for drowsiness detection [7]. These methods respond slowly, especially under Indian conditions. A residual 3D CNN architecture was introduced and compared with similar 2D networks to demonstrate the advantages of spatio-temporal learning [27].
A conditional spatio-temporal data representation using a 3D-DCNN framework was introduced for learning directly from inputs for driver drowsiness detection, without considering real-time online monitoring of driver states [24]. For online monitoring, a 3D Conditional Generative Adversarial Network with Two-Level Attention Bidirectional Long Short-Term Memory (3DcGAN-TLABiLSTM) [10] was introduced and achieves a reasonable frame processing time compared with 36.9 fps in 3D-DCNN. Even with the variety of CNNs available, eye and mouth conditions perform poorly in almost all DDD models because they occupy a small part of the frame; special features for the eye and mouth are used in an ensemble multi-CNN deep learning model [5, 1]. R-CNN was introduced as an alternative model, achieving 93% accuracy [6]. All of the works discussed above suffer from high intrusion, low robustness, and low reliability, and require large processing power. This leaves huge scope and demand for driver drowsiness detection research.
We propose a multi-stage adaptive 3D-CNN for a face positioned DDD system, with the following innovations:

1. A three-stage model with a non-intersection-over-union suppression technique to identify five points (left eye, right eye, nose, left end of mouth, and right end of mouth) along with the bounding boxes, using a feather-like CNN architecture carefully designed to operate efficiently on frames (Stage 1).
2. A separate adaptive learning scheme (learning from samples irrespective of class) for understanding the state of the driver, designed using a multistage adaptive 3D-CNN to increase drowsiness classification performance (Stages 2 to 5).

Preliminaries
The Convolutional Neural Network (CNN) was initially introduced [15] as a weighted filter model with multiple connected layers. CNNs are popularly used in vision-based tasks such as image classification, recognition, and object detection. Their design provides invariance to scaling, shifting, and distortion through the defined segmented area in the convolution process (the temporary space where convolution takes place), weight sharing, and spatio-temporal (ST) sampling. Because CNNs use local connections and weight sharing, locally minimal meaningful features are extracted, and this property makes a CNN a preliminary feature detector for small parts of an image across a set of images.
The major part of convolution relies on identifying the feature map; the unit at position (m, n) of the i-th feature map in the j-th layer of a 2D convolution is given by

$$a^{ij}_{mn} = \alpha\left(\sum_{a=0}^{H-1}\sum_{b=0}^{W-1} x_{(m+a)(n+b)}\, w^{ij}_{ab} + B^{ij}\right)$$

where α is the activation function, x is the latent information (unit pixel value) feeding position (m, n) of the i-th feature map in the j-th layer, and w is the kernel associated with the local feature map. The feature height and width are H and W, and a, b are their running indices. A different bias B^{ij} is associated with each feature map generated at each layer. The dimensions of a feature map are reduced by pooling over spatially adjacent values of the previous feature maps. The final feature does not only contain local information; it can be combined with other local spatial neighbours to describe the whole image. Even though the features from a 2D-CNN are robust and effective in sequential data applications, the 2D-CNN considers only spatial data and cannot produce good results for dynamic, time-oriented applications. To process the additional temporal information in sequential data, the 3D-CNN was introduced [13]. A 3D feature map is convolved with a 3D volume formed from a combined set of image inputs to create a latent 3D feature map for the next layer. Through this method, additional temporal information is captured; the unit at position (m, n, t) in a 3D convolution is given by

$$a^{ij}_{mnt} = \alpha\left(\sum_{a=0}^{H-1}\sum_{b=0}^{W-1}\sum_{c=0}^{D-1} x_{(m+a)(n+b)(t+c)}\, w^{ij}_{abc} + B^{ij}\right)$$

where α is the activation function, x is the latent information (unit pixel value) at position (m, n, t) of the i-th feature map in the j-th layer, and w is the 3D kernel associated with the local feature map. A minimal numerical sketch of these unit computations follows.
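To make the convolution equations concrete, the sketch below computes a single 2D and a single 3D feature-map unit as in the equations above; the array sizes, random kernel values, and the choice of ReLU for α are illustrative assumptions.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def conv2d_unit(x, w, bias, m, n):
    """Activation of one 2D feature-map unit at (m, n); w is an (H, W) kernel."""
    H, W = w.shape
    s = sum(x[m + a, n + b] * w[a, b] for a in range(H) for b in range(W))
    return relu(s + bias)

def conv3d_unit(x, w, bias, m, n, t):
    """Activation of one 3D feature-map unit at (m, n, t); w is an (H, W, D) kernel."""
    H, W, D = w.shape
    s = sum(x[m + a, n + b, t + c] * w[a, b, c]
            for a in range(H) for b in range(W) for c in range(D))
    return relu(s + bias)

x2d = np.random.rand(8, 8)
x3d = np.random.rand(8, 8, 5)   # five stacked frames add the temporal axis
print(conv2d_unit(x2d, np.random.rand(3, 3), 0.1, 2, 2))
print(conv3d_unit(x3d, np.random.rand(3, 3, 3), 0.1, 2, 2, 1))
```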

Architecture
The proposed architecture is shown in Figure 1. It consists of five modules: face positioning, ST learning, state understanding, feature fusion, and detection.

Face Positioning
Many CNN-based algorithms are available for face positioning. We notice several performance limitations in them, for the following reasons: 1. Using few filters while performing convolution may fail to differentiate input patterns owing to a lack of diversity. 2. Very large filter sizes are used unnecessarily; since this problem has only two classes (1. face and 2. non-face), the filter size can be reduced, and we fix it at 3 x 3. This reduces the total computational complexity, as this stage only decides whether a driver is present in the frame before the frame is processed for drowsiness detection. The 2D-CNN architecture used for positioning the face is given in Figure 3; a hedged sketch of such a network is given below.
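The paper's exact stage-wise layer configuration is not reproduced here; the following is a sketch of a lightweight two-class face/non-face 2D-CNN with fixed 3 x 3 kernels, where the input size, channel widths, and layer count are assumptions.

```python
import tensorflow as tf

def build_face_positioning_cnn(input_shape=(48, 48, 3)):
    # Feather-weight 2D-CNN: small 3x3 kernels, two output units (face / non-face).
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(16, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(2, activation="softmax"),  # 1. face, 2. non-face
    ])

model = build_face_positioning_cnn()
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
model.summary()
```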

Spatio-Temporal (ST) State Learning Phase
In this section, we describe the state learning model designed using a 3D-CNN. The 3D-CNN extracts the ST data from the sequence of frames; the activation of a hidden unit over a local receptive field is

$$a = \rho\left(\sum_{i=0}^{W_{lr}-1}\sum_{j=0}^{H_{lr}-1}\sum_{k=0}^{D_{lr}-1} w_{ijk}\, v_{ijk} + b\right)$$

where W_lr, H_lr, and D_lr are the width, height, and depth of the local receptive field, and v, w, and b are the input, weight, and associated bias. The activation value a triggers the hidden unit functions, and ρ is the local activation function used in convolution; we use Rectified Linear Units (ReLU) [14] for all local activations in the proposed 3D-CNN. As discussed in earlier sections, our 3D-CNN extracts spatial and temporal values simultaneously and conveys them to the state understanding and fusion models to produce the conditional feature for the drowsiness classification model. A sketch of such a backbone is given below.
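As a rough illustration of the ST learning backbone, the sketch below stacks Conv3D layers with ReLU activations over a frame volume; the frame count, spatial size, channel widths, and pooling shapes are assumptions, not the paper's configuration.

```python
import tensorflow as tf

st_learner = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(16, 64, 64, 3)),        # (frames, H, W, RGB)
    tf.keras.layers.Conv3D(8, (3, 3, 3), activation="relu"),
    tf.keras.layers.MaxPooling3D((1, 2, 2)),             # pool spatially only
    tf.keras.layers.Conv3D(16, (3, 3, 3), activation="relu"),
    tf.keras.layers.MaxPooling3D((2, 2, 2)),             # pool time and space
])

st = st_learner(tf.random.normal((1, 16, 64, 64, 3)))   # ST feature volume
print(st.shape)                                          # e.g. (1, 6, 14, 14, 16)
```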

State Understanding Phase
The goal of this section is to build models that understand driver physiological states and environmental conditions such as night time, day time, wearing glasses, and other important facial elements of the driver. This helps us develop an integrated adaptive state learning network conditioned on the state conditions. We hypothesize that the collected data (video) is associated with both the state conditions and driver drowsiness; this is explained in the training and inference discussion. Five sub-models are used: 1. the glasses and normal condition model lm_gn, 2. the head condition model lm_h, 3. the mouth condition model lm_m, 4. the eye condition model lm_e, and 5. the other special condition model lm_sc. We use one-hot vectors to define the states and their facial conditions; the assigned one-hot vectors are given in Table 1. We assume that linear kernels will face difficulty in handling ST data due to highly overlapped distributions, so we use a fully connected neural network (NN) to handle the ST data carefully.

The predictions of the models are represented as, for example,

$$\hat{O}_e = lm_e(st;\, \theta_e), \quad \hat{O}_e \in S^{O_e \times 1}$$

where Ȏ ∈ {Ȏ_gn, Ȏ_h, Ȏ_m, Ȏ_e, Ȏ_sc} are the predictions for input data x, O ∈ {O_gn, O_h, O_m, O_e, O_sc} are the input dimension representations of the states with their conditions, and θ ∈ {θ_gn, θ_h, θ_m, θ_e, θ_sc} are the parameters of the associated models in the fully connected understanding network. We design all the models with three hidden layers and one output layer. The operative function of these models is given by

$$op = f_{op}\Big(W_o\, f_{hl3}\big(W_{hl3}\, f_{hl2}(W_{hl2}\, f_{hl1}(W_{hl1}\, st + b_{hl1}) + b_{hl2}) + b_{hl3}\big) + b_O\Big)$$

where st is the spatio-temporal value derived from state learning using the 3D-CNN; W_hl3, W_hl2, and W_hl1 are the weights of the hidden layers and W_o is the weight of the output layer; b_hl3, b_hl2, and b_hl1 are the biases associated with the hidden layers and b_O is the bias of the output layer; f_hl1, f_hl2, and f_hl3 are the activation functions of the hidden layers and f_op is the final activation function of the output layer. The sub-models learn through backpropagation: each identifies a condition for the given ST data st, then trains its network parameters by minimizing the loss between predicted and fixed annotations,

$$\min_{\theta} \sum_i \mathcal{L}\big(\hat{O}^i,\, O^i\big) \tag{7}$$

A sketch of one such sub-model follows.
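A hedged sketch of one state-understanding sub-model with three hidden layers and a one-hot output, matching the operative function above; the hidden widths, flattened ST input dimension, and state counts are assumptions.

```python
import tensorflow as tf

def build_understanding_model(st_dim=4704, n_states=4):
    # Three hidden layers (f_hl1..f_hl3) and a SoftMax output layer (f_op).
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(st_dim,)),                 # flattened ST feature st
        tf.keras.layers.Dense(512, activation="relu"),          # f_hl1
        tf.keras.layers.Dense(256, activation="relu"),          # f_hl2
        tf.keras.layers.Dense(128, activation="relu"),          # f_hl3
        tf.keras.layers.Dense(n_states, activation="softmax"),  # one-hot states
    ])

lm_gn = build_understanding_model(n_states=2)  # e.g. glasses / no glasses
lm_gn.summary()
```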

Feature Fusion Phase
The feature fusion model is designed to learn an adaptive conditional feature representation from the ST representation st and the state conditional annotations Ȏ ∈ {Ȏ_gn, Ȏ_h, Ȏ_m, Ȏ_e, Ȏ_sc}. Using the ST data extracted from the 3D-CNN, st ∈ S^(W_st × H_st × D_st), and the predicted state conditions Ȏ of the sub-models, the fusion model identifies the adaptive conditional feature representation γ. This γ vector is calculated by a multiplicative interaction approach [9, 17-19]: the highly dependent and relevant features are identified by element-wise multiplicative interaction among the feature maps. As the proposed fusion model must learn from two kinds of sources, the training procedure defined by Hong et al. [9] is adopted. γ for the fusion model lm_fu is given by

$$\gamma = W_{fu}\Big( (W_{fea}\, st) \odot (W_{gn} \hat{O}_{gn}) \odot (W_{h} \hat{O}_{h}) \odot (W_{m} \hat{O}_{m}) \odot (W_{e} \hat{O}_{e}) \odot (W_{sc} \hat{O}_{sc}) + b_{fu} \Big)$$

where b_fu ∈ S^(d×1) is the bias of the fusion model and ⊙ represents element-wise multiplication. W_fu ∈ S^(H×d), W_fea ∈ S^(d × W_st H_st D_st), and W_gn, W_h, W_m, W_e, and W_sc are the weights. Here H and d are the hidden units and their total count in the fusion layer, and γ is the unnormalized adaptive conditional feature representation. This six-way inner tensor product helps identify the correlation among the defined sub-classes. Figure 4 shows input images, their ST values, and conditional feature representations. This procedure can yield values near zero, which may produce bad results and can sometimes violate the computational procedure by exceeding its range. A SoftMax normalization is used to solve these issues while preserving the dependencies between the ST data and the sub-model outputs. The normalized γ, represented as υ_i, is given by

$$\upsilon_i = \frac{e^{\gamma_i}}{\sum_{j} e^{\gamma_j}} \tag{10}$$

where υ_i is the normalized i-th element feature and γ_i is the unnormalized combined feature at the i-th element. Now υ represents the conditional feature over the ST data and sub-model outputs, and is the input to the final detection model. A numerical sketch of this fusion follows.
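The following numerical sketch applies the multiplicative fusion and SoftMax normalization of the equations above; all dimensions and the random inputs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, H = 64, 32
st = rng.random(300)                                   # flattened ST feature
preds = {k: rng.random(n) for k, n in
         [("gn", 2), ("h", 4), ("m", 3), ("e", 3), ("sc", 2)]}  # sub-model outputs

W_fea = rng.random((d, st.size))
W_state = {k: rng.random((d, v.size)) for k, v in preds.items()}
W_fu, b_fu = rng.random((H, d)), rng.random(d)

z = W_fea @ st                          # project ST feature to d dimensions
for k, v in preds.items():
    z = z * (W_state[k] @ v)            # element-wise multiplicative interaction
gamma = W_fu @ (z + b_fu)               # unnormalized conditional feature

e = np.exp(gamma - gamma.max())         # numerically stable SoftMax, Eq. (10)
upsilon = e / e.sum()
print(upsilon.shape, upsilon.sum())     # (32,) and 1.0
```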


Detection Phase
The fusion phase provides the conditional feature cluster υ, which contains feature information about driver expressions across all classes, and the final driver drowsiness identification model of the proposed system is constructed on υ. A fully connected NN similar to the one in the state understanding phase is applied on top of the fusion model to derive the final two-class (drowsy and non-drowsy) output,

$$R_{Detect} = lm_{detect}(\upsilon;\, \theta_{detect}) \tag{11}$$

where R_Detect is the final output of the detection model with parameter set θ_detect. R_Detect falls under two class units (the output layer units of this fully connected layer are 1. drowsy and 2. non-drowsy), and we use a SoftMax activation function to measure the likelihood of the output units for every input sample. If the SoftMax function returns a high value in the drowsy unit, the driver is sleepy; if it returns a high value in the non-drowsy unit, the driver is in a normal condition. For better results, a common optimization is carried out for both the fusion and detection models to minimize the loss between the final detection and its annotated form,

$$\min_{\theta_{fu},\, \theta_{detect}} \sum_{i} \mathcal{L}\big(R^i_{Detect},\, R'^i\big) \tag{12}$$

where R'^i is the expected outcome of input sample i and 𝓛 is the SoftMax cross-entropy loss function, which recalculates the deviations over a fixed number of iterations and is embedded into all the previously described models of the proposed system. A sketch of this detection head follows.
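A minimal sketch of the detection head of Equation (11) with SoftMax cross-entropy as in Equation (12); the conditional-feature width and hidden size are assumptions.

```python
import tensorflow as tf

detect = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32,)),              # conditional feature υ
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),  # 1. drowsy, 2. non-drowsy
])

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
r = detect(tf.random.normal((4, 32)))                # R_Detect for a small batch
print(loss_fn(tf.constant([0, 1, 0, 1]), r))         # compare against annotations
```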


Training and Augmentation
CNN training in the face positioning phase has three objectives: 1. facial classification, 2. regression of bounding lines, and 3. identification of facial landmarks. For facial classification we have two class units (1. face and 2. non-face); for the regression of bounding lines we have four class units (1. left, 2. right, 3. length, 4. width) representing the bounding boxes (candidate windows); and for facial landmarks we have ten class units (1. left eye, 2. not-left eye, 3. right eye, 4. not-right eye, 5. nose, 6. not-nose, 7. left end of lips, 8. not-left end of lips, 9. right end of lips, and 10. not-right end of lips) in the final output layer of the fully connected layer. If the corresponding values from the units are high, the proposed system is likely to adhere to that class unit. Error minimization is carried out for all three stages (I-net, P-net, O-net) to repeatedly train the network, and we use loss functions for the error measurements. For facial binary classification we use the cross-entropy loss given in Equation (13); for the regression bounding lines and facial landmarks we adopt the Euclidean losses given in Equations (14) and (15):

$$Loss^{det}_i = -\big(O^{det}_i \log(P_i) + (1 - O^{det}_i)\log(1 - P_i)\big) \tag{13}$$

$$Loss^{box}_i = \big\lVert O'^{box}_i - O^{box}_i \big\rVert^2_2 \tag{14}$$

$$Loss^{loc}_i = \big\lVert O'^{loc}_i - O^{loc}_i \big\rVert^2_2 \tag{15}$$

where P_i is the probability of being a face for input x_i and O^det_i ∈ {0, 1} is the expected outcome; O^box_i and O'^box_i are the expected and actual regression bounding line targets, with O^box_i ∈ R^4; similarly, O^loc_i and O'^loc_i are the expected and actual facial positions, with O^loc_i ∈ R^10. There is a scenario in which Loss^box_i and Loss^loc_i can be fixed at zero: when the input is only a background image (the case without a driver), there is no need to find bounding boxes and facial positions, so these two losses can be set to 0. We handle this situation by introducing an indicator, and the face positioning objective is given by

$$\min \sum_{i}^{N} \sum_{j \in \{det,\, box,\, loc\}} \gamma_j\, \beta^j_i\, Loss^j_i \tag{16}$$

where γ_j is the rate of importance and β^j_i ∈ {0, 1} acts as the indicator. The first two stages of the proposed face positioning system (I-net and P-net) use γ_det = 1, γ_box = 0.5, γ_loc = 0.5, and O-net uses γ_det = 1, γ_box = 0.5, γ_loc = 1. Stochastic Gradient Descent (SGD) is deployed to train these three stages, and the final facial positions are stacked separately. These stacked images form another image pyramid that is given as input to the 3D adaptive state learning network. The proposed 3D adaptive state learning network has two main objective functions (OFs), defined in Equations (7) and (12) for the state understanding model and the detection model, and it is important to include both in the optimization to produce better results. From these equations, the objective function of the whole proposed system is given by

$$\min \sum_{i}^{N} \sum_{j \in \{det,\, box,\, loc\}} \gamma_j\, \beta^j_i\, Loss^j_i \;+\; \min_{\theta_{su},\, \theta_{fu},\, \theta_{detect}} \big( Loss_{su} + \lambda\, Loss_{detect} \big) \tag{17}$$

where λ is the balancing parameter between the understanding and detection phases. This objective function optimizes all five modules of the proposed system. Even though we have five modules to train, we do not train all of them simultaneously; we train according to the groups that impact the output of the proposed system architecture. The overall output depends on 1. face positioning, 2. ST learning, 3. state understanding, 4. fusion, and 5. detection, so we first train the face positioning model, then ST learning and state understanding, followed by the fusion and detection models. A sketch of the face-positioning loss follows.
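The sketch below evaluates the per-sample face-positioning loss of Equations (13)-(16) with the stage weights quoted above; the numeric inputs are illustrative.

```python
import numpy as np

def sample_loss(p_face, o_det, box_pred, box_true, loc_pred, loc_true,
                gammas=(1.0, 0.5, 0.5), is_background=False):
    """Weighted per-sample loss: gammas = (gamma_det, gamma_box, gamma_loc)."""
    g_det, g_box, g_loc = gammas
    eps = 1e-12
    # Eq. (13): cross-entropy for face / non-face classification.
    l_det = -(o_det * np.log(p_face + eps) + (1 - o_det) * np.log(1 - p_face + eps))
    # Eq. (16) indicator: skip box/landmark losses for background-only frames.
    beta = 0.0 if is_background else 1.0
    l_box = beta * np.sum((box_pred - box_true) ** 2)   # Eq. (14), O_box in R^4
    l_loc = beta * np.sum((loc_pred - loc_true) ** 2)   # Eq. (15), O_loc in R^10
    return g_det * l_det + g_box * l_box + g_loc * l_loc

loss = sample_loss(0.9, 1, np.zeros(4), np.full(4, 0.1),
                   np.zeros(10), np.full(10, 0.1))
print(loss)
```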

For the objective of driver drowsiness identification from video, the proposed system uses ST learning to create the ST data, which state understanding uses to create its models and sub-models; then, using the ST data and the state understanding information, the adaptive conditional feature is created by the fusion model, and drowsiness is detected from the adaptive conditional feature.
Overfitting is a general issue in all learning models, especially in most unsupervised learning designs. Generally, overfitting can be reduced by transforming the dataset in different ways without losing the output labels and feeding the transformed knowledge to the learning system. In our work, the stacked images from the face positioning phase are flipped (horizontal transformation). Since the computation for this transform is very low, we create a new dataset at low computational cost. The original and flipped images are then transformed using a Gaussian filter, and the transformed images are used to additionally train the proposed system to fix the patches in the training phase. Further, we generate four variations of each input sample through the image pyramid technique, shown in Figure 5, to include in the experiments; thus, many additional input samples are created through the horizontal transformation and the image pyramid technique. Without these two augmentation techniques, our proposed system suffers from overfitting and a low convergence rate, as discussed further in the ablation experiments. A sketch of this augmentation pipeline follows.
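A sketch of the augmentation pipeline under stated assumptions: the Gaussian sigma, the four pyramid scales, and nearest-neighbour resizing are choices made here for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def augment(frame, sigma=1.0, scales=(1.0, 0.75, 0.5, 0.25)):
    """Horizontal flip + Gaussian filtering + a four-level image pyramid."""
    variants = [frame, np.fliplr(frame)]                        # horizontal transform
    variants += [gaussian_filter(v, sigma=sigma) for v in variants]
    pyramid = []
    for v in variants:
        for s in scales:                                        # four pyramid levels
            h, w = int(v.shape[0] * s), int(v.shape[1] * s)
            rows = (np.arange(h) / s).astype(int)
            cols = (np.arange(w) / s).astype(int)
            pyramid.append(v[np.ix_(rows, cols)])               # nearest-neighbour resize
    return pyramid

samples = augment(np.random.rand(64, 64))
print(len(samples))   # 2 flips x 2 filter states x 4 scales = 16 variants
```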

Datasets
In this section, we evaluate the proposed system on two datasets, 1. the NTHU-DDD dataset (benchmark) and 2. the KEC-DDD dataset (our own collection), and compare it using major performance metrics.
In the early days of driver drowsiness detection studies, most researchers used private datasets; nowadays, many driver drowsiness detection datasets are publicly accessible. We requested access to the NTHU-DDD dataset through the end-user license agreement provided by NTHU and downloaded the dataset from their FTP server to implement our proposed driver drowsiness detection system.

Experimental Results
We used 2D convolution layers for the face positioning phase and 3D convolution layers for the ST learning, state understanding and detection phases, so it is natural to compare results with other models at the face positioning stage and at the drowsiness detection stage. We report results on the designated evaluation sets of the KEC-DDD and NTHU-DDD datasets. Performance of the proposed system is measured at the face positioning stage and at the final drowsiness detection stage and compared with other models. The positioning phase validation results are presented in Table 2. The training accuracies during the state understanding phase are shown in Figure 7 separately for all sub-models. The validation accuracy of a state understanding model is given by y/x, where y is the number of correct classifications and x is the number of input samples of the respective sub-model. Validation results of the state understanding phase, which comprises five models, 1. glasses and normal conditions lm_gn, 2. head conditions lm_h, 3. eye conditions lm_e, 4. mouth conditions lm_m and 5. special conditions lm_sc, are shown in Table 3. The average accuracies are calculated by taking the mean over the respective heads, so that the differing numbers of classifications are neutralized. The final average accuracy of the state understanding phase is 0.888 for the KEC-DDD and 0.866 for the NTHU-DDD dataset. It is also observed that the highest accuracies are found in lm_gn and comparatively low accuracies in lm_e; this gap between sub-models is compensated by the bias of ST learning to support the final detection. The output of the state understanding phase always depends on the size of the target element: as the mouth and eyes are small compared with the glasses and head regions in the full frame in both datasets, the models lm_gn and lm_h may be overfitted and produce good accuracies, as shown in Table 3. The overall performance of the proposed system is measured by the F-measure, the harmonic mean of precision and recall,

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$

Compared models show reduced stability at night-time eye positioning when the driver wears glasses (both datasets behave similarly). Individually, our proposed state understanding models perform well even against competitive deep networks. The average accuracy over the drowsiness and non-drowsiness classes of the proposed driver drowsiness detection system is given in Table 5. The ROC-AUC curves comparing the proposed system with other methods at the face positioning and drowsiness detection stages are shown in Figures 8(a) and 8(b); the proposed system is very tolerant to false positives, as its false positive rate stays below approximately 0.04, beyond which the curve shows a clear advantage for the proposed system. Figures 9(a)-(b) show sample screenshots of drowsiness detection using the KEC-DDD and NTHU-DDD datasets.
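As a quick check of this metric, a direct transcription of the formula with counts as inputs; the counts below are illustrative, not values from the tables:

```python
# F-measure as the harmonic mean of precision and recall.
def f_measure(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f_measure(tp=880, fp=40, fn=80))  # ~0.936
```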

Figure 8a
ROC-AUC of positioning stage for various datasets

Figure 8b
Final ROC-AUC of whole system for various datasets

Figure 9a
Output using KEC-DDD Dataset

Ablation Study
Since we introduce a complex architecture together with a new dataset, an ablation study is essential. An ablation study is carried out with the KEC-DDD dataset to investigate the proposed architecture under four cases: 1. without performing face positioning, giving the raw input directly to the ST learning phase; 2. excluding the augmented datasets from training; 3. replacing the suppression

Complexity Analysis
Practically, for any CNN-based architecture, complexity depends on major parameters such as input image size, kernel size and pooling size. Training needs more computation time than testing, since it must backpropagate and adjust the weights over the required iterations; testing, on the other hand, has low time complexity as it only depends on computing the result. We calculate complexities that apply to both training and testing of the proposed system. Theoretically, the complexity of the proposed system is given by the maximum of its 2D and 3D CNN computations,

$$O\!\left(\max\!\left(\sum_{i=1}^{n} W_i H_i\, x_i y_i,\; \sum_{i=1}^{n} W_i H_i D_i\, x_i y_i z_i\right)\right),$$

where i indexes the 1st to n-th convolution layer, W_i, H_i, D_i denote the width, height and depth of the inputs, and x_i, y_i, z_i the width, height and depth of the kernels in the i-th layer of the respective 2D and 3D convolutions. The computational complexity of the two-layered understanding and detection models is O(C·N²), where N is the hidden layer dimension and C is the target domain size. Measuring execution time without the output display, we achieve around 39.6 FPS (30.2 ms per frame) over an average of 400 seconds of execution, which is almost real-time processing of a live dynamic scenario.
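The per-layer estimate can be written down directly; a small sketch under the assumption that each layer's cost is the input volume times the kernel volume, with purely illustrative layer shapes:

```python
# Complexity estimate as above: per-layer cost = (input volume) x
# (kernel volume); system cost = max of the 2D and 3D totals.
def conv_cost(layers):
    """layers: list of ((W, H, D), (x, y, z)) input/kernel shapes."""
    total = 0
    for (W, H, D), (x, y, z) in layers:
        total += W * H * D * x * y * z
    return total

cost_2d = conv_cost([((256, 256, 1), (3, 3, 1))])  # 2D as a depth-1 case
cost_3d = conv_cost([((64, 64, 16), (3, 3, 3))])
system_cost = max(cost_2d, cost_3d)
```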
The proposed system is developed using the TensorFlow library and runs on a Core i7 3.4 GHz machine with 16 GB RAM and a 12 GB GeForce GTX TITAN X. The execution times of the different models are compared in Table 6.
Figure 1(a) Driver state positions

Figure 2
Representation of 3D-CNN Architecture. Purple denotes images stacked after face positioning (i.e., input images for DDD) and green denotes the extracted Deep ST values. Yellow denotes pooling layers and grey denotes convolution layers. Numbers at the top are depths and numbers at the bottom are the 3D kernel volumes of the associated layers.

Figure 4
(a-c) A. Input images; B. generated ST value representations; C. conditional feature representations.

Training and Augmentation
CNN training in the face positioning phase has three objectives: 1. facial classification, 2. regression of bounding lines, and 3. identifying facial landmarks. To compute results for facial classification we have two class units (1. face, 2. non-face); for the regression bounding lines we have four class units (1. Left, 2. Right, 3. Length, 4. Width) representing the bounding boxes (candidate windows); and for facial landmarks we have ten class units (1. Left eye, 2. Not-left eye, 3. Right eye, 4. Not-right eye, 5. Nose, 6. Not-nose, 7. Left end of lips, 8. Not-left end of lips, 9. Right end of lips, 10. Not-right end of lips).
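For illustration, a hedged Keras sketch of these three output heads; the shared backbone and layer sizes are assumptions, not the paper's cascaded three-stage design:

```python
# Three heads matching the objectives above: face classification
# (2 units), bounding-line regression (4 units), landmark
# classification (10 units). Backbone is illustrative only.
import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = tf.keras.Input(shape=(48, 48, 3))
x = layers.Conv2D(32, 3, activation="relu")(inputs)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)

face_cls = layers.Dense(2, activation="softmax", name="face")(x)        # face / non-face
bbox_reg = layers.Dense(4, name="bbox")(x)                              # left, right, length, width
landmarks = layers.Dense(10, activation="sigmoid", name="landmarks")(x) # 5 points x (point / not-point)

model = Model(inputs, [face_cls, bbox_reg, landmarks])
model.compile(optimizer="adam",
              loss={"face": "sparse_categorical_crossentropy",
                    "bbox": "mse",
                    "landmarks": "binary_crossentropy"})
```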

Figure 5
Augmentation procedure (creation of 4 different variants to improve test and train size)

In the initial stage of training, annotations for face positioning are created using the non-intersection-over-union (NIOU) suppression technique as listed below:
Negatives: regions with NIOU ratio < 0.275 to ground truth
Positives: regions with NIOU ratio > 0.625 to ground truth
Part facial regions: regions with NIOU ratio 0.375 ≤ area ≤ 0.675 to ground truth
Facial points: the five labelled facial locations.
The region (0.275 < area < 0.375) is left out during NIOU, as there exists an undecidable variational gap between negatives and part faces. During driver face positioned training, the positives and negatives are used for classification, the positives and part faces for the bounding boxes, and the facial points to localize and provide additional confirmation of the positioned faces.
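A small sketch of this labelling rule, assuming the NIOU ratio of each candidate window to ground truth is already computed; note the stated positive and part-face ranges overlap between 0.625 and 0.675, and the code below picks one reading by checking positives first:

```python
# Label a candidate region by its NIOU ratio to ground truth, using
# the thresholds stated in the text.
def label_region(niou):
    if niou > 0.625:
        return "positive"
    if niou >= 0.375:          # part faces: 0.375 <= area <= 0.675
        return "part_face"
    if niou < 0.275:
        return "negative"
    return None                # gap (0.275, 0.375): left out as undecidable
```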

Figure 6
Few samples of the KEC-DDD dataset: (a) eye closed position; (b) eye rubbing by hand position; (c) eye rubbing by hand without glasses position; (d) head nodding position; (e) lifting eyebrow position; (f) looking either side distracted position; (g) normal head, eye and mouth position; (h) spectacles falling; (i) talking and laughing; (j) yawning position.

Figure 10
Figure 10 ROC-AUC of our model for the various cases in the ablation study


These outputs are stacked and given as input to 3D state learning for extracting the ST values, and we define sub-models to understand the different driver states. We then use the fusion model to recognize one or more driver states in a single image using the results of the sub-models. The fusion model extracts conditional features from the frames to drive the final detection model.
One-hot encoding is a general method of representation in which exactly one unit is high (1) and all others are low (0). During feature fusion, a condition model is developed from these one-hot vectors and the ST data. Finally, the detection model identifies drowsiness.
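For illustration, a minimal NumPy sketch of such one-hot state vectors; the class index and vector size here are placeholders, since Table 1 holds the actual assignments:

```python
# One-hot state annotation: a single high (1) entry, all others low (0).
import numpy as np

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

# e.g. the glasses/normal model with five target classes:
o_gn = one_hot(2, 5)   # -> [0., 0., 1., 0., 0.]
```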
This proposed work has a main category of glasses and normal state conditions and four sub-categories of driver state conditional elements: 1. glasses and normal state O_gn; 2. head condition model O_h; 3. mouth condition model O_m; 4. eye condition model O_e; and 5. other special condition model O_sc.

Table 1
Annotations of Models in State Understanding stage

Here st is the learnt ST data from the input x, and W_st, H_st and D_st are the width, height and depth of the ST data. This ST data can also be seen as the activation values of the hidden layer taken from the last convolution layer of the proposed 3D-CNN adaptive state learning model. We design the 3D-CNN with 4 convolution and 2 pooling layers; the detailed architecture is given in Figure 2. To identify the ST data simultaneously, we use a 3D local receptive field whose operation is

$$a = \rho\left(\sum_{x}\sum_{y}\sum_{z} v_{x,y,z}\, w_{x,y,z} + b\right),$$

where v are the input values, w the kernel weights, b the bias and ρ the activation function. We use one-hot vectors to define the states and their facial conditions; the assigned one-hot vectors are given in Table 1. We assume that linear kernels will have difficulty handling the ST data due to its highly overlapped distributions, so we use a fully connected neural network (NN) to handle the ST data carefully. The predictions of the models are represented as

$$\hat{O}_{gn} = lm_{gn}(a;\theta_{gn}),\ O_{gn} \in S^{|O_{gn}| \times 1}, \qquad \hat{O}_{h} = lm_{h}(a;\theta_{h}),\ O_{h} \in S^{|O_{h}| \times 1}, \ \ldots$$

where st denotes the Spatial Temporal values derived from state learning using the 3D-CNN; W_hl3, W_hl2 and W_hl1 are the weights of the hidden layers and W_o is the weight of the output layer; b_hl3, b_hl2 and b_hl1 are the biases of the hidden layers and b_o is the bias of the output layer; f_hl1, f_hl2 and f_hl3 are the activation functions of the hidden layers and f_op is the final activation function of the output layer. The sub-models learn through backpropagation: each intends to identify a condition for the given ST data st, then calculates the difference between the predicted and fixed annotations to train the network parameters. The output dimensions of the state understanding models always depend on the target classes to predict; for instance, the output of the glasses and normal state understanding model has five target classes for a given ST data input. The sub-models are trained to optimize the Objective Function (OF)

$$OF(\hat{O}, O; \theta) = \min_{(\theta_d,\, \theta_{gn},\, \theta_h,\, \theta_m,\, \theta_e,\, \theta_{sc})} \gamma \sum_i \Big[ OF_{gn}(O_{gn},\hat{O}_{gn}) + OF_h(O_h,\hat{O}_h) + OF_m(O_m,\hat{O}_m) + OF_e(O_e,\hat{O}_e) + OF_{sc}(O_{sc},\hat{O}_{sc}) \Big], \qquad (7)$$

where O ∈ {O_gn, O_h, O_m, O_e, O_sc} are the annotations of the input and OF_gn, OF_h, OF_m, OF_e and OF_sc are the SoftMax cross entropy loss functions that measure the difference between the actual annotation and the predicted result. γ is the hyperparameter regularizing the sum of the error values received from the loss functions of all sub-models. Further details about training are discussed in the training and inference section. Using the ST data and the results of the state understanding models, a fusion model is created to form the conditional feature representation for the final detection model.

The feature fusion model is designed to learn a collection of adaptive conditional feature representations from the ST representation st and the state conditional annotations Ȏ ∈ {Ȏ_gn, Ȏ_h, Ȏ_m, Ȏ_e, Ȏ_sc}. Using the ST data extracted from the 3D-CNN, st ∈ R^{W_st × H_st × D_st}, and the predicted state conditions Ȏ of the sub-models, the fusion model identifies the collection of adaptive conditional feature representations γ. This γ vector is calculated by the multiplicative interaction approach [9, 17-19], by which the highly dependent and relevant features among the ST data and the state predictions are identified; the γ corresponding to the fusion model lm_fu follows from this multiplicative interaction.
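As an illustration of these pieces, a hedged TensorFlow sketch of the sub-models, the summed objective of Eq. (7), and one possible reading of the multiplicative-interaction fusion; the hidden widths and all class counts except the five classes of lm_gn are placeholders:

```python
# State understanding sub-models (three hidden layers, as described),
# the gamma-regularized sum of SoftMax cross entropies of Eq. (7), and
# an outer-product reading of the multiplicative interaction fusion.
import tensorflow as tf
from tensorflow.keras import layers

def sub_model(num_classes, hidden=(256, 128, 64)):
    # f_hl1..f_hl3 are ReLU here; f_op is SoftMax as in the text.
    return tf.keras.Sequential(
        [layers.Dense(h, activation="relu") for h in hidden]
        + [layers.Dense(num_classes, activation="softmax")])

models = {"gn": sub_model(5), "h": sub_model(4), "m": sub_model(4),
          "e": sub_model(4), "sc": sub_model(3)}

def objective(st_data, annotations, gamma=1.0):
    """Eq. (7): gamma-regularized sum of per-model cross entropies."""
    cce = tf.keras.losses.CategoricalCrossentropy()
    return gamma * tf.add_n(
        [cce(annotations[k], models[k](st_data)) for k in models])

def fuse(st_data, predictions):
    """Multiplicative interaction between ST data and the concatenated
    state predictions (an outer product; one reading of [9, 17-19])."""
    cond = tf.concat(list(predictions.values()), axis=-1)
    return st_data[:, :, None] * cond[:, None, :]

st = tf.random.normal([8, 128])              # batch of flattened ST data
preds = {k: m(st) for k, m in models.items()}
feature = fuse(st, preds)                    # shape (8, 128, 20)
```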
P_i is the probability of being a face for input x_i and O_i^det ∈ {0,1} is the expected outcome.

Table 2
Average Validation accuracy of face positioning phase

Table 5
F-Measures and accuracies of DDD using the evaluation set of the KEC-DDD Dataset

Table 6
Comparison of speed across models trained with the KEC-DDD Dataset

In this work, the architecture for the face positioning, ST learning, state understanding, feature fusion and detection phases of the proposed system is designed and implemented for driver drowsiness detection under Indian conditions (Indian driver face positions). We start by giving the input to the face positioning phase, which is designed with cascaded three-stage 2D convolution layers; the face-classified outputs are stacked to provide input to the Spatio-Temporal learning stage, where the Spatio-Temporal values are created and passed to the state understanding phase. The models and sub-models defined in the state understanding phase are trained to hold the knowledge of the respective driver state conditions. Along with this knowledge of the state conditions, the ST values are passed to the feature fusion stage, by which a conditional feature representation is created. This conditional feature is given as input to the final fully connected layer (detection phase), by which drowsiness and non-drowsiness of the driver are classified. The proposed procedure is carried out using two datasets, KEC-DDD (our own dataset) and the NTHU-DDD training dataset. Additionally, an ablation study to confirm the effectiveness of our architecture is conducted for four different cases and its results are discussed separately. Results of the proposed system are measured on both datasets at two stages (face positioning and final detection) and compared with the literature discussed earlier. From the results, it can be concluded that the proposed system outperforms the other methods such as 3D-CNN, R-CNN and MultiCNN-Deep Model under Indian conditions (Indian driver face positions) and is capable of detecting driver drowsiness from 256×256 resolution images at 39.6 fps over an average of 400 seconds of execution. Even though we produce acceptable results for driver drowsiness detection, the system is still hard to deploy in a real vehicle because of limitations such as 1. the cost of the GPU unit, 2. the system's need for a huge count of labelled samples across the various driver state situations, and 3. offline training; these limitations can be addressed in the near future.