A Novel Density-based Technique for Outlier Detection of High Dimensional Data Utilizing Full Feature Space

Anomaly detection has recently attracted considerable attention from the data mining community, as its use has grown steadily in practical domains such as product marketing and fraud detection. Outlier detection in high dimensional data poses particular challenges because of the curse of dimensionality and the resulting resemblance between distant and adjoining points. Customary methodologies concentrate largely on low dimensional data and hence prove ineffective at discovering anomalies in data sets with a high number of dimensions. Digging out anomalies from a high dimensional data set becomes difficult and tedious when all subsets of projections need to be explored. All data points in high dimensional data behave like similar observations because of an intrinsic property: the difference between the distances to the nearest and the farthest observation approaches zero as the number of dimensions tends towards infinity. This research work proposes a novel technique that explores the deviation among all data points and embeds its findings inside well-established density-based techniques. It opens a new direction of research towards resolving the inherent problems of high dimensional data in which outliers reside within clusters having different densities. Datasets from the UCI Machine Learning Repository are chosen to test the proposed technique, and its results are compared with those of density-based techniques to evaluate its efficiency.


Introduction
An outlier can be distinguished from an inlier as an observation that differs markedly from the rest and that may prove very valuable to an individual or organization. Outliers and noise are two different entities, as only the former is of interest. Separating regular data from unexpected data brings several benefits in practical fields. These irregular observations are also known as aberrations, anomalies, contaminants, discordant observations and exceptions in different application fields [9]. A precise characterization of an anomaly is that it is a point that behaves differently from other points with respect to some characteristics. Density-based anomaly detection generates two kinds of data points, inliers and outliers, as shown in Fig. 1. An inlier is a data point that is densely surrounded by its neighboring points, whereas an outlier has relatively few neighbors and hence behaves like an abnormal entity. Several issues need to be considered while detecting anomalies in a particular class of data set. As suggested by Ranga Suri et al. [38], certain questions must be settled beforehand: which method to choose (distance-based or density-based), what type of data is involved (numerical or categorical), and what the mode of analysis is (online or offline). Local neighborhood-based anomaly detection assumes that regular data points occupy dense neighborhoods, whereas anomalies lie far away from their neighbors, that is, they inhabit sparse neighborhoods. Conventional procedures that handle anomaly detection for low dimensional data become highly unsuitable for high dimensional data [1]. High dimensional data has an inherent problem: the averaged effect of all dimensions makes anomalies indistinguishable from the other data points.
This problem needs to be addressed so that anomalies can be made distinguishable. LSOF proves to be a very efficient method for detecting outliers in high dimensional data, as it reduces the variance among neighboring data points [2]. It is observed that low-dimensional projections (spaces comprising a subset of attributes) contain the bulk of the anomalies hidden inside high-dimensional data streams [28]. Such anomalies are known as projected anomalies, that is, an anomaly visible in one projection might behave normally in another projection [21].
High dimensional data has recently been employed in many practical fields, including recommendation systems, stock exchanges, medical data, electronic vendors and unstructured data [32]. The Concrete and Ionosphere datasets are good examples of high dimensional data and can be exploited for data mining purposes.
Two major problems are observed regarding anomaly detection for high dimensional data. First, the distinctiveness of similarities between data points weakens when the number of dimensions exceeds some limit. A study in [14] demonstrates that, as dimensionality grows, the gap between the Euclidean distance to the nearest neighbor and that to the farthest point shrinks. Second, the complexity of anomaly detection algorithms suffers from the curse of dimensionality, that is, it rises exponentially as dimensionality grows. When the number of attributes exceeds some limit, a typical anomaly detection algorithm becomes inflexible and unreliable. Such algorithms are therefore unsuitable for deployment in practical domains [33,47].

Figure 1
Outlier vs Inlier (Density-based anomaly detection)

Figure 2
An Overview of Anomaly Detection Techniques

Fig. 2 presents a summary of outlier detection techniques for low and high dimensional data. It further classifies algorithms into two broad categories on the basis of their output, either binary or score. Low dimensional approaches comprise distance-based, density-based and cluster-based techniques. Vector-based, subspace-based and grid-based techniques have been devised for and tested on high dimensional datasets. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) identifies anomalies as noise. It is a local neighborhood-based methodology that groups data points into clusters of arbitrary shape. Its mechanism is based on two elementary ideas, density connectivity and density reachability. Its operation is governed by two parameters, the minimum number of data points and the size of the epsilon neighborhood [42].
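The role of the two DBSCAN parameters can be illustrated with a minimal sketch. It does not build clusters; it only reports the points that DBSCAN would label as noise, namely points that are not core points and have no core point inside their epsilon neighborhood. The names `dbscan_noise`, `eps` and `min_pts` and the toy data are illustrative assumptions, not taken from [42].

```python
import math

def dbscan_noise(points, eps, min_pts):
    """Return the indices of the points that DBSCAN would label as noise."""
    n = len(points)
    # Epsilon neighborhood of every point (the point itself is counted, as in DBSCAN).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # A core point has at least min_pts points in its epsilon neighborhood.
    core = [len(nb) >= min_pts for nb in neighbors]
    # Noise: neither a core point nor within eps of any core point.
    return [
        i for i in range(n)
        if not core[i] and not any(core[j] for j in neighbors[i])
    ]

# Toy data (assumed): four points on a unit square and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
noise = dbscan_noise(points, eps=1.5, min_pts=3)
print(noise)
```

With these settings every corner of the square is a core point, so only the isolated point is reported.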
Another local neighborhood-based methodology, the Local Outlier Factor (LOF), has attracted the attention of researchers as it produces scored outliers. The core idea is that the local density of a data point is compared with the local densities of its neighboring points. The user selects the parameter 'k', which determines the number of neighbors to be processed. Many variants have been proposed to improve the efficiency of the LOF algorithm.
Local Correlation Integral (LOCI) has been acknowledged as a comprehensive anomaly detection technique. Its specialty is that it discovers isolated anomalies as well as groups of anomalies. Earlier techniques require users to choose cutoffs that decide whether a data point is normal or anomalous, whereas LOCI determines the cutoff automatically and hence relieves its users of this burden. Another special feature of this methodology is that it captures an abundance of information about the vicinity of the point under observation, that is, micro-clusters, macro-clusters, their diameters and inter-cluster distances are determined through this technique. Optimized results are expected when LOCI is studied and analyzed while tackling the inherent problems of high dimensional data. Techniques devised for low dimensional data work efficiently when the number of dimensions is small. Six to fifteen dimensions are very common in low dimensional data (e.g., the Breast Cancer dataset in the UCI Machine Learning Repository), hence the distance between points can be easily differentiated through any ordinary distance measure, e.g., the Euclidean metric, whereas high dimensional data contains a relatively high number of dimensions.
In this research work, a local neighborhood-based approach is proposed for exploring anomalies in a dataset having a high number of dimensions. The proposed approach exploits the benefits of some existing techniques, i.e., Distributed LOF [44], INFLO [18], COF [40] and LoOP (which is a statistical technique) [25]. The rest of the paper is organized as follows. Section 2 discusses the research motivation, research questions and research objectives. Sections 3 and 4 elaborate on related work and the proposed methodology, respectively. Experimental work and results are discussed in Section 5. Limitations of the proposed technique are presented briefly in Section 6. Finally, Section 7 concludes the paper.

Research Motivation
During the past few years, low dimensional data has increasingly been replaced by high dimensional data because of rapid advancements in technology. It is therefore essential to devise systems and algorithms that can handle the problems of high dimensional data. The generation of big data and large data sets has motivated many scientists to redesign algorithms and techniques for anomaly detection in high dimensional data. Real data and real problems often involve high dimensional data consisting of dozens of dimensions. For data miners, finding anomalies across so many dimensions is not an easy job.
Although it is very common to tackle such situations with dimensionality reduction techniques like PCA (Principal Component Analysis), many datasets require all dimensions to be treated as equally relevant. Subspace-based techniques like SOD (Subspace based Outlier Detection) are considered suitable but suffer from the curse of dimensionality. The proposed work focuses on the full feature space of datasets to resolve the issue of minimal differences between data points when dimensionality grows very large. Further, it also benefits from established outlier detection techniques like INFLO (influenced outlierness), which handles clusters of different densities residing close to one another.

Problem Statement
Exploring anomalies in high dimensional data through subsets of the feature space is costly in terms of time and accuracy, as searching for subspaces is itself a time-consuming job. Conventional methodologies cannot detect anomalies in high dimensional data because the distances between data points become nearly indistinguishable, but these methodologies could be adapted to tackle this problem. In our case, outcomes are expected to be more precise and computationally less costly than outcomes obtained through subsets of the feature space. The exploitation of subspaces or subsets of features resolves the likeness of data points in a dataset having a large number of dimensions, hence this approach has been utilized in many subspace-based outlier detection techniques.
Only the brute force technique, which tries all combinations of different subspaces, guarantees one hundred percent accuracy, but it is not feasible in practice. Evolutionary techniques like the Genetic Algorithm handle time complexity efficiently but only improve their results gradually with each iteration. The conventional density-based outlier detection technique LOF and its variants INFLO and COF are considered state-of-the-art techniques, but only for low dimensional data. Since these techniques do not utilize a subspace-based approach, their adaptation to high dimensional data promises better results in terms of time complexity, optimization and the memory required to process data.

Research Questions
Local neighborhood-based anomaly detection algorithms need to be analyzed by revising the variance of attributes for high dimensional data sets. The following major research questions are explored and answered in this work.
1. The likeness of data points in high dimensional data needs to be handled more intelligently [14]. All data points resemble each other with respect to the distance between them. Is it possible to maximize the difference in distance among data points? A related research direction is to find or improve distance measures that maximize the distance between data elements. For example, the Manhattan distance metric yields more distance variation among data points than the Euclidean distance, and hence it is suggested for high dimensional data sets [11].
2. The curse of dimensionality makes projected-subspace-based outlier detection infeasible for high dimensional data sets. Is it possible to replace projected-subspace-based techniques with full-space-based techniques?
3. Traditional techniques work on the full feature space of low dimensional data. These methods fail on high dimensional data, as outliers are supposed to lie in projected features [30]. Should traditional techniques devised for full feature spaces be adapted (improved), or should new techniques (in terms of approaches like vector-based or subspace-based) be developed?
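The first question can be probed empirically with a small sketch that measures how the gap between the nearest and the farthest point shrinks as dimensionality grows. The quantity `relative_contrast`, the uniform random data, the sample size and the seed are all assumptions made for illustration and are not part of the original study. The script prints the contrast under both the Euclidean and the Manhattan metric so that the two can be compared on data of one's choosing.

```python
import math
import random

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def relative_contrast(dim, dist, n_points=200, seed=0):
    """(d_max - d_min) / d_min from one random query to n_points uniform random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

# The contrast drops sharply as the number of dimensions grows.
for dim in (2, 20, 200):
    print(
        dim,
        round(relative_contrast(dim, euclidean), 3),
        round(relative_contrast(dim, manhattan), 3),
    )
```

A small contrast means that the nearest and the farthest point are almost equally far away, which is exactly the situation in which neighborhood-based scores lose their discriminating power.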

Related Work
Hawkins gave a globally accepted definition of an outlier: "An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism" [13]. There are many well-known applications of outlier detection, such as credit card fraud detection, intrusion detection and fault detection [4]. In a broad sense, there are two classes of outlier detection methods, supervised and unsupervised, and the choice between them depends on the nature of the data being processed. The major categories of outlier detection are distance-based, density-based and subspace-based. Comparing these three categories, distance-based approaches produce binary outliers, density-based methods generate scored outliers, and subspace-based techniques create both kinds of outliers, binary and scored. In [46], a local density estimator (variable sample technique) is implemented using the T-Forest algorithm. It splits the data into subspaces, and finally the density of each instance determines its score (outlierness).
The K-means algorithm has been modified to cope with high dimensional problems by introducing multiple centroids and a local search strategy during the iterative process [17,23,41].
Jindi and Huaji [19] used density clustering to examine outliers in data obtained by observing the behavioral characteristics of cows. Density-based clustering can detect local outliers, which distance-based methods cannot find. DBSCAN, a well-known clustering method, explores clusters of arbitrary shape and detects outliers as noise points.
ABOD (Angle Based Outlier Detection) proves to be very efficient for outlier detection in high dimensional data as it is not sensitive to the curse of dimensionality [26]. Broad angles indicate that a data point lies among clusters and is normal, whereas small angles indicate that the observation could be suspected as an outlier.
Yan et al. [44] present a new strategy to optimize the LOF algorithm. It not only reduces the computation cost within each stage but also decreases the communication cost of each stage. LOF requires finding the k-distance, the local reachability density (LRD) and the local outlier factor.
Regression analysis in high dimensional data requires careful investigation to avoid statistical issues such as model misspecification and inappropriate predictions [7]. Wang et al. [43] propose an approach for detecting multiple outliers through multiple testing procedures. To enhance effectiveness, a relatively reliable normal subset of points is obtained by refining the outlier detection rule.
Yuan et al. [45] proposed the neighbor-density-deviation-based outlier factor (NDDOF) algorithm, which can detect outliers amongst clusters of different densities. Further, it can detect outliers within objects having relatively smaller clusters.
Liu et al. [29] introduce a trajectory outlier detection algorithm (TRAOD) which performs poorly on local trajectories, so Lee et al. [27] compensated for this by proposing another density-based trajectory outlier detection algorithm (DBTOD).
Information Technology and Control 2021/1/50
Jindi and Huaji [19] suggest not ignoring the global neighborhood while focusing on local neighbors to detect density-based outliers. The basic purpose is to determine the degree of an outlier compared with other outliers globally and hence improve its rank or score.
Kriegel et al. [25] proposed a very effective search strategy for finding outliers in relevant subspaces, that is, sets of attributes. This strategy is applied to spatial data containing spatial attributes. LoOP (local outlier probabilities) is an algorithm that is very similar to LOF (Local Outlier Factor) except that it does not provide an outlier factor. Rather, it utilizes a probabilistic set distance to measure the probability of a data point being an outlier.
HiCS (high contrast subspaces) resembles subspace outlier detection, but its main distinction is that it explores high contrast subspaces, which are more likely to hold outliers hidden in subsets of attributes [22]. It detects outliers in the dataset using LOF, but other similar methods can also be utilized.
OutRank (outlier ranking) is an algorithm that focuses on finding the rank or score of data points. It measures the outlierness of data points by analyzing subspaces. For this purpose, it exploits the similarity of subspace measurements and subspace clustering methods [31].
Projected Clustering based on K-Means (PCKA) is a partitional distance-based clustering algorithm. PCKA is suitable for relevancy analysis of sets of dimensions called subspaces, but it lacks redundancy analysis. The PCKA proposed in [8] is an improved form, as it performs not only relevancy analysis but also redundancy analysis.
In [6], LOCI is found to be a versatile technique that uncovers a wealth of information by detecting clusters within clusters, but it proves computationally expensive. Feature extraction methods help to reduce redundant and irrelevant data and ultimately help in enhancing the speed of the selected techniques [3]. Dimensionality reduction (DM) has been recognized as a good technique to reduce time complexity, but it is not valid when all dimensions are significant [2,16].
Anomaly detection is often considered a mandatory tool in exploratory data analysis (EDA). Principal component analysis (PCA) has been found to be one of the most popular methods for EDA with high-dimensional data. In particular, two-dimensional projections onto a few leading PC directions have been found beneficial for detecting hidden patterns in HD data [5,12]. Nevertheless, it is known that the estimation of PCA for high-dimensional data is often erratic, so the sample version of PCA may not discover anomalies residing in population PC directions that are not realistically projected [20,34]. Also, since the PC projection plot can only show two directions at a time, it may fail to reveal anomalies that are well concealed in a subspace generated by several PC directions.
Distance-based outlier detection [24] accepts two parameters, a radius ε and a percentage π: a point p is considered an outlier if at most π percent of all other points lie within distance ε of p. kNN distance models are used to determine labeled outliers, where the parameters k and ε determine whether a data point is normal or an outlier [35]. A variant of LOF known as COF (Connectivity-based Outlier Factor) was proposed by Tang et al. [40] and solves the problem of low density versus isolation, since LOF was previously unable to differentiate between low density and isolation of data points. Another variation of LOF is INFLO (Influenced Outlierness) [18], which handles cases in which clusters of different densities cannot be separated clearly. It does so by taking the ratio of the average density of the objects in the vicinity of a point to the density of the point itself.
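The distance-based criterion with a radius ε and a percentage π can be sketched as follows. The sketch assumes one common reading of the rule, namely that a point is an outlier when at most a fraction π of the other points lies within distance ε of it; the function name `db_outliers`, the use of a fraction instead of a percentage, and the toy data are illustrative assumptions.

```python
import math

def db_outliers(points, eps, pi):
    """Indices of points with at most a fraction `pi` of the other points within `eps`."""
    n = len(points)
    flagged = []
    for i, p in enumerate(points):
        # Count the other points that are closer than eps to p.
        close = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) < eps
        )
        if close <= pi * (n - 1):
            flagged.append(i)
    return flagged

# Toy data (assumed): four points on a unit square and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
flagged = db_outliers(points, eps=1.5, pi=0.25)
print(flagged)
```

The output is binary, a point is either flagged or not, which matches the remark above that distance-based approaches produce binary outliers.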
Grid-based subspace outlier detection (GOD) [30] partitions the data space into an equal-depth grid (each attribute range holds the same number of points). After the sparsity coefficient of the k-dimensional grid cells is calculated, data points residing in lower-dimensional cells with a negative sparsity coefficient are marked as outliers.
Rehman and Khan [36] have made an extensive effort to evaluate different proximity functions applied to density-based techniques. Results are analyzed and compared in terms of outlier score, inlier score, time complexity, dimensionality variation and different values of k (minimum points). A novel method, LSC (Local Subspace Classifier), based on feature vector extraction, is used in [15]. LSC determines an outlier measure based on the time increment for the distance applied to the model. This method was improved in terms of computation in [37], which proposes Fast LSC. In this approach, clustering is used to reduce the amount of data, and it hence proves ten times faster than the LSC method.
Tang and He [39] detect outliers based on a distance function utilizing a density-based approach. They use three types of neighborhoods for density estimation: classic k nearest neighbors, reverse nearest neighbors and shared nearest neighbors.
A comprehensive and precise comparison, shown in Table 1, lists the approach adopted, the type of outliers detected and the pros and cons of each methodology.

Problem Statement
Let a database DB contain $d$ dimensions and $N$ data points. Let $D = \{m_1, m_2, \ldots, m_d\}$ denote the set of dimensions and $P = \{p_1, p_2, \ldots, p_N\}$ the set of data points. The given task is to find the distance between any two points of DB, where $(D_m, P_n)$ denotes the value of the $n$th data point in the $m$th dimension. The standard deviation is determined by calculating the variance of the set of attribute values in each dimension (attribute value $p_i$ of dimension $d_k$). Attributes having larger variance are normalized. The local outlier factor is determined by fixing the parameter 'k' (number of neighbors to be processed), which finally gives a score for each point that can be evaluated and compared with the scores of traditional techniques (e.g. INFLO) and subspace-based techniques (e.g. SOD).

Variance Calculation
In this step, each attribute is examined and classified according to the variance present in its data. All attributes classified as having low standard deviation are then normalized. The standard deviation is determined after calculating the variance of each attribute.
The standard deviation for a total population is determined using Equation 1:
$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} \qquad (1)$$
The standard deviation for a sample is determined using Equation 2:
$$s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu)^2} \qquad (2)$$
where $N$, $x_i$ and $\mu$ stand for the number of points, the $i$th data point and the mean value of all data points, respectively.
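The variance step can be sketched with Python's standard `statistics` module, whose `pstdev` and `stdev` functions compute the population and the sample standard deviation, respectively. The text does not state which normalization is applied or how "low" is decided, so the z-score normalization, the `threshold` parameter and the function names below are assumptions made purely for illustration.

```python
import statistics

def column_std(rows, population=True):
    """Standard deviation of every attribute (column) of a row-oriented dataset."""
    std = statistics.pstdev if population else statistics.stdev
    return [std(col) for col in zip(*rows)]

def normalize_low_spread(rows, threshold):
    """Z-score the attributes whose population std is below `threshold` (assumed cutoff)."""
    columns = [list(col) for col in zip(*rows)]
    for j, col in enumerate(columns):
        sigma = statistics.pstdev(col)
        if 0 < sigma < threshold:
            mu = statistics.fmean(col)
            columns[j] = [(x - mu) / sigma for x in col]
    return [list(row) for row in zip(*columns)]

# Toy data (assumed): the first attribute has a much smaller spread than the second.
data = [
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
]
print(column_std(data))                    # population standard deviations
print(column_std(data, population=False))  # sample standard deviations
normalized = normalize_low_spread(data, threshold=1.0)
print(normalized)
```

Only the first attribute falls below the assumed cutoff, so the second attribute is returned unchanged.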

Finding Outlier Degree of each Data Point
The k-distance of a data point 'o', $dis_k(o)$, is determined, where the parameter 'k' (minimum points) is chosen by the user. In the next step, the k-distance neighborhood of the tuple, $Neigh_k(o)$, is calculated. The reachability distance of a neighboring point 'p' with respect to 'o' is measured using equation 3.

$$Rdis_k(p, o) = \max\{dis_k(o),\ dis(p, o)\} \qquad (3)$$
In the last step, we measure the local outlier factor $LOF_k(o)$ of point 'o', which gives the degree of outlierness determined by equation 5.
A higher value of LOF reveals a higher degree of outlierness of a data point, whereas a lower value depicts that point as an inlier.
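The reachability distance feeds into two further quantities, the local reachability density and the local outlier factor. In the standard LOF formulation they can be written as below. This is a sketch in the usual notation: the symbol $lrd_k$ is an assumption, since the text names the local reachability density only in words, and the expressions follow the widely used definition of LOF and are not copied from this paper.

```latex
% Local reachability density: inverse of the mean reachability distance
% of o with respect to the points in its k-distance neighborhood.
lrd_k(o) = \left( \frac{1}{|Neigh_k(o)|} \sum_{p \in Neigh_k(o)} Rdis_k(o, p) \right)^{-1}

% Local outlier factor: mean ratio of the neighbors' densities to the density of o.
LOF_k(o) = \frac{1}{|Neigh_k(o)|} \sum_{p \in Neigh_k(o)} \frac{lrd_k(p)}{lrd_k(o)}
```

Here $Rdis_k(o, p) = \max\{dis_k(p),\ dis(o, p)\}$, that is, the reachability distance with the roles of 'o' and 'p' exchanged. A point whose density matches that of its neighbors obtains a value close to 1.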

Proposed Methodology
Input: numerical data having "d" dimensions and 'N' records
Output: data points with a higher degree of outlierness (high LOF)

Step 1: Apply dimension reduction or search relevant attributes (if applicable)
Step 2: Examine the standard deviation of each attribute and identify those having lower values
Step 3: Normalize all attributes having lower values.
Step 4: Find k-distance of a tuple, $dis_k(o)$.
Step 5: Determine the k-distance neighborhood of a tuple, $Neigh_k(o)$.
Step 6: Find reachability distance of a tuple 'p' with respect to 'o'.

Comparison of Outlierness with Different Perspectives
The top ten outliers are compared in terms of their strength (score) with those of traditional density-based and subspace-based techniques. All data points are assigned a score to differentiate between outliers and inliers using the RapidMiner and ELKI tools. Finally, the results are analysed and discussed to conclude the pros and cons of the proposed technique.

Experimental Work
As described earlier, data points exhibit similar distances when the number of dimensions grows large enough. Hence the local outlier factor of all data points takes a similar score for high dimensional data, reflecting the similarity of all points with respect to distance. Manhattan distance (also known as the taxicab metric) replaces Euclidean geometry with taxicab geometry, in which the distance between two data points is the sum of the absolute differences of their Cartesian coordinates. As the Euclidean distance between two points is never larger than their Manhattan distance, we get different local outlier factor results on the same dataset, as shown in Table 2. We can see that the difference in LOF is higher for Manhattan distance than for Euclidean distance. Hence Manhattan distance appears preferable to Euclidean distance for high dimensional data whenever an outlier detection technique is applied.
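As a small worked illustration of the proximity functions discussed here, the following sketch computes the Manhattan, Euclidean and Squared Euclidean distances for one pair of points. The two points are arbitrary values chosen for the example and do not come from the datasets used in the experiments.

```python
import math

def manhattan(a, b):
    """Sum of absolute coordinate differences (taxicab metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    """Sum of squared coordinate differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

p = (1.0, 2.0, 3.0)
q = (4.0, 6.0, 3.0)

d_manhattan = manhattan(p, q)          # 3 + 4 + 0 = 7
d_euclidean = euclidean(p, q)          # sqrt(9 + 16 + 0) = 5
d_squared = squared_euclidean(p, q)    # 9 + 16 + 0 = 25
```

For the same pair of points the Euclidean distance (5) is smaller than the Manhattan distance (7), so the same data can produce different neighborhoods and LOF values depending on the chosen metric.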
In another experiment, different proximity functions are used to calculate the outlier score while the dimension size of the dataset is gradually increased. As mentioned before, in theory the contrast between the distances separating data points approaches zero as the dimension size approaches infinity. In practice this contrast becomes so small that it cannot differentiate between an outlier (abnormal point) and an inlier (normal point). Figure 3 shows that the average outlier score of all data points declines noticeably as the dimension size grows. Three proximity functions, i.e. Euclidean, Manhattan and Squared Euclidean, are compared to reveal the effect on outlier scores as the number of dimensions increases. Squared Euclidean distance proves effective for density-based outlier detection techniques when the dimension size is large enough.
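
The loss of contrast with growing dimensionality can be reproduced with a small simulation. This is a minimal sketch on uniformly random synthetic data, not on the datasets of this study; the function name `relative_contrast`, the sample size of 200 points and the dimension sizes are assumptions chosen for illustration.

```python
import math
import random

def relative_contrast(n_points, dim, rng):
    """(farthest - nearest) / nearest distance from one query point to the
    remaining points, for points drawn uniformly from the unit hypercube."""
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = pts[0]
    dists = [math.dist(query, p) for p in pts[1:]]   # Euclidean distances
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(0)  # fixed seed so that repeated runs give the same numbers
contrast_by_dim = {d: relative_contrast(200, d, rng) for d in (2, 20, 200)}

for d, c in contrast_by_dim.items():
    print(f"dimensions={d:4d}  relative contrast={c:.3f}")
```

The relative contrast shrinks as the number of dimensions grows, which is why the nearest and farthest neighbors of a point become hard to tell apart and the outlier scores of all points move closer together.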

Figure 3
Effect of dimensionality on outlier score for different proximity functions
In this experimental work, two unsupervised datasets named "Concrete Data" and "Appliances Energy Prediction Dataset" are collected from the UCI Machine Learning Repository. Following the proposed technique, we determine the mean and standard deviation of each attribute of these datasets. Standard deviation is used to measure the spread present in each attribute. Attributes having lower standard deviation are selected for the normalization process. Attributes showing large variance contribute more to any proximity function and hence require no normalization.
Algorithms of the same class are those algorithms which work on the principle of local density. These algorithms, namely COF, INFLO and LoOP, are also known as variants of LOF. Each algorithm calculates the outlier score of each point with respect to the local density of its neighboring data points. In Figure 4 (a) and 4 (b), we evaluate the strength of outliers by comparing the proposed technique with others of the same class. We compare these algorithms in terms of maximum score (max score), minimum score (min score), number of outliers and number of inliers. On the Concrete dataset shown in Figure 4 (a), the outlier scores of the proposed methodology (max score 4.0 and min score 0.93) are relatively higher than those of the others, whereas COF yields a larger number of outliers (889) than the proposed technique. This is because the COF implementation is based on the connectivity of all data points and hence finds several outliers in a more precise way. When we compare the run time of COF and the proposed methodology, however, the proposed one proves better. Figure 4 (b) exhibits experimentation on the Energy dataset and shows results similar to those for the first dataset. The maximum outlier score is higher than that of the other techniques of the same class. There is one exception: the number of outliers for INFLO differs between the two experiments, because each technique treats its neighboring points in a slightly different way.
ABOD and SOD are considered reliable outlier detection techniques for high dimensional data. These two techniques utilize different approaches. The former is vector based and calculates outlier scores from the deviation of the angles of a certain point with respect to other data points. The latter is subspace-based, i.e. different subsets of attributes are used to find an appropriate subspace that holds the outliers embedded in it. Figure 5 (a) and 5 (b) show a comparison of the proposed technique with the angle-based and subspace-based techniques. Figure 5 (a) reveals that the outlier score of the proposed technique is better than that of ABOD, whereas SOD behaves better in terms of both outlier score and number of outliers, because it finds suitable subspaces that contain distinct outliers. As far as time complexity is concerned, however, SOD does not outperform the proposed technique. In the second experiment shown in Figure 5 (b), we observe the expected results in terms of outlier score and number of outliers when compared with the other class of algorithms, i.e. SOD and ABOD.
Time complexity becomes a greater concern for any algorithm or technique when the dimension size of the data is large enough. We have already discussed that the curse of dimensionality causes an exponential rise in run time as the number of dimensions grows. In this research work, we have compared the time complexity of the proposed technique with techniques of the same class and of different classes as well. Figure 6 (a) reveals that the runtime of the proposed methodology is less than that of other techniques of the same class and of different classes. Only INFLO shows better results, but its outlier strength is less than that of the proposed technique. In fact, there exists a tradeoff between accuracy and runtime when comparing with techniques of the same and different classes. Figure 6 (b) also verifies the above claim that the runtime of the proposed technique is very efficient compared with techniques of the same class and different classes as well.
It is a well-established fact that a relationship exists between the number of dimensions and outlier scores. We have described before that when the dimension size is small there is no need to worry, as traditional techniques work effectively and efficiently. A high number of dimensions, i.e., high dimensional data, requires proper selection of the proximity function, as the difference in distance between any two data points should remain visible. The outlier score of each data point is directly proportional to the distance between that point and its neighboring points. As part of the proposed technique, we have determined the variance of each attribute. All attributes having the least spread are normalized so that they do not compromise the effect of attributes having large standard deviation values. Figure 7 demonstrates how the outlier score behaves as the dimension size is increased for different outlier detection techniques. For all methods, the outlier score decreases as the dimension size grows. Compared with the other techniques, the proposed density-based technique yields relatively higher outlier scores for all dimension sizes.
Data miners show more interest in the data points having the highest scores, as these points are likely to contain information that might prove valuable for an organization or company.
The top ten outliers for the proposed and traditional techniques are compared in Figure 8 (a) and 8 (b). The proposed technique shows outstanding results with the COF algorithm, whereas with the INFLO algorithm it is slightly weaker for the third and fourth outliers. Overall, there is an average improvement in outlier scores for the proposed technique compared with traditional local density-based techniques.

Limitation
The proposed technique works on numerical or continuous data only, but it could be adapted to other data types if the distance between data points is quantifiable. For example, the edit distance metric quantifies the distance between words made up of alphabetical letters.
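
To illustrate how a distance can be quantified for non-numerical data, the following is a minimal sketch of the edit (Levenshtein) distance between two words. It only shows the metric itself; plugging it into the proposed technique is not evaluated in this work, and the function name `edit_distance` is an illustrative choice.

```python
def edit_distance(s, t):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions needed to turn s into t."""
    previous = list(range(len(t) + 1))          # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        current = [i]                           # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

d_words = edit_distance("kitten", "sitting")
```

For the words "kitten" and "sitting" the distance is 3 (two substitutions and one insertion). A density-based detector that accepts a user-supplied distance function could use such a metric in place of a numerical proximity function.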

Conclusion
During the last decade, scientists have recognized anomaly detection as a hot research topic in the domain of data mining. Advancement in computer technology has motivated researchers to shift their focus from low dimensional data to high dimensional data. Techniques to investigate high dimensional data can be categorized into two groups: those that explore the full feature space and those that explore only subspaces. Local neighborhood-based techniques like LOF, LOCI, COF and INFLO have proved excellent because of their ability to separate clusters of arbitrary shapes. Unfortunately, the above-mentioned techniques work efficiently only for low dimensional data. High dimensional data requires attention to its inherent issues, namely the similarity of data points and the curse of dimensionality. In the full feature space data points resemble one another, so traditional techniques fail altogether. On the other hand, the accuracy of results is compromised when subspace-based anomaly detection is used. This study differentiates normal and abnormal points through normalized distance metric methods. Each attribute of the data set is examined for its variance so that it is classified and normalized accordingly. In this regard, a local neighborhood-based methodology is adapted to the full feature space to detect anomalies present in high dimensional data.