stranger for the impatient

WeLoveDataScience

2018-03-12

I am so impatient: I want to get some anomalies!

Okay… so you checked that your favorite method is here. Let’s say k-nearest neighbors as implemented in the FNN package.

Very impatient people may detect some suspicious records using lucky_odds - though we recommend being a little more patient: unsupervised anomaly detection requires some patience.

Stop talking: show now!

Loading package.

library(stranger)

The FNN package will be loaded automatically. At this step, make sure it is installed in your environment.

On our favorite classical dataset, we run the computations and directly visualise the top 6 anomalies according to our knn approach.

anomalies <- iris %>% lucky_odds(n.anom=6, weird="knn", analysis.drop="Species")
## Loading required package: FNN
anomalies %>% ggplot(aes(x=Sepal.Length,y=Sepal.Width,color=Species,size=flag_anomaly))+
  geom_point()+
  scale_size_discrete(range=c(1,3))

Uh? What happened?

Yikes! That is very few commands! And I don’t clearly see what happened. Can you explain?

Sure. You asked for 6 anomalies using the knn weird method, computed on all numeric variables of the iris dataset. As this method is your preferred one, you probably know it has (or should have) some parameters.

We defined some defaults for you (with no guarantee that they are the best ones…). Don’t worry, you will be able to modify them according to your preferences and needs.

There are several things that happened in the background - corresponding to stranger analysis flow.

There is clearly no assurance that lucky_odds gives you relevant anomalies, hence the lucky part of the name. This function does have some parameters that give you some control over the outputs, but knowing the detailed process will help you understand those parameters better.

Anomaly detection process with stranger

In stranger, we have decomposed the process into components, each associated with a verb.

* Merge back with data: source data enrichment with either metrics or anomaly flags is performed thanks to dedicated fortify methods.

At each of those steps, the user can fine-tune the analysis through some parameters.

Here is an example that explicitly invokes the whole process step by step.

iris %>% select(-Species) %>%       # remove Species - meaningless to define anomalies
  crazyfy() %>%                     # data preparation: add id, rescale to [0,1], possibly take logs
  strange("knn",k=5) %>%            # compute knn distances to 5 neighbors, aggregate them with the mean (default)
  singularize() %>%                 # required step if no manual selection
  get_anomalies(nmin=6,nmax=10) %>%  # ask for some anomalies
  fortify(data=iris,colname="anomaly") %>%          # enrich iris dataset with a flag called "anomaly" (0/1)
  ggplot(aes(x=Sepal.Length,y=Sepal.Width,color=Species,size=anomaly))+   # let's view!
    geom_point()+
    scale_size_discrete(range=c(1,3))
## 
## Ordering: -1xN_anom_norm_avg and -1xN_anom_norm_avg
##  0 outliers found by default with this stacking and this method, which is not enough to satisfy nmin= 6 . Following most probable outliers are added to selection.
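
For readers who want to see roughly what these verbs amount to, here is a minimal base-R sketch of the same flow. It is not stranger's implementation: the rescaling, the selection rule and the object names (x, score, top, flagged) are assumptions made for illustration, and the records it flags may differ from the ones stranger returns.

```r
# Base-R sketch of the flow above (an illustration, not stranger's own code).

# 1. Preparation: keep the numeric columns and rescale each one to [0, 1].
x <- sapply(iris[, 1:4], function(v) (v - min(v)) / (max(v) - min(v)))

# 2. Metric: mean Euclidean distance to the k = 5 nearest neighbours.
k <- 5
d <- as.matrix(dist(x))
score <- apply(d, 1, function(row) {
  mean(sort(row)[2:(k + 1)])  # position 1 is the record itself (distance 0)
})

# 3. Selection: keep the 6 records with the highest score.
top <- order(score, decreasing = TRUE)[1:6]

# 4. Merge back: add a 0/1 flag called "anomaly" to the source data.
flagged <- iris
flagged$anomaly <- as.integer(seq_len(nrow(iris)) %in% top)

table(flagged$anomaly)
```

The resulting flagged data frame can be passed to the same ggplot call as above.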

Got it! I want more!

Visualize anomalies

Understanding the reasons behind the score of a suspicious record is important: it helps people trust the models even if they don’t fully understand them. Interpretability is also critical to facilitate model validation and the investigation of suspicious cases. Unfortunately, scores produced by machine learning models are usually difficult to explain, especially to non-data scientists. This section shows how the interpretability problem can be partially addressed with a few plots based on surrogate models.

Small regions of the conditional distribution, such as the k-nearest neighbours of an input record and their corresponding scores, can often be approximated by a linear model. By focusing only on such a small region, we can use a simple model (a surrogate model) to explain locally the more complex global model, and visualize this simple model with basic plots. There is, however, no guarantee that the surrogate model is representative of the more complex model, so make sure that its results are in line with human domain knowledge.
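
To make the surrogate idea concrete, here is a small base-R sketch that fits a linear model on the neighbourhood of the highest-scoring record. The score is a hand-computed mean knn distance (k = 10) on rescaled iris columns, and the neighbourhood size of 30 is an arbitrary assumption; none of this is stranger's own code.

```r
# Local linear surrogate around the highest-scoring record (illustrative sketch).
x <- sapply(iris[, 1:4], function(v) (v - min(v)) / (max(v) - min(v)))
d <- as.matrix(dist(x))
k <- 10
score <- apply(d, 1, function(row) mean(sort(row)[2:(k + 1)]))

target <- which.max(score)                  # record to explain
n_local <- 30                               # neighbourhood size: arbitrary choice
nb <- order(d[target, ])[1:(n_local + 1)]   # the record and its 30 nearest neighbours

# Fit the simple model on the local region only.
local <- data.frame(x[nb, ], score = score[nb])
surrogate <- lm(score ~ ., data = local)

coef(surrogate)                # sign and size hint at what drives the score locally
summary(surrogate)$r.squared   # a low value means the surrogate fits the local scores poorly
```

The R-squared gives a first indication of how far the local linear approximation can be trusted before reading anything into the coefficients.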

All the provided plots zoom on a specific record and its k-nearest neighbours, but with a different view each time.

Today, 5 visualizations are implemented: cluster, neighbours, regression_tree, feature_importance and scores_decline.

Those functions are available through plot methods, usable on the objects produced at various steps of the process.

For every graph, a given weird metric must be provided.

Let’s start, for instance, by building a stranger object called is from the iris dataset with all defaults (knn weird with k=10).

is <- iris %>% 
   select(-Species) %>%
   crazyfy() %>%
   strange() 

Similarity between top anomalies

The option type = "cluster" shows a hierarchical clustering of the records with the highest scores around a specific record. The top plot, called "Full Tree", helps you select the number of clusters according to the zoom level you are interested in. The bottom plot, called "Zooming-in", zooms in on the cluster to which the selected record belongs. Finding the common pattern within a cluster (or sub-cluster) should give you insights into what has driven the score.

is %>%   plot(type="cluster",score = "knn_k_10_mean")
## [1] 132
## Your data has been converted to a dataframe to be compatible with ggplot function.
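
The idea behind this plot can be sketched in base R with hclust and cutree. The choices below (top 20 records, complete linkage, 3 clusters) and the hand-computed knn score are assumptions for illustration only; they are not taken from stranger's plot method.

```r
# Hierarchical clustering of the highest-scoring records (illustrative sketch).
x <- sapply(iris[, 1:4], function(v) (v - min(v)) / (max(v) - min(v)))
d <- as.matrix(dist(x))
k <- 10
score <- apply(d, 1, function(row) mean(sort(row)[2:(k + 1)]))

n_top <- 20                                        # how many top records to cluster
top <- order(score, decreasing = TRUE)[1:n_top]    # row indices, highest score first

# "Full tree": cluster the top records on their rescaled features.
tree <- hclust(dist(x[top, ]), method = "complete")
groups <- cutree(tree, k = 3)                      # cut the tree into 3 clusters
names(groups) <- top

# "Zooming in": records sharing the cluster of the highest-scoring record.
same_cluster <- top[groups == groups[1]]
iris[same_cluster, ]
```

Looking at the rows in same_cluster side by side is a simple way to search for the common pattern mentioned above.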

Note on unshown default parameters

A call with all parameters made explicit (except the ones specific to the type of plot), using their default values, could be:

is %>% plot(type = "cluster",
            id = ".id",
            anomaly_id = NULL,
            score = "knn_k_10_mean")
## [1] 132
## Your data has been converted to a dataframe to be compatible with ggplot function.

In the default stranger process, the required id is named .id, so the default value is suitable.

The parameter anomaly_id is used to pick one specific anomaly, identified by its id. With the default value NULL, the record with the highest anomaly score (here based on knn_k_10_mean) is used.

Univariate relationship

The option type = "neighbours" shows a dot plot with a loess regression that emphasises the relationship between the anomaly score and each numerical feature, for a specific record and its k-nearest neighbours.

is %>% plot (type="neighbours",score = "knn_k_10_mean")
## [1] 132
## Your data has been converted to a dataframe to be compatible with ggplot function.
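
As a rough illustration of what such a plot summarises, the sketch below fits one loess curve per feature on the neighbourhood of the highest-scoring record. The hand-computed knn score, the neighbourhood size of 30 and the loess settings (span = 1, degree = 1) are assumptions, not stranger's defaults; with so few points loess may emit warnings.

```r
# Loess trend of the score against each feature, locally (illustrative sketch).
x <- sapply(iris[, 1:4], function(v) (v - min(v)) / (max(v) - min(v)))
d <- as.matrix(dist(x))
k <- 10
score <- apply(d, 1, function(row) mean(sort(row)[2:(k + 1)]))

target <- which.max(score)
nb <- order(d[target, ])[1:31]          # the record and its 30 nearest neighbours
local <- data.frame(x[nb, ], score = score[nb])

# One smooth per feature; each column holds the fitted scores at the observed points.
features <- colnames(x)
trends <- sapply(features, function(f) {
  fit <- loess(reformulate(f, response = "score"), data = local,
               span = 1, degree = 1)
  predict(fit)
})

head(round(trends, 3))
```

Plotting each column of trends against the corresponding feature in local gives one panel per feature.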

Decision tree

The option type = "regression_tree" shows the routes that lead to a high score locally around a specific record. The function uses a pruned rpart regression tree where the score is the dependent variable and all other features are covariates. This approach was inspired by LIME (https://arxiv.org/pdf/1602.04938.pdf).

is %>% plot(type="regression_tree",score = "knn_k_10_mean")
## [1] 132
## Your data has been converted to a dataframe to be compatible with ggplot function.
## Loading required package: rpart
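
A comparable tree can be sketched directly with rpart. The neighbourhood size of 50, the control settings and the pruning rule (lowest cross-validated error) are assumptions for illustration; the way stranger selects neighbours and prunes may differ.

```r
# Pruned regression tree on the local scores (illustrative sketch).
library(rpart)  # recommended package, normally shipped with R

x <- sapply(iris[, 1:4], function(v) (v - min(v)) / (max(v) - min(v)))
d <- as.matrix(dist(x))
k <- 10
score <- apply(d, 1, function(row) mean(sort(row)[2:(k + 1)]))

target <- which.max(score)
nb <- order(d[target, ])[1:51]          # the record and its 50 nearest neighbours
local <- data.frame(x[nb, ], score = score[nb])

set.seed(42)                            # rpart's internal cross-validation is random
fit <- rpart(score ~ ., data = local, method = "anova",
             control = rpart.control(minsplit = 10, cp = 0.001))

# Prune at the complexity parameter with the lowest cross-validated error.
# If the tree has no split, the "xerror" column is absent and nothing is pruned.
cp_table <- fit$cptable
if ("xerror" %in% colnames(cp_table)) {
  best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
  pruned <- prune(fit, cp = best_cp)
} else {
  pruned <- fit
}

print(pruned)
```

Each path from the root to a leaf with a high predicted score is one candidate explanation of what makes records in this region look anomalous.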

Variables importance

The option type = "feature_importance" shows a dot plot sorted by decreasing variable importance. This variable importance is derived from a random forest fitted with the anomaly id as target.

Because of this approach, the graphic is not suited to small data (in terms of either records or columns).

The random forest is fitted with the ranger package for efficiency.

Note that:

* A list of ids may be provided.
* For this plot method, score should be specified but is only used when anomaly_id is NULL (to determine the top anomaly).

is %>% plot (type="feature_importance",score = "knn_k_10_mean")
## [1] 132
## Your data has been converted to a dataframe to be compatible with ggplot function.
## Loading required package: ranger
## Warning: package 'ranger' was built under R version 3.4.3

Score decline

The option type = "scores_decline" shows a dot plot that highlights the decline in anomaly score among the k nearest neighbours of a specific record. The shape of the curve indicates how extreme and how frequent the score of that record is among its neighbours.

is %>% plot(type="scores_decline",score = "knn_k_10_mean")
## [1] 132
## Your data has been converted to a dataframe to be compatible with ggplot function.
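
The quantity behind this plot is easy to compute by hand. The sketch below, which uses a hand-computed mean knn distance as score and is not stranger's own code, sorts the scores of the top record and its k = 10 nearest neighbours; the object names are illustrative.

```r
# Score decline among the neighbours of the top record (illustrative sketch).
x <- sapply(iris[, 1:4], function(v) (v - min(v)) / (max(v) - min(v)))
d <- as.matrix(dist(x))
k <- 10
score <- apply(d, 1, function(row) mean(sort(row)[2:(k + 1)]))

target <- which.max(score)
nb <- order(d[target, ])[1:(k + 1)]     # the record and its k nearest neighbours

# Scores in the neighbourhood, from highest to lowest.
decline <- sort(score[nb], decreasing = TRUE)

# Gap between the top score and the next one in the neighbourhood.
gap <- decline[1] - decline[2]

round(decline, 3)
```

A large gap followed by a flat tail suggests an isolated record, while a smooth decline suggests that the neighbours are nearly as unusual as the record itself.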

To go further…

Explore additional vignettes. The next recommended vignette deals with working with several methods at once:

* Working with several methods () introduces the usage of the stranger function and the possibility to merge and stack (aggregate) methods with singularize.

Then you will be interested in the list of available methods, which you can find here:

* Stranger weirds methods (vignette) lists all available weirds with some information on them.