Run multidetect and clean the data

This process detects outliers with several methods and cleans the data. It ensembles multiple outlier detection methods so that the outliers flagged by each method can be compared, then extracts the final clean data using either the absolute outliers or the outliers from the best-performing method. For more details, please ask BOKU.
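
The ensemble idea described above can be illustrated with a minimal Python sketch. It is not the implementation behind this process: the three detectors, their cutoffs, the function names, and the 0.6 threshold are assumptions chosen for illustration. Each detector flags records independently, and a record counts as an absolute outlier when the proportion of detectors flagging it reaches the threshold.

```python
from statistics import mean, median, quantiles, stdev


def zscore_flags(values, cutoff=2.0):
    """Flag values whose absolute Z-score exceeds the cutoff (illustrative default)."""
    mu, sd = mean(values), stdev(values)
    return [abs(v - mu) / sd > cutoff for v in values]


def iqr_flags(values, k=1.5):
    """Flag values outside the interquartile fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return [v < q1 - k * spread or v > q3 + k * spread for v in values]


def hampel_flags(values, cutoff=3.0):
    """Flag values far from the median, measured in scaled median absolute deviations."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    scaled_mad = 1.4826 * mad  # consistency constant for normally distributed data
    return [abs(v - med) > cutoff * scaled_mad for v in values]


def flag_proportions(values, detectors):
    """For each record, the share of detectors that flagged it."""
    per_method = [detect(values) for detect in detectors]
    return [sum(column) / len(detectors) for column in zip(*per_method)]


def extract_clean(values, detectors, threshold=0.6):
    """Keep records flagged by fewer than `threshold` of the detectors."""
    proportions = flag_proportions(values, detectors)
    return [v for v, p in zip(values, proportions) if p < threshold]


detectors = [zscore_flags, iqr_flags, hampel_flags]
temperatures = [10.1, 10.4, 9.8, 10.0, 10.3, 9.9, 10.2, 35.0]
print(flag_proportions(temperatures, detectors))
print(extract_clean(temperatures, detectors))
```

Comparing the per-method flags before removing anything is the point of the ensemble: a record flagged by only one method can be inspected rather than dropped.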

pan-european use case AquaINFRA R

Inputs
Id	Title	Data Type	Description
input_data	Input table	string	URL of the input table containing the data to be cleaned of outliers: datasets for single or multiple species, from pred_extract or other sources.
colname_variable	Variable of interest	string	Name of the column holding the variable of interest, which is checked for outliers by univariate outlier detection methods such as Z-score, mixed interquantile range, and reverse jackknifing.
select_columns	Specify columns to be checked	string	In a multivariate dataset, if only particular columns need to be checked, they should be indicated here. Otherwise, all columns will be considered in outlier detection.
multiple_species	Distinguish between multiple species	boolean	If NO, then only a single species dataset is expected.
output_type	Output type	string	Whether to return the outliers or the clean dataset. Example: outlier.
group_colname	Column name including group names	string	For multiple groups in a dataframe, provide the column name containing the groups to be checked. For example, a column name with species name in a dataset.
colname_exclude	Column names to exclude	string	Exclude mainly numerical variables that are not necessary in the analysis, for example x and y columns or latitude/longitude, row numbers or serial IDs. Categorical variables are removed automatically in the data preparation.
methods	Outlier detection methods	string	Names of the outlier detection methods to use, as a comma-separated string. Example: "mixediqr, logboxplot, iqr, distboxplot, jknife, semiqr, hampel, iforest, lof, mahal".
silence_true_errors	Silence methods that genuinely fail during outlier detection	boolean	If YES, silence errors for methods that genuinely fail during the outlier detection process, and continue without breaking the other methods.
boot_run	Bootstrapping execution	boolean	If set to YES, then bootstrapping will be done for small samples.
boot_maxrecords	Maximum records to initiate bootstrapping	integer	The maximum number of records for which bootstrapping is initiated. The user can adjust this value.
number_of_boots	Number of bootstraps	integer	The number of bootstraps to generate during bootstrapping.
setseed	Set seed	integer	Bootstrapping generates random samples, so a seed must be set for reproducibility.
boot_threshold	Threshold value to flag a record from bootstrap samples	number	Because a record can be flagged in multiple bootstrap samples, a threshold is required to decide whether it is an outlier. For instance, with 0.6, a record flagged in 6 of 10 bootstraps is flagged as an outlier.
exceute_pca	Execute Principal Component Analysis	boolean	If true, then PCA will be initiated.
number_of_pca	Number of principal components to retain	integer	The user can indicate the maximum number of principal components to retain in the outlier detection.
pca_silence	Hide messages during PCA	boolean	If set to NO, messages produced during PCA are returned. YES silences messages such as the variance explained by the selected PCs.
pcavariable	Select the principal component among the PCs retained to be used as variable of interest	string	PC1 is the variable selected for outlier detection by default. Other components, such as PC2, can also be set, but PC1 is advisable.
sdm_data	Change the outlier detection routine based on data type	boolean	Set to NO for univariate data. Set to YES for all data that requires multivariate analysis, such as kmeans or isolation forest.
inform_na_outlier	Hide messages for removing NAs	boolean	If set to YES, NAs removed will be displayed for each group variable.
missingness	Percentage missing values in a group	number	Proportion of missing values allowed in a column, used to decide whether to remove the column or the affected rows. For instance, 0.1: if a column has more than 10 % missing values, the column is removed from the dataset rather than the rows.
classify_or_autoremove	Either use outlier classification or autoremoval with threshold or LOESS method	boolean	Switches between outlier classification, which labels records on a scale from perfect outlier to fair outlier to allow further scrutiny, and automatic removal, which drops outliers based on a threshold set naively or with the LOESS method.
bool_loess	Data extraction parameter: LOESS	boolean	If set to true, then the local regression (LOESS) method for data extraction will be used.
threshold_clean	Data extraction parameter: Threshold setting	number	If bool_loess is NO, a threshold value can be provided. The threshold determines which records are flagged as absolute outliers.
outlierweights_mode	Data extraction parameter: Outlier weighting	string	Outlier weighting and selection method. abs uses record proportions to identify an absolute outlier. Example: abs
classifymode	Cuts data into classes	integer	Categorizes data in the manner used for interpreting correlation coefficients, following Akoglu 2018. Example: med
eif_bool	Empirical Influence Function	boolean	Computes the Empirical Influence Function for classified outliers.
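
How the bootstrap inputs (boot_run, boot_maxrecords, number_of_boots, setseed, boot_threshold) might interact can be sketched in Python. This is one reading of the descriptions above under stated assumptions, not the routine the process actually runs: the IQR detector, the function names, and the choice to divide by the total number of bootstraps are all hypothetical.

```python
import random
from statistics import quantiles


def iqr_flags(values, k=1.5):
    """Flag values outside the interquartile fences (stand-in for any detector)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return [v < q1 - k * spread or v > q3 + k * spread for v in values]


def bootstrap_flag_proportions(values, number_of_boots=10, setseed=1135):
    """Share of bootstrap samples in which each original record was flagged."""
    rng = random.Random(setseed)  # fixed seed makes the resampling reproducible
    n = len(values)
    hits = [0] * n
    for _ in range(number_of_boots):
        # Resample record indices with replacement, then run the detector.
        drawn = [rng.randrange(n) for _ in range(n)]
        flags = iqr_flags([values[i] for i in drawn])
        # Count each record at most once per bootstrap sample.
        for i in {i for i, flagged in zip(drawn, flags) if flagged}:
            hits[i] += 1
    return [h / number_of_boots for h in hits]


def bootstrap_outliers(values, boot_threshold=0.6, boot_maxrecords=30,
                       number_of_boots=10, setseed=1135):
    """Indices of records flagged in at least `boot_threshold` of the bootstraps.

    Assumption: bootstrapping is only applied to small samples, i.e. when the
    number of records does not exceed `boot_maxrecords`.
    """
    if len(values) > boot_maxrecords:
        return [i for i, flagged in enumerate(iqr_flags(values)) if flagged]
    proportions = bootstrap_flag_proportions(values, number_of_boots, setseed)
    return [i for i, p in enumerate(proportions) if p >= boot_threshold]


depths = [10.1, 10.4, 9.8, 10.0, 10.3, 9.9, 10.2, 35.0]
print(bootstrap_flag_proportions(depths, number_of_boots=10, setseed=42))
print(bootstrap_outliers(depths, boot_threshold=0.6, setseed=42))
```

Running the sketch twice with the same seed returns identical proportions, which is the reproducibility the setseed input is meant to provide.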

Outputs
Id	Title	Description
cleaned_data	Cleaned data	A table with data extracted from the reference dataset set during outlier detection.
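
As a rough illustration of how the inputs above fit together, the following Python sketch assembles an execution request body. It assumes the service follows the OGC API - Processes convention of posting a JSON object with an `inputs` member to `/processes/{id}/execution`. The base URL, the process identifier, the column names, and all example values are placeholders, and the sketch only builds the request without sending it.

```python
import json

BASE_URL = "https://example.org"        # placeholder, not the real service URL
PROCESS_ID = "multidetect-and-clean"    # hypothetical process identifier


def build_execute_request(input_table_url):
    """Return the (url, json_body) pair for a hypothetical execution request."""
    inputs = {
        "input_data": input_table_url,
        "colname_variable": "temperature",   # hypothetical column name
        "multiple_species": True,
        "group_colname": "species",          # hypothetical column name
        "colname_exclude": "x,y",
        "methods": "mixediqr, logboxplot, iqr, hampel",
        "output_type": "outlier",
        "boot_run": False,
        "bool_loess": False,
        "threshold_clean": 0.8,              # illustrative value only
        "outlierweights_mode": "abs",
        "classifymode": "med",
    }
    url = f"{BASE_URL}/processes/{PROCESS_ID}/execution"
    return url, json.dumps({"inputs": inputs}, indent=2)


url, body = build_execute_request("https://example.org/data/species_table.csv")
print(url)
print(body)
```

The body could then be sent with any HTTP client. Which inputs are mandatory, and whether booleans are accepted as true/false or as YES/NO, should be checked against the service's own process description.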
