Run multidetect and clean the data

This process detects outliers using several methods and cleans the data. It ensembles multiple outlier detection methods so that the outliers flagged by each method can be compared, then extracts the final clean data using either the absolute outliers or the outliers generated by the best method. For more details, please ask BOKU.
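The ensemble idea can be illustrated with a minimal sketch. It is not the process's implementation: the three detectors below are simplified stand-ins for the Z-score, iqr and hampel methods, the cutoffs are common defaults chosen here as assumptions, and treating an "absolute" outlier as a record flagged by at least a given proportion of methods is also an assumption.

```python
from statistics import mean, median, quantiles, stdev


def zscore_flags(values, cutoff=3.0):
    """Flag values whose absolute Z-score exceeds the cutoff."""
    m, s = mean(values), stdev(values)
    if s == 0:
        return [False] * len(values)
    return [abs(v - m) / s > cutoff for v in values]


def iqr_flags(values, k=1.5):
    """Flag values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v < low or v > high for v in values]


def hampel_flags(values, cutoff=3.0):
    """Flag values far from the median, scaled by the median absolute deviation."""
    med = median(values)
    mad = 1.4826 * median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)
    return [abs(v - med) / mad > cutoff for v in values]


def flagged_proportions(values, detectors):
    """For each record, the share of detectors that flag it."""
    per_method = [detect(values) for detect in detectors]
    return [sum(flags) / len(detectors) for flags in zip(*per_method)]


# Hypothetical measurements with one obviously extreme record at index 9.
values = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 10.0, 55.0]
detectors = [zscore_flags, iqr_flags, hampel_flags]

proportions = flagged_proportions(values, detectors)
threshold = 0.6  # assumed cutoff on the share of methods that must agree
absolute_outliers = [i for i, p in enumerate(proportions) if p >= threshold]
clean = [v for i, v in enumerate(values) if i not in absolute_outliers]

print(absolute_outliers)
print(clean)
```

In this sample the extreme record is flagged by two of the three detectors. With only ten records a Z-score cannot exceed about 2.85, so that detector stays silent, which is one reason to compare several methods rather than rely on a single one.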

pan-european use case AquaINFRA R

Inputs
Id Title Data Type Description
input_data Input table string URL to the input table containing the data to be cleaned of outliers: datasets for multiple species or a single species, from pred_extract or other sources.
colname_variable Variable of interest string Column name identifying the variable of interest in which outliers will be checked by univariate outlier detection methods such as Z-score, mixed interquartile range, and reverse jackknifing.
select_columns Specify columns to be checked string In a multivariate dataset, if only particular columns need to be checked, indicate them here. Otherwise, all columns will be considered in outlier detection.
multiple_species Distinguish between multiple species boolean If NO, then only a single species dataset is expected.
output_type Output type string Whether to return the outliers or the clean dataset. Example: outlier.
group_colname Column name including group names string For multiple groups in a dataframe, provide the name of the column containing the groups to be checked, for example the column holding species names.
colname_exclude Column names to exclude string Exclude mainly numerical variables that are not necessary in the analysis, for example x and y columns or latitude/longitude, row numbers or serial IDs. Categorical variables are removed automatically in the data preparation.
methods Outlier detection methods string The names of the outlier detection methods to use, as a comma-separated string. Example: "mixediqr, logboxplot, iqr, distboxplot, jknife, semiqr, hampel, iforest, lof, mahal".
silence_true_errors Silence methods that genuinely fail during outlier detection boolean If YES, errors from methods that genuinely fail during outlier detection are silenced, and the process continues without breaking the other methods.
boot_run Bootstrapping execution boolean If set to YES, then bootstrapping will be done for small samples.
boot_maxrecords Maximum records to initiate bootstrapping integer The user can adjust the maximum number of records for which bootstrapping is initiated.
number_of_boots Number of bootstraps integer The number of bootstraps to generate during bootstrapping.
setseed Set seed integer Bootstrapping generates random samples, so a seed must be set for reproducibility.
boot_threshold Threshold value to flag a record from bootstrap samples number Because a record can be flagged in multiple bootstrap samples, a threshold is required to decide whether it is an outlier. For instance, with 0.6, a record flagged in 6 of 10 bootstraps is flagged as an outlier.
exceute_pca Execute Principal Component Analysis boolean If true, PCA will be initiated.
number_of_pca Number of principal components to retain integer The user can indicate the maximum number of principal components to retain in the outlier detection.
pca_silence Hide messages during PCA analysis boolean If set to NO, messages produced during PCA are returned. If set to YES, messages such as the variance explained by the selected PCs are silenced.
pcavariable Select the principal component among the PCs retained to be used as variable of interest string The principal component used as the variable for outlier detection, for example PC1. Other components such as PC2 can also be set, but PC1 is advisable.
sdm_data Change the outlier detection routine based on data type boolean Set to NO for univariate data. Set to YES for data that requires multivariate analysis, for example with k-means or isolation forest.
inform_na_outlier Hide messages for removing NAs boolean If set to YES, the NAs removed are reported for each group variable.
missingness Percentage missing values in a group number The allowed proportion of missing values in a column, which determines whether individual columns or rows are removed from the dataset. For instance, with 0.1, a column with more than 10% missing values is removed from the dataset rather than the rows.
classify_or_autoremove Either use outlier classification or autoremoval with threshold or LOESS method boolean Switches between outlier classification, which labels records on a scale from perfect outlier to fair outlier to allow further scrutiny, and automatic removal, in which outliers are dropped based on a threshold set naively or by the LOESS method.
bool_loess Data extraction parameter: LOESS boolean If set to true, the local regression (LOESS) method will be used for data extraction.
threshold_clean Data extraction parameter: Threshold setting number If Data extraction parameter: LOESS (bool_loess) is NO, a threshold value can be provided. The threshold largely determines which records are flagged as absolute outliers.
outlierweights_mode Data extraction parameter: Outlier weighting string Outlier weighting and selection method. abs uses the proportion in which a record is flagged to identify an absolute outlier. Example: abs
classifymode Cuts data into classes integer Categorizes data in the manner used for correlation coefficient classes, following Akoglu 2018. Example: med
eif_bool Empirical Influence Function boolean Computes the Empirical Influence Function for classified outliers.
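The boot_threshold rule from the inputs above (0.6 means flagged in 6 of 10 bootstraps) can be sketched as follows. This is a minimal illustration, not the process's code: the function name, the record identifiers and the flag counts are hypothetical, and the use of "greater than or equal" follows the 6-of-10 example in the description.

```python
def outliers_from_bootstraps(flag_counts, number_of_boots, boot_threshold):
    """Return the records flagged in at least boot_threshold of the bootstraps.

    flag_counts maps a record identifier to the number of bootstrap samples
    in which that record was flagged as an outlier.
    """
    if number_of_boots <= 0:
        raise ValueError("number_of_boots must be positive")
    return [
        record
        for record, count in flag_counts.items()
        if count / number_of_boots >= boot_threshold
    ]


# Hypothetical counts from 10 bootstrap samples.
flag_counts = {"rec_1": 6, "rec_2": 5, "rec_3": 10, "rec_4": 0}
flagged = outliers_from_bootstraps(flag_counts, number_of_boots=10, boot_threshold=0.6)
print(flagged)
```

Here rec_1 meets the threshold exactly (6 of 10) and rec_2 falls just short (5 of 10), so only rec_1 and rec_3 are returned.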
Outputs
Id Title Description
cleaned_data Cleaned data A table with data extracted from the reference dataset set during outlier detection.
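As an illustration of how the inputs listed above fit together, the sketch below assembles a request body in the {"inputs": {...}} shape used by OGC API Processes. Only a subset of the inputs is shown. The table URL and the column names are hypothetical, the values are examples taken from the descriptions where available, and passing the YES/NO switches as JSON booleans is an assumption based on the listed data types.

```python
import json


def build_request_body(input_table_url):
    """Assemble an execute-request body with a subset of the documented inputs."""
    inputs = {
        "input_data": input_table_url,
        "colname_variable": "temperature",      # hypothetical column name
        "multiple_species": True,
        "group_colname": "species",             # hypothetical column name
        "colname_exclude": "x,y",               # hypothetical coordinate columns
        "output_type": "outlier",               # example value from the description
        "methods": "mixediqr, logboxplot, iqr, distboxplot, jknife, "
                   "semiqr, hampel, iforest, lof, mahal",
        "silence_true_errors": True,
        "boot_run": False,
        "missingness": 0.1,                     # example value from the description
        "bool_loess": True,
        "outlierweights_mode": "abs",           # example value from the description
    }
    return {"inputs": inputs}


# The URL below is a placeholder, not a real dataset.
body = json.dumps(build_request_body("https://example.org/species_table.csv"), indent=2)
print(body)
```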

Execution modes

  • Synchronous
  • Asynchronous
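In OGC API Processes, a client usually chooses between the two modes with the HTTP Prefer header: sending "Prefer: respond-async" asks for asynchronous execution, and omitting it leaves the server to respond synchronously where supported. The sketch below only constructs the request objects and sends nothing. The execution URL and process identifier are placeholders, and whether this deployment honours the header in exactly this way is an assumption.

```python
import json
import urllib.request


def make_execute_request(execution_url, inputs, asynchronous=False):
    """Build (but do not send) a POST request for a process execution."""
    headers = {"Content-Type": "application/json"}
    if asynchronous:
        # Ask the server to create a job and return immediately.
        headers["Prefer"] = "respond-async"
    data = json.dumps({"inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        execution_url, data=data, headers=headers, method="POST"
    )


# Placeholder endpoint; substitute the real base URL and process id.
url = "https://example.org/processes/PROCESS_ID/execution"
inputs = {"input_data": "https://example.org/species_table.csv"}

sync_request = make_execute_request(url, inputs)
async_request = make_execute_request(url, inputs, asynchronous=True)

print(sync_request.has_header("Prefer"))
print(async_request.get_header("Prefer"))
```

For an asynchronous request, a conforming server typically answers with a job status document and a Location header pointing at the job, which the client can then poll until the cleaned data is ready.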