Run multidetect and clean the data
This process detects outliers using several methods and cleans the data. It ensembles multiple outlier detection methods so that the outliers flagged by each method can be compared, then extracts the final clean data using either the absolute outliers or the outliers generated by the best method. For more details, please ask BOKU.
pan-european use case AquaINFRA R
| Id | Title | Data Type | Description |
|---|---|---|---|
| input_data | Input table | string | URL to the input table containing the data to be cleaned of outliers: data sets for multiple or single species from pred_extract and other sources. |
| colname_variable | Variable of interest | string | Name of the column holding the variable of interest, in which univariate outlier detection methods such as Z-score, mixed interquartile range, and reverse jackknifing check for outliers. |
| select_columns | Specify columns to be checked | string | In a multivariate dataset, if only particular columns need to be checked, indicate them here. Otherwise, all columns will be considered in outlier detection. |
| multiple_species | Distinguish between multiple species | boolean | If NO, then only a single species dataset is expected. |
| output_type | Output type | string | Set whether you want to return outliers or clean dataset. Example: outlier. |
| group_colname | Column name including group names | string | For multiple groups in a dataframe, provide the column name containing the groups to be checked. For example, a column name with species name in a dataset. |
| colname_exclude | Column names to exclude | string | Exclude mainly numerical variables that are not necessary in the analysis, for example x and y columns or latitude/longitude, row numbers or serial IDs. Categorical variables are removed automatically in the data preparation. |
| methods | Outlier detection methods | string | The names of the outlier detection methods to use, as a comma-separated string. Example: "mixediqr, logboxplot, iqr, distboxplot, jknife, semiqr, hampel, iforest, lof, mahal". |
| silence_true_errors | Silence methods that genuinely fail during outlier detection | boolean | If YES, errors from methods that genuinely fail during the outlier detection process are silenced, and the process continues without breaking the other methods. |
| boot_run | Bootstrapping execution | boolean | If set to YES, then bootstrapping will be done for small samples. |
| boot_maxrecords | Maximum records to initiate bootstrapping | integer | The maximum number of records a sample may have for bootstrapping to be initiated. The user can adjust this value. |
| number_of_boots | Number of bootstraps | integer | The number of bootstraps to generate during bootstrapping. |
| setseed | Set seed | integer | Bootstrapping generates random samples, so a seed must be set for reproducibility. |
| boot_threshold | Threshold value to flag a record from bootstrap samples | number | Because a record can be flagged in multiple bootstrap samples, a threshold is required to decide whether it is an outlier. For instance, 0.6 means that a record flagged in 6 of 10 bootstraps will be flagged as an outlier. |
| exceute_pca | Execute Principal Component Analysis | boolean | If true, then PCA will be initiated. |
| number_of_pca | Number of principal components to retain | integer | The user can indicate the maximum number of principal components to retain in the outlier detection. |
| pca_silence | Hide messages during PCA analysis | boolean | Messages during PCA analysis are returned if set to NO. YES silences messages such as the variance explained by the selected PCs. |
| pcavariable | Select the principal component among the PCs retained to be used as variable of interest | string | PC1 is the variable selected for outlier detection. Other components, such as PC2, can also be set, but PC1 is advisable. |
| sdm_data | Change the outlier detection routine based on data type | boolean | If the data are univariate, set to NO. For data that require multivariate analysis, for example with k-means or isolation forest, set to YES. |
| inform_na_outlier | Hide messages for removing NAs | boolean | If set to YES, NAs removed will be displayed for each group variable. |
| missingness | Percentage of missing values in a group | number | The proportion of missing values allowed in a column, which lets the user decide whether individual columns or rows are removed from the data set. For instance, 0.1: if a column has more than 10 % missing values, the column is removed from the dataset rather than the rows. |
| classify_or_autoremove | Either use outlier classification or autoremoval with threshold or LOESS method | boolean | This parameter switches to outlier classification, which labels records on a scale from perfect outlier to fair outlier to allow further scrutiny. Otherwise, the outliers are dropped based on a threshold that is set naively or with the LOESS method. |
| bool_loess | Data extraction parameter: LOESS | boolean | If set to true, then the local regression method for data extraction will be used. |
| threshold_clean | Data extraction parameter: Threshold setting | number | If Data extraction parameter: LOESS is NO, a threshold value can be provided. The threshold largely determines which records are flagged as absolute outliers. |
| outlierweights_mode | Data extraction parameter: Outlier weighting | string | Outlier weighting and selection method. abs uses record proportions to identify an absolute outlier. Example: abs |
| classifymode | Cut data into classes | integer | Categorizes the data in the manner of correlation coefficient classes, following Akoglu 2018. Example: med |
| eif_bool | Empirical Influence Function | boolean | Computes the Empirical Influence Function for classified outliers. |
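
Two of the numeric parameters in the table, boot_threshold and missingness, come with worked examples (0.6 and 0.1). The Python sketch below illustrates one plausible reading of those two rules. It is not the implementation used by the process. The function names, the use of None for missing values, and the choice to drop incomplete rows after the column filter are assumptions made for illustration only.

```python
from typing import Optional

# A record is a mapping from column name to a value; None marks a missing value
# (an assumption of this sketch, not something the process description states).
Row = dict[str, Optional[float]]


def is_bootstrap_outlier(times_flagged: int, number_of_boots: int,
                         boot_threshold: float) -> bool:
    """Flag a record when the share of bootstraps flagging it reaches the threshold."""
    return times_flagged / number_of_boots >= boot_threshold


def apply_missingness(rows: list[Row], missingness: float) -> list[Row]:
    """Drop columns whose missing share exceeds `missingness`.

    Assumption: rows that still contain a missing value in the remaining
    columns are dropped afterwards.
    """
    if not rows:
        return []
    n_rows = len(rows)
    kept_columns = [
        col for col in rows[0]
        if sum(row[col] is None for row in rows) / n_rows <= missingness
    ]
    return [
        {col: row[col] for col in kept_columns}
        for row in rows
        if all(row[col] is not None for col in kept_columns)
    ]


if __name__ == "__main__":
    # Flagged in 6 of 10 bootstraps with a threshold of 0.6.
    print(is_bootstrap_outlier(6, 10, 0.6))
```

With missingness set to 0.1, a column with 30 % missing values would be removed as a whole, while a column with exactly 10 % missing values would be kept and only its incomplete rows removed (under the assumption stated above).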
| Id | Title | Description |
|---|---|---|
| cleaned_data | Cleaned data | A table with data extracted from the reference dataset set during outlier detection. |
Educational resources and documentation
Jobs
Browse jobs
Links
- GitHub repo (text/html)
- Vignette containing this example (text/html)
- Process description as JSON (application/json)
- Process description as HTML (text/html)
- Jobs list as HTML (text/html)
- Jobs list as JSON (application/json)
- Execution for this process as JSON (application/json)
Execution modes
- Synchronous
- Asynchronous
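
As a rough illustration of the two execution modes, the sketch below builds (but does not send) an execute request in the style of OGC API Processes, where inputs are posted as JSON under an `inputs` key and asynchronous execution is requested with a `Prefer: respond-async` header. The base URL, the process identifier, and all input values are hypothetical placeholders. Only the input Ids are taken from the table above, including the spelling `exceute_pca`. Passing booleans as JSON true/false rather than YES/NO is also an assumption.

```python
import json
import urllib.request

# Hypothetical placeholders: replace with the real server and process id.
BASE_URL = "https://example.org/pygeoapi"
PROCESS_ID = "multidetect-and-clean"


def build_execute_request(inputs: dict, asynchronous: bool = False) -> urllib.request.Request:
    """Build a POST request for the process execution endpoint without sending it."""
    headers = {"Content-Type": "application/json"}
    if asynchronous:
        # Ask the server to create a job instead of answering directly.
        headers["Prefer"] = "respond-async"
    body = json.dumps({"inputs": inputs}).encode("utf-8")
    url = f"{BASE_URL}/processes/{PROCESS_ID}/execution"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Example inputs; every value here is illustrative, not a recommended setting.
example_inputs = {
    "input_data": "https://example.org/data/species.csv",  # hypothetical URL
    "colname_variable": "temperature",                      # hypothetical column
    "multiple_species": True,
    "group_colname": "species",                             # hypothetical column
    "output_type": "outlier",
    "methods": "mixediqr, logboxplot, iqr",
    "boot_run": False,
    "exceute_pca": False,  # spelled as in the inputs table
    "missingness": 0.1,
    "outlierweights_mode": "abs",
}

request = build_execute_request(example_inputs, asynchronous=True)
# To actually run the process: urllib.request.urlopen(request)
```

In asynchronous mode the server is expected to respond with a job reference that can then be followed up through the jobs list. In synchronous mode the response would carry the result directly.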
