Commit a2d9bbf1 authored by Weigert, Andreas

added exercise script, data, and additional functions for Tutorial 5

parent 83aaed97
---
title: 'Tutorial 5: Data understanding and preparation'
output: html_notebook
editor_options:
  chunk_output_type: inline
---
This file is part of the lecture Business Intelligence & Analytics (EESYS-BIA-M), Information Systems and Energy Efficient Systems, University of Bamberg.
# Read the APC dataset
```{r Read data}
APC <- read.csv2("../data/APC-dataset-anonym.csv")
```
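`read.csv2()` expects semicolon-separated fields and a decimal comma, the convention common in German CSV exports. The following is a minimal sketch on an inline toy table; the column names `ID` and `consumption` are invented for illustration and are not taken from the APC file.

```r
# Toy CSV text in the format read.csv2() expects:
# ';' as field separator and ',' as decimal mark.
csv_text <- "ID;consumption\n1;12,5\n2;7,25"

# The `text` argument is passed on to read.table(), so no file is needed here.
toy <- read.csv2(text = csv_text)

# The decimal comma is parsed into a numeric column.
str(toy)
```

If a file had been exported with commas as separators and a decimal point, `read.csv()` would be the matching reader instead.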
# Overview of the APC dataset
```{r Get an overview I}
```
```{r Get an overview II}
```
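One possible way to get a first overview, sketched on a hypothetical stand-in data frame (the columns `ID`, `PLZ`, and `consumption` are assumptions, not the real APC schema): check the dimensions, the column classes, summary statistics, and the number of missing values per column.

```r
# Hypothetical stand-in for the APC data; names and values are invented.
toy <- data.frame(
  ID = c(1, 2, 3),
  PLZ = c("96047", NA, "96052"),
  consumption = c(3200, NA, 4100),
  stringsAsFactors = FALSE
)

dim(toy)      # number of rows and columns
str(toy)      # class and first values of every column
summary(toy)  # per-column summary statistics

# Count missing values per column.
na_per_column <- colSums(is.na(toy))
na_per_column
```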
# Data preparation
## Convert all columns to a suitable class (excluding consumption)
```{r Data preparation}
# Task 1: most suitable class
# convert ID to factor
# convert postal code to factor
# convert strasse to character
# convert Betreff to character
# convert FilterNonHousehold to boolean
# convert portal usage to boolean
# Task 2: Correlation of NA values in post code and street name
# Are NAs in street and PLZ the same? - yes: 63 households have neither Street nor postal code given
# Task 3: Betreff
# Task 4: Differences between NA-like values
# Task 5:
# Convert dates to correct format - read the help page of "strptime()"
# Check the result of the data conversion part
```
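The conversion tasks above could be approached along the following lines. This is a sketch under stated assumptions, not the reference solution: the column names, the 0/1 encoding of the portal flag, and the `day.month.year` date format are all hypothetical and must be checked against the real data.

```r
# Hypothetical toy data; column names and encodings are assumptions.
toy <- data.frame(
  ID = c(101, 102, 103),
  PLZ = c(96047, NA, 96052),
  Strasse = factor(c("Hauptstr.", NA, "Am Kanal")),
  PortalUsage = c(1, 0, 1),
  StartDate = c("01.02.2014", "15.03.2014", "30.11.2014"),
  stringsAsFactors = FALSE
)

# Identifiers and postal codes are labels, not quantities: store them as factors.
toy$ID <- as.factor(toy$ID)
toy$PLZ <- as.factor(toy$PLZ)

# Free text belongs in a character vector.
toy$Strasse <- as.character(toy$Strasse)

# 0/1 flags become logical values.
toy$PortalUsage <- as.logical(toy$PortalUsage)

# Cross-tabulate missingness to see whether NAs occur in the same rows.
na_table <- table(
  PLZ_missing = is.na(toy$PLZ),
  Strasse_missing = is.na(toy$Strasse)
)
na_table

# Parse the date strings; the format string has to match the raw data.
toy$StartDate <- as.Date(strptime(toy$StartDate, format = "%d.%m.%Y"))
str(toy)
```

In the cross table, off-diagonal counts of zero indicate that the two columns are always missing together.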
## Convert consumption data
```{r Data preparation for Consumption values}
# Task 6: Negative consumption values
# Exclude these values by setting them to NA
# Task 7: Finding pairs of NA / non-NA values
# No change required
# Task 8: Calculate normalized consumption
# set the infinite values to NA
```
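A minimal sketch of Tasks 6 and 8 on invented vectors. The tutorial does not state which denominator is used for normalization, so dividing by the length of the billing period (`period_days`) is an assumption made purely for illustration.

```r
# Hypothetical consumption values (kWh) and billing period lengths (days).
consumption <- c(3200, -50, NA, 4100, 2800)
period_days <- c(320, 365, 360, 0, 350)

# Negative consumption is implausible: set it to NA rather than dropping the row.
consumption[!is.na(consumption) & consumption < 0] <- NA

# Normalize to consumption per day; a period of 0 days yields Inf.
consumption_per_day <- consumption / period_days

# Replace infinite results with NA so they do not distort later summaries.
consumption_per_day[is.infinite(consumption_per_day)] <- NA
consumption_per_day
```

Setting values to `NA` keeps the row count stable, which makes it easier to compare filters later on.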
# Parts of a data quality report
```{r}
# Task 9: Data exploration report for consumption values
```
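A data exploration report for a numeric column typically lists the count, the share of missing values, and location and spread statistics. The following sketch assembles such a report for an invented consumption vector; the selection of statistics is one reasonable choice, not a prescribed one.

```r
# Hypothetical consumption values with one missing entry.
consumption <- c(3200, NA, 4100, 2800, 15000, 3500)

# Collect the report entries in a named numeric vector.
report <- c(
  n = length(consumption),
  n_missing = sum(is.na(consumption)),
  share_missing = mean(is.na(consumption)),
  min = min(consumption, na.rm = TRUE),
  median = median(consumption, na.rm = TRUE),
  mean = mean(consumption, na.rm = TRUE),
  max = max(consumption, na.rm = TRUE),
  sd = sd(consumption, na.rm = TRUE)
)
round(report, 2)
```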
```{r}
# test whether the data is normally distributed
```
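One common option for the normality check is the Shapiro-Wilk test via `shapiro.test()`, which accepts between 3 and 5000 non-missing values. The sketch below uses simulated right-skewed data as a stand-in for consumption values; whether the tutorial intends this test or a graphical check is not stated.

```r
# Simulated, right-skewed stand-in data (log-normal), reproducible via the seed.
set.seed(42)
simulated <- rlnorm(200, meanlog = 8, sdlog = 0.5)

# Null hypothesis: the sample comes from a normal distribution.
# A small p-value speaks against normality.
result <- shapiro.test(simulated)
result$p.value

# Skewed data are often closer to normal on the log scale.
result_log <- shapiro.test(log(simulated))
result_log$p.value
```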
# Identification of outliers
## Identification of outliers using boxplots
```{r Identify outlier and filter data}
# Task 10: Identify outliers with boxplot
```
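`boxplot()` draws points beyond the whiskers as outliers, and `boxplot.stats()` returns the same points without plotting. A short sketch on invented values:

```r
# Hypothetical consumption values with one suspiciously large entry.
consumption <- c(2800, 3000, 3200, 3300, 3500, 4100, 15000)

# `out` holds the values lying beyond the whiskers (coef = 1.5 by default).
stats <- boxplot.stats(consumption)
stats$out

# Flag the corresponding positions so the rows can be filtered later.
is_boxplot_outlier <- consumption %in% stats$out
which(is_boxplot_outlier)
```

Note that `boxplot.stats()` computes the box from hinges (`fivenum()`), which can differ slightly from the quartiles returned by `quantile()`.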
## Identification of outliers with IQR and sigma function
```{r}
# Task 11: Identify outliers with IQR and sigma functions
source("../R/functions/outlier_identification.R")
# apply the functions
```
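To illustrate how the two rules can disagree, the sketch below applies stand-in implementations of the textbook 1.5 x IQR rule and the 3-sigma rule to an invented vector. The names `flag_iqr` and `flag_sigma` are hypothetical and defined here only so the snippet runs on its own; they are not the contents of `outlier_identification.R`.

```r
# Stand-in for the 1.5 x IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
flag_iqr <- function(v) {
  q <- quantile(v, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  (v > q[2] + fence) | (v < q[1] - fence)
}

# Stand-in for the sigma rule: flag values whose absolute z-score exceeds `sigma`.
flag_sigma <- function(v, sigma = 3) {
  z <- (v - mean(v, na.rm = TRUE)) / sd(v, na.rm = TRUE)
  abs(z) > sigma
}

# Hypothetical consumption values with one extreme entry.
consumption <- c(2800, 3000, 3200, 3300, 3500, 4100, 15000)

iqr_flag <- flag_iqr(consumption)
sigma_flag <- flag_sigma(consumption)

which(iqr_flag)    # the extreme value lies above the upper fence
which(sigma_flag)  # empty: with only seven values no z-score can exceed 3
```

The extreme value inflates both the mean and the standard deviation, so in small samples the sigma rule can mask exactly the points it is meant to find; the quartile-based rule is more robust in that situation.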
```{r}
# Task 12: Data quality report by using filter for missing data
# we create filter columns for different aspects:
# 1) address data,
# 2) consumption data existent,
# 3) portal usage data
```
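The three filter columns described above could be built as logical vectors that are `TRUE` where the respective data exist. This is a sketch on a hypothetical data frame; the column names are assumptions.

```r
# Hypothetical stand-in data; names and values are invented.
toy <- data.frame(
  PLZ = c("96047", NA, "96052", NA),
  Strasse = c("Hauptstr.", NA, "Am Kanal", "Ringweg"),
  consumption = c(3200, 4100, NA, 2800),
  PortalUsage = c(TRUE, NA, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# 1) Address data: both postal code and street are present.
toy$has_address <- !is.na(toy$PLZ) & !is.na(toy$Strasse)
# 2) Consumption data exist.
toy$has_consumption <- !is.na(toy$consumption)
# 3) Portal usage data exist.
toy$has_portal_data <- !is.na(toy$PortalUsage)

# Share of rows passing each filter.
filter_cols <- c("has_address", "has_consumption", "has_portal_data")
share <- colMeans(toy[, filter_cols])
share

# Rows that pass all three filters.
toy$complete <- toy$has_address & toy$has_consumption & toy$has_portal_data
sum(toy$complete)
```

Keeping the filters as separate columns makes it possible to report data quality per aspect and to combine the filters as needed.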
# Outlier Identification
# Function to identify outliers with the 1.5 x interquartile range (IQR) rule:
# values above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR are flagged.
is.outlier_IQR <- function(v){
  quant_v <- quantile(v, na.rm = TRUE)
  whisker_def <- 1.5 * (quant_v["75%"] - quant_v["25%"])
  return((v > quant_v["75%"] + whisker_def) | (v < quant_v["25%"] - whisker_def))
}
# Function to identify outliers with the sigma rule:
# values whose absolute z-score exceeds `sigma` are flagged.
is.outlier_sigma <- function(v, sigma = 3){
  z_score <- (v - mean(v, na.rm = TRUE)) / sd(v, na.rm = TRUE)
  return(abs(z_score) > sigma)
}
\ No newline at end of file