Diffstat (limited to 'KOR.addrlink/vignettes/Example.Rnw')
-rw-r--r-- KOR.addrlink/vignettes/Example.Rnw 143
1 files changed, 143 insertions, 0 deletions
diff --git a/KOR.addrlink/vignettes/Example.Rnw b/KOR.addrlink/vignettes/Example.Rnw
new file mode 100644
index 0000000..aef9c2e
--- /dev/null
+++ b/KOR.addrlink/vignettes/Example.Rnw
@@ -0,0 +1,143 @@
+\documentclass{article}
+%\VignetteIndexEntry{Example}
+\usepackage[utf8]{inputenc}
+\begin{document}
+\SweaveOpts{concordance=TRUE}
+\title{Using KOR.addrlink}
+\author{Daniel Sch\"urmann}
+\date{February 29, 2024}
+\maketitle
+
+\section{Introduction}
+
+Consider a data set with semi-structured address data, e.g. street and house number as a concatenated string,
+misspelled street names or non-existent house numbers. This data set (referred to as df\_match) should be
+mapped to a complete list of valid addresses within the given municipality. The latter data set is
+called df\_ref and may include further information like coordinates of addresses or district information.
+KOR.addrlink tries to solve this problem specifically for German municipalities, as the package is specialized
+for German address schemes.
+
+\section{Reference data}
+
+First, a complete list of reference addresses (df\_ref) is needed. An example
+data.frame named "Adressen" is shown below.
+
+<<>>=
+library(KOR.addrlink)
+Adressen[c(sample(which(is.na(Adressen$HNRZ)), 4),
+ sample(which(!is.na(Adressen$HNRZ)), 2)),]
+@
+
+The columns used for the matching procedure are STRNAME (street name), HNR (house number)
+and HNRZ (additional letter). This vignette illustrates the merging workflow on two sample data sets called df1 and df2.
+
+\section{Example 1}
+df1 has address information in columns gross\_strasse and hausnr.
+The columns Var1 and Var2 provide non-address related information about
+the individuals. Row 1183 shows that the column hausnr needs to be split
+into house number and additional letter before addresses can be matched.
+The function split\_number is provided for that task.
+
+<<>>=
+df1[1180:(1183+6),]
+@
+
+split\_number takes hausnr and creates a data.frame with columns "Hausnummer"
+(house number) and "Hausnummernzusatz" (additional letter).
+
+<<>>=
+df1 <- cbind(df1, split_number(df1$hausnr))
+df1[1180:(1183+6),]
+@
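+
+To make the splitting step concrete, the following is a minimal base R sketch of how such a
+split could work. The helper split\_number\_sketch is hypothetical and is not the package's
+implementation; it assumes that the house number is the leading run of digits and that anything
+after it is the additional letter.
+
+<<>>=
+# Hypothetical helper for illustration only; not part of KOR.addrlink.
+split_number_sketch <- function(x) {
+  x <- trimws(as.character(x))
+  # leading digits form the house number (NA if the string has none)
+  num <- suppressWarnings(as.integer(sub("^([0-9]+).*$", "\\1", x)))
+  # whatever follows the leading digits is treated as the additional letter
+  suffix <- trimws(sub("^[0-9]+", "", x))
+  suffix[suffix == ""] <- NA_character_
+  data.frame(Hausnummer = num, Hausnummernzusatz = suffix,
+             stringsAsFactors = FALSE)
+}
+split_number_sketch(c("12", "12a", "7 B"))
+@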
+
+addrlink merges the two data sets. For both data sets, the columns referring
+to street name, house number and additional letter need to be specified
+in exactly that order (parameters col\_ref and col\_match).
+
+<<>>=
+# column hausnr is no longer needed
+df1 <- within(df1, rm(hausnr))
+df1_matched <- addrlink(df_ref = Adressen,
+ col_ref = c("STRNAME", "HNR", "HNRZ"),
+ df_match = df1,
+ col_match = c("gross_strasse", "Hausnummer", "Hausnummernzusatz"))
+@
+
+The result is a list with two data.frames:
+\begin{itemize}
+\item ret: The merged data set
+\item QA: Indicators showing the match quality
+\end{itemize}
+
+<<>>=
+head(df1_matched$ret)
+table(df1_matched$QA$qAddress)
+@
+
+qAddress states the stage within the matching procedure that yielded the match.
+Out of the 10000 records, 9670 could be merged directly. 72 had a valid street
+name, but an invalid house number. 157 records had (possibly) misspelled street
+names and 101 records could not be matched at all.
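+
+The shares behind these counts can be read from the same table. The chunk below is only a
+sketch using base R's prop.table on the objects created above; rounding to one decimal place
+is an arbitrary choice.
+
+<<>>=
+# relative frequency (in percent) of each matching stage
+round(100 * prop.table(table(df1_matched$QA$qAddress)), 1)
+@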
+
+\section{Example 2}
+
+The second data set has a single column "Adresse", which includes street names
+and house numbers. Thus, this column needs to be split by the function
+split\_address.
+
+<<>>=
+head(within(df2, Adresse <- trimws(Adresse)))
+@
+
+split\_address creates a data.frame with columns "Strasse" (street), "Hausnummer"
+(house number) and "Hausnummernzusatz" (additional letter) from the column
+"Adresse".
+
+<<>>=
+df2 <- cbind(df2, split_address(df2$Adresse))
+within(df2, Adresse <- trimws(Adresse))[23:(23+6),]
+@
+
+Again, addrlink merges the two data sets. The parameter fuzzy\_threshold
+sets the threshold for fuzzy matching of misspelled street names. A value
+of 1 means no fuzzy matching and 0 means forced fuzzy matches for all records.
+If a street name could be matched, but the provided house number does not exist, addrlink
+may randomly assign a valid house number to that record. A seed is always set
+to ensure reproducibility. Customization is possible via the parameter seed.
+
+<<>>=
+# column Adresse is no longer needed
+df2 <- within(df2, rm(Adresse))
+df2_matched <- addrlink(df_ref = Adressen,
+ col_ref = c("STRNAME", "HNR", "HNRZ"),
+ df_match = df2,
+ col_match = c("Strasse", "Hausnummer", "Hausnummernzusatz"),
+ fuzzy_threshold = .9, seed = 1234)
+@
+
+<<>>=
+head(df2_matched$ret)
+table(df2_matched$QA$qAddress)
+@
+
+49 records had invalid house numbers and one record was matched by
+fuzzy matching. This record can be inspected in detail.
+
+<<>>=
+id <- which(df2_matched$QA$qAddress == 3)
+df2_matched$ret[id,]
+df2_matched$QA[id,]
+@
+
+In this case the fuzzy matching procedure was most likely correct
+(St.-Georg-Str. matched SANKT-GEORG-STRA{\ss}E).
+
+The number of cases with correct street name and randomly assigned house
+numbers is 10.
+
+<<>>=
+sum(df2_matched$QA$qscore == 0)
+@
+
+
+\end{document}