Details

Statistical Data Cleaning with Applications in R


1st ed.

by: Mark van der Loo, Edwin de Jonge

CHF 59.00

Publisher: Wiley
Format: PDF
Published: 29.01.2018
ISBN/EAN: 9781118897140
Language: English
Number of pages: 320

DRM-protected eBook; to read it you need, for example, Adobe Digital Editions and an Adobe ID.

Descriptions

<p><b>A comprehensive guide to automated statistical data cleaning</b></p> <p>The production of clean data is a complex and time-consuming process that requires both technical know-how and statistical expertise. <i>Statistical Data Cleaning with Applications in R</i> brings together a wide range of techniques for cleaning textual, numeric, or categorical data. The book examines technical data cleaning methods relating to data representation and data structure. A prominent role is given to statistical data validation, data cleaning based on predefined restrictions, and data cleaning strategy.</p> <p><i>Key features:</i></p> <ul> <li>Focuses on the automation of data cleaning methods, including both theory and applications written in R.</li> <li>Enables the reader to design data cleaning processes for either one-off analytical purposes or for setting up production systems that clean data on a regular basis.</li> <li>Explores statistical techniques for solving issues such as incompleteness, contradictions, and outliers, as well as the integration of data cleaning components and quality monitoring.</li> <li>Supported by an accompanying website featuring data and R code.</li> </ul> <p>This book enables data scientists and statistical analysts to deepen their understanding of data cleaning and to upgrade their practical data cleaning skills. It can also be used as material for courses in data cleaning and data analysis.</p>
<p>Foreword xi</p> <p>About the Companion Website xiii</p> <p><b>1 Data Cleaning 1</b></p> <p>1.1 The Statistical Value Chain 1</p> <p>1.1.1 Raw Data 2</p> <p>1.1.2 Input Data 2</p> <p>1.1.3 Valid Data 3</p> <p>1.1.4 Statistics 3</p> <p>1.1.5 Output 3</p> <p>1.2 Notation and Conventions Used in this Book 3</p> <p><b>2 A Brief Introduction to R 5</b></p> <p>2.1 R on the Command Line 5</p> <p>2.1.1 Getting Help and Learning R 6</p> <p>2.2 Vectors 7</p> <p>2.2.1 Computing with Vectors 9</p> <p>2.2.2 Arrays and Matrices 10</p> <p>2.3 Data Frames 11</p> <p>2.3.1 The Formula-Data Interface 12</p> <p>2.3.2 Selecting Rows and Columns; Boolean Operators 12</p> <p>2.3.3 Selection with Indices 13</p> <p>2.3.4 Data Frame Manipulation: The dplyr Package 14</p> <p>2.4 Special Values 15</p> <p>2.4.1 Missing Values 17</p> <p>2.5 Getting Data into and out of R 18</p> <p>2.5.1 File Paths in R 19</p> <p>2.5.2 Formats Provided by Packages 20</p> <p>2.5.3 Reading Data from a Database 20</p> <p>2.5.4 Working with Data External to R 21</p> <p>2.6 Functions 21</p> <p>2.6.1 Using Functions 22</p> <p>2.6.2 Writing Functions 22</p> <p>2.7 Packages Used in this Book 23</p> <p><b>3 Technical Representation of Data 27</b></p> <p>3.1 Numeric Data 28</p> <p>3.1.1 Integers 28</p> <p>3.1.2 Integers in R 30</p> <p>3.1.3 Real Numbers 31</p> <p>3.1.4 Double Precision Numbers 31</p> <p>3.1.5 The Concept of Machine Precision 33</p> <p>3.1.6 Consequences of Working with Floating Point Numbers 34</p> <p>3.1.7 Dealing with the Consequences 35</p> <p>3.1.8 Numeric Data in R 37</p> <p>3.2 Text Data 38</p> <p>3.2.1 Terminology and Encodings 38</p> <p>3.2.2 Unicode 39</p> <p>3.2.3 Some Popular Encodings 40</p> <p>3.2.4 Textual Data in R: Objects of Class Character 43</p> <p>3.2.5 Encoding in R 44</p> <p>3.2.6 Reading and Writing of Data with Non-Local Encoding 46</p> <p>3.2.7 Detecting Encoding 48</p> <p>3.2.8 Collation and Sorting 49</p> <p>3.3 Times and Dates 50</p> <p>3.3.1 AIT, UTC, and POSIX Seconds Since the Epoch 50</p>
<p>3.3.2 Time and Date Notation 52</p> <p>3.3.3 Time and Date Storage in R 54</p> <p>3.3.4 Time and Date Conversion in R 55</p> <p>3.3.5 Leap Days, Time Zones, and Daylight Saving Times 57</p> <p>3.4 Notes on Locale Settings 58</p> <p><b>4 Data Structure 61</b></p> <p>4.1 Introduction 61</p> <p>4.2 Tabular Data 61</p> <p>4.2.1 data.frame 61</p> <p>4.2.2 Databases 62</p> <p>4.2.3 dplyr 64</p> <p>4.3 Matrix Data 65</p> <p>4.4 Time Series 66</p> <p>4.5 Graph Data 68</p> <p>4.6 Web Data 69</p> <p>4.6.1 Web Scraping 69</p> <p>4.6.2 Web API 70</p> <p>4.7 Other Data 72</p> <p>4.8 Tidying Tabular Data 72</p> <p>4.8.1 Variable Per Column 74</p> <p>4.8.2 Single Observation Stored in Multiple Tables 75</p> <p><b>5 Cleaning Text Data 77</b></p> <p>5.1 Character Normalization 78</p> <p>5.1.1 Encoding Conversion and Unicode Normalization 78</p> <p>5.1.2 Character Conversion and Transliteration 80</p> <p>5.2 Pattern Matching with Regular Expressions 81</p> <p>5.2.1 Basic Regular Expressions 82</p> <p>5.2.2 Practical Regular Expressions 85</p> <p>5.2.3 Generating Regular Expressions in R 92</p> <p>5.3 Common String Processing Tasks in R 93</p> <p>5.4 Approximate Text Matching 98</p> <p>5.4.1 String Metrics 100</p> <p>5.4.2 String Metrics and Approximate Text Matching in R 109</p> <p><b>6 Data Validation 119</b></p> <p>6.1 Introduction 119</p> <p>6.2 A First Look at the validate Package 120</p> <p>6.2.1 Quick Checks with check_that 120</p> <p>6.2.2 The Basic Workflow: validator and confront 122</p> <p>6.2.3 A Little Background on validate and DSLs 124</p> <p>6.3 Defining Data Validation 125</p> <p>6.3.1 Formal Definition of Data Validation 126</p> <p>6.3.2 Operations on Validation Functions 128</p> <p>6.3.3 Validation and Missing Values 130</p> <p>6.3.4 Structure of Validation Functions 131</p> <p>6.3.5 Demarcating Validation Rules in validate 132</p> <p>6.4 A Formal Typology of Data Validation Functions 134</p> <p>6.4.1 A Closer Look at Measurement 134</p>
<p>6.4.2 Classification of Validation Rules 135</p> <p>6.5 Validating Data with the validate Package 137</p> <p>6.5.1 Validation Rules in the Console and the validator Object 137</p> <p>6.5.2 Validating in the Pipeline 139</p> <p>6.5.3 Raising Errors or Warnings 140</p> <p>6.5.4 Tolerance for Testing Linear Equalities 140</p> <p>6.5.5 Setting and Resetting Options 141</p> <p>6.5.6 Importing and Exporting Validation Rules from and to File 142</p> <p>6.5.7 Checking Variable Types and Metadata 145</p> <p>6.5.8 Checking Value Ranges and Code Lists 146</p> <p>6.5.9 Checking In-Record Consistency Rules 146</p> <p>6.5.10 Checking Cross-Record Validation Rules 148</p> <p>6.5.11 Checking Functional Dependencies 149</p> <p>6.5.12 Cross-Dataset Validation 150</p> <p>6.5.13 Macros, Variable Groups, Keys 152</p> <p>6.5.14 Analyzing Output: validation Objects 152</p> <p>6.5.15 Output Dimensionality and Output Selection 155</p> <p><b>7 Localizing Errors in Data Records 157</b></p> <p>7.1 Error Localization 157</p> <p>7.2 Error Localization with R 160</p> <p>7.2.1 The Errorlocate Package 160</p> <p>7.3 Error Localization as MIP-Problem 163</p> <p>7.3.1 Error Localization and Mixed-Integer Programming 163</p> <p>7.3.2 Linear Restrictions 164</p> <p>7.3.3 Categorical Restrictions 165</p> <p>7.3.4 Mixed-Type Restrictions 167</p> <p>7.4 Numerical Stability Issues 170</p> <p>7.4.1 A Short Overview of MIP Solving 170</p> <p>7.4.2 Scaling Numerical Records 172</p> <p>7.4.3 Setting Numerical Threshold Values 173</p> <p>7.5 Practical Issues 174</p> <p>7.5.1 Setting Reliability Weights 174</p> <p>7.5.2 Simplifying Conditional Validation Rules 176</p> <p>7.6 Conclusion 180</p> <p><b>8 Rule Set Maintenance and Simplification 183</b></p> <p>8.1 Quality of Validation Rules 183</p> <p>8.1.1 Completeness 183</p> <p>8.1.2 Superfluous Rules and Infeasibility 184</p> <p>8.2 Rules in the Language of Logic 184</p> <p>8.2.1 Using Logic to Rewrite Rules 185</p> <p>8.3 Rule Set Issues 186</p> <p>8.3.1 Infeasible Rule Set 186</p>
<p>8.3.2 Fixed Value 187</p> <p>8.3.3 Redundant Rule 188</p> <p>8.3.4 Nonrelaxing Clause 189</p> <p>8.3.5 Nonconstraining Clause 189</p> <p>8.4 Detection and Simplification Procedure 190</p> <p>8.4.1 Mixed-Integer Programming 190</p> <p>8.4.2 Detecting Feasibility 191</p> <p>8.4.3 Finding Rules Causing Infeasibility 191</p> <p>8.4.4 Detecting Conflicting Rules 191</p> <p>8.4.5 Detect Partial Infeasibility 192</p> <p>8.4.6 Detect Fixed Values 192</p> <p>8.4.7 Detect Nonrelaxing Clauses 192</p> <p>8.4.8 Detecting Nonconstraining Clauses 193</p> <p>8.4.9 Detecting Redundant Rules 193</p> <p>8.5 Conclusion 194</p> <p><b>9 Methods Based on Models for Domain Knowledge 195</b></p> <p>9.1 Correction with Data Modifying Rules 195</p> <p>9.1.1 Modifying Functions 196</p> <p>9.1.2 A Class of Modifying Functions on Numerical Data 201</p> <p>9.2 Rule-Based Correction with dcmodify 205</p> <p>9.2.1 Reading Rules from File 206</p> <p>9.2.2 Modifying Rule Syntax 207</p> <p>9.2.3 Missing Values 208</p> <p>9.2.4 Sequential and Sequence-Independent Execution 208</p> <p>9.2.5 Options Settings Management 209</p> <p>9.3 Deductive Correction 209</p> <p>9.3.1 Correcting Typing Errors in Numeric Data 209</p> <p>9.3.2 Deductive Imputation Using Linear Restrictions 213</p> <p><b>10 Imputation and Adjustment 219</b></p> <p>10.1 Missing Data 219</p> <p>10.1.1 Missing Data Mechanisms 219</p> <p>10.1.2 Visualizing and Testing for Patterns in Missing Data Using R 220</p> <p>10.2 Model-Based Imputation 224</p> <p>10.3 Model-Based Imputation in R 226</p> <p>10.3.1 Specifying Imputation Methods with simputation 226</p> <p>10.3.2 Linear Regression-Based Imputation 227</p> <p>10.3.3 M-Estimation 230</p> <p>10.3.4 Lasso, Ridge, and Elasticnet Regression 231</p> <p>10.3.5 Classification and Regression Trees 232</p> <p>10.3.6 Random Forest 235</p> <p>10.4 Donor Imputation with R 236</p> <p>10.4.1 Random and Sequential Hot Deck Imputation 237</p> <p>10.4.2 k Nearest Neighbors and Predictive Mean Matching 238</p>
<p>10.5 Other Methods in the simputation Package 239</p> <p>10.6 Imputation Based on the EM Algorithm 240</p> <p>10.6.1 The EM Algorithm 241</p> <p>10.6.2 EM Imputation Assuming the Multivariate Normal Distribution 243</p> <p>10.7 Sampling Variance under Imputation 244</p> <p>10.8 Multiple Imputations 246</p> <p>10.8.1 Multiple Imputation Based on the EM Algorithm 248</p> <p>10.8.2 The Amelia Package 249</p> <p>10.8.3 Multivariate Imputation with Chained Equations (Mice) 252</p> <p>10.8.4 Imputation with the mice Package 254</p> <p>10.9 Analytic Approaches to Estimate Variance of Imputation 256</p> <p>10.9.1 Imputation as Part of the Estimator 256</p> <p>10.10 Choosing an Imputation Method 257</p> <p>10.11 Constraint Value Adjustment 259</p> <p>10.11.1 Formal Description 259</p> <p>10.11.2 Application to Imputed Data 262</p> <p>10.11.3 Adjusting Imputed Values with the rspa Package 263</p> <p><b>11 Example: A Small Data-Cleaning System 265</b></p> <p>11.1 Setup 266</p> <p>11.1.1 Deterministic Methods 266</p> <p>11.1.2 Error Localization 269</p> <p>11.1.3 Imputation 269</p> <p>11.1.4 Adjusting Imputed Data 271</p> <p>11.2 Monitoring Changes in Data 273</p> <p>11.2.1 Data Diff (Daff) 274</p> <p>11.2.2 Summarizing Cell Changes 275</p> <p>11.2.3 Summarizing Changes in Conformance to Validation Rules 277</p> <p>11.2.4 Track Changes in Data Automatically with lumberjack 278</p> <p>11.3 Integration and Automation 282</p> <p>11.3.1 Using RScript 283</p> <p>11.3.2 The docopt Package 283</p> <p>11.3.3 Automated Data Cleaning 285</p> <p>References 287</p> <p>Index 297</p>
<p><strong>Mark van der Loo and Edwin de Jonge,</strong> Department of Statistical Methods, Statistics Netherlands, The Netherlands</p>

You might also be interested in these products:

Quantifiers in Action
by: Antonio Badia
PDF ebook
CHF 118.00
Managing and Mining Uncertain Data
by: Charu C. Aggarwal
PDF ebook
CHF 118.00