Practical Data Science with R, Second Edition
Practical Data Science with R, Second Edition is a task-based tutorial that leads readers through dozens of useful data analysis practices using the R language. By concentrating on the most important tasks you'll face on the job, this friendly guide is comfortable both for business analysts an...
Main Authors: | Nina Zumel, John Mount |
---|---|
Format: | Electronic eBook |
Language: | English |
Published: | Manning Publications, 2019 |
Edition: | 2nd edition |
Online Access: | Full text (requires prior registration with an institutional email address) |
Table of Contents:
- Intro
- Practical Data Science with R, Second Edition
- Nina Zumel and John Mount
- Copyright
- Dedication
- Brief Table of Contents
- Table of Contents
- Praise for the First Edition
- front matter
- Foreword
- Preface
- Acknowledgments
- About This Book
- What is data science?
- Roadmap
- Audience
- What is not in this book?
- Code conventions and downloads
- Working with this book
- Downloading the book's supporting materials/repository
- Book forum
- About the Authors
- About the Foreword Authors
- About the Cover Illustration
- Part 1. Introduction to data science
- Chapter 1. The data science process
- 1.1. The roles in a data science project
- 1.1.1. Project roles
- 1.2. Stages of a data science project
- 1.2.1. Defining the goal
- 1.2.2. Data collection and management
- 1.2.3. Modeling
- 1.2.4. Model evaluation and critique
- 1.2.5. Presentation and documentation
- 1.2.6. Model deployment and maintenance
- 1.3. Setting expectations
- 1.3.1. Determining lower bounds on model performance
- Summary
- Chapter 2. Starting with R and data
- 2.1. Starting with R
- 2.1.1. Installing R, tools, and examples
- 2.1.2. R programming
- 2.2. Working with data from files
- 2.2.1. Working with well-structured data from files or URLs
- 2.2.2. Using R with less-structured data
- 2.3. Working with relational databases
- 2.3.1. A production-size example
- Summary
- Chapter 3. Exploring data
- 3.1. Using summary statistics to spot problems
- 3.1.1. Typical problems revealed by data summaries
- 3.2. Spotting problems using graphics and visualization
- 3.2.1. Visually checking distributions for a single variable
- 3.2.2. Visually checking relationships between two variables
- Summary
- Chapter 4. Managing data
- 4.1. Cleaning data
- 4.1.1. Domain-specific data cleaning
- 4.1.2. Treating missing values
- 4.1.3. The vtreat package for automatically treating missing variables
- 4.2. Data transformations
- 4.2.1. Normalization
- 4.2.2. Centering and scaling
- 4.2.3. Log transformations for skewed and wide distributions
- 4.3. Sampling for modeling and validation
- 4.3.1. Test and training splits
- 4.3.2. Creating a sample group column
- 4.3.3. Record grouping
- 4.3.4. Data provenance
- Summary
- Chapter 5. Data engineering and data shaping
- 5.1. Data selection
- 5.1.1. Subsetting rows and columns
- 5.1.2. Removing records with incomplete data
- 5.1.3. Ordering rows
- 5.2. Basic data transforms
- 5.2.1. Adding new columns
- 5.2.2. Other simple operations
- 5.3. Aggregating transforms
- 5.3.1. Combining many rows into summary rows
- 5.4. Multitable data transforms
- 5.4.1. Combining two or more ordered data frames quickly
- 5.4.2. Principal methods to combine data from multiple tables
- 5.5. Reshaping transforms
- 5.5.1. Moving data from wide to tall form
- 5.5.2. Moving data from tall to wide form
- 5.5.3. Data coordinates
- Summary
- Part 2. Modeling methods
- Chapter 6. Choosing and evaluating models
- 6.1. Mapping problems to machine learning tasks
- 6.1.1. Classification problems
- 6.1.2. Scoring problems
- 6.1.3. Grouping: working without known targets
- 6.1.4. Problem-to-method mapping
- 6.2. Evaluating models
- 6.2.1. Overfitting
- 6.2.2. Measures of model performance
- 6.2.3. Evaluating classification models
- 6.2.4. Evaluating scoring models
- 6.2.5. Evaluating probability models
- 6.3. Local interpretable model-agnostic explanations (LIME) for explaining model predictions
- 6.3.1. LIME: Automated sanity checking
- 6.3.2. Walking through LIME: A small example
- 6.3.3. LIME for text classification
- 6.3.4. Training the text classifier
- 6.3.5. Explaining the classifier's predictions
- Summary
- Chapter 7. Linear and logistic regression
- 7.1. Using linear regression
- 7.1.1. Understanding linear regression
- 7.1.2. Building a linear regression model
- 7.1.3. Making predictions
- 7.1.4. Finding relations and extracting advice
- 7.1.5. Reading the model summary and characterizing coefficient quality
- 7.1.6. Linear regression takeaways
- 7.2. Using logistic regression
- 7.2.1. Understanding logistic regression
- 7.2.2. Building a logistic regression model
- 7.2.3. Making predictions
- 7.2.4. Finding relations and extracting advice from logistic models
- 7.2.5. Reading the model summary and characterizing coefficients
- 7.2.6. Logistic regression takeaways
- 7.3. Regularization
- 7.3.1. An example of quasi-separation
- 7.3.2. The types of regularized regression
- 7.3.3. Regularized regression with glmnet
- Summary
- Chapter 8. Advanced data preparation
- 8.1. The purpose of the vtreat package
- 8.2. KDD and KDD Cup 2009
- 8.2.1. Getting started with KDD Cup 2009 data
- 8.2.2. The bull-in-the-china-shop approach
- 8.3. Basic data preparation for classification
- 8.3.1. The variable score frame
- 8.4. Advanced data preparation for classification
- 8.4.1. Using mkCrossFrameCExperiment()
- 8.4.2. Building a model
- Building a multivariable model
- Evaluating the model
- 8.5. Preparing data for regression modeling
- 8.6. Mastering the vtreat package
- 8.6.1. The vtreat phases
- 8.6.2. Missing values
- 8.6.3. Indicator variables
- 8.6.4. Impact coding
- 8.6.5. The treatment plan
- 8.6.6. The cross-frame
- Summary
- Chapter 9. Unsupervised methods
- 9.1. Cluster analysis
- 9.1.1. Distances
- 9.1.2. Preparing the data
- 9.1.3. Hierarchical clustering with hclust
- 9.1.4. The k-means algorithm
- 9.1.5. Assigning new points to clusters
- 9.1.6. Clustering takeaways
- 9.2. Association rules
- 9.2.1. Overview of association rules
- 9.2.2. The example problem
- 9.2.3. Mining association rules with the arules package
- 9.2.4. Association rule takeaways
- Summary
- Chapter 10. Exploring advanced methods
- 10.1. Tree-based methods
- 10.1.1. A basic decision tree
- 10.1.2. Using bagging to improve prediction
- 10.1.3. Using random forests to further improve prediction
- 10.1.4. Gradient-boosted trees
- 10.1.5. Tree-based model takeaways
- 10.2. Using generalized additive models (GAMs) to learn non-monotone relationships
- 10.2.1. Understanding GAMs
- 10.2.2. A one-dimensional regression example
- 10.2.3. Extracting the non-linear relationships
- 10.2.4. Using GAM on actual data
- 10.2.5. Using GAM for logistic regression
- 10.2.6. GAM takeaways
- 10.3. Solving "inseparable" problems using support vector machines
- 10.3.1. Using an SVM to solve a problem
- 10.3.2. Understanding support vector machines
- 10.3.3. Understanding kernel functions
- 10.3.4. Support vector machine and kernel methods takeaways
- Summary
- Part 3. Working in the real world
- Chapter 11. Documentation and deployment
- 11.1. Predicting buzz
- 11.2. Using R markdown to produce milestone documentation
- 11.2.1. What is R markdown?
- 11.2.2. knitr technical details
- 11.2.3. Using knitr to document the Buzz data and produce the model
- 11.3. Using comments and version control for running documentation
- 11.3.1. Writing effective comments
- 11.3.2. Using version control to record history
- 11.3.3. Using version control to explore your project
- 11.3.4. Using version control to share work
- 11.4. Deploying models
- 11.4.1. Deploying demonstrations using Shiny
- 11.4.2. Deploying models as HTTP services
- 11.4.3. Deploying models by export
- 11.4.4. What to take away
- Summary
- Chapter 12. Producing effective presentations
- 12.1. Presenting your results to the project sponsor
- 12.1.1. Summarizing the project's goals
- 12.1.2. Stating the project's results
- 12.1.3. Filling in the details
- 12.1.4. Making recommendations and discussing future work
- 12.1.5. Project sponsor presentation takeaways
- 12.2. Presenting your model to end users
- 12.2.1. Summarizing the project goals
- 12.2.2. Showing how the model fits user workflow
- 12.2.3. Showing how to use the model
- 12.2.4. End user presentation takeaways
- 12.3. Presenting your work to other data scientists
- 12.3.1. Introducing the problem
- 12.3.2. Discussing related work
- 12.3.3. Discussing your approach
- 12.3.4. Discussing results and future work
- 12.3.5. Peer presentation takeaways
- Summary
- Appendix A. Starting with R and other tools
- A.1. Installing the tools
- A.1.1. Installing tools
- A.1.2. The R package system
- A.1.3. Installing Git
- A.1.4. Installing RStudio
- A.1.5. R resources
- A.2. Starting with R
- A.2.1. Primary features of R
- A.2.2. Primary R data types
- A.3. Using databases with R
- A.3.1. Running database queries using a query generator
- A.3.2. How to think relationally about data
- A.4. The takeaway
- Appendix B. Important statistical concepts
- B.1. Distributions
- B.1.1. Normal distribution
- B.1.2. Summarizing R's distribution naming conventions
- B.1.3. Lognormal distribution
- B.1.4. Binomial distribution
- B.1.5. More R tools for distributions
- B.2. Statistical theory
- B.2.1. Statistical philosophy
- B.2.2. A/B tests
- B.2.3. Power of tests
- B.2.4. Specialized statistical tests
- B.3. Examples of the statistical view of data
- B.3.1. Sampling bias
- B.3.2. Omitted variable bias
- B.4. The takeaway
- Appendix C. Bibliography
- Practical Data Science with R
- Index
- List of Figures
- List of Tables
- List of Listings