Mastering Spark with R : the complete guide to large-scale analysis and modeling /
"Authors Javier Luraschi, Kevin Kuo, and Edgar Ruiz show you how to combine R with Spark to analyze data at scale. This book covers relevant data science topics, cluster computing, and issues that will interest even the most advanced users."--Back cover
Clasificación: | Libro Electrónico |
---|---|
Autores principales: | , , |
Formato: | Electrónico eBook |
Idioma: | Inglés |
Publicado: |
Sebastopol, CA :
O'Reilly Media,
[2019]
|
Edición: | First edition. |
Temas: | |
Acceso en línea: | Texto completo (Requiere registro previo con correo institucional) |
Tabla de Contenidos:
- Intro; Copyright; Table of Contents; Foreword; Preface; Formatting; Acknowledgments; Conventions Used in This Book; Using Code Examples; O'Reilly Online Learning; How to Contact Us; Chapter 1. Introduction; Overview; Hadoop; Spark; R; sparklyr; Recap; Chapter 2. Getting Started; Overview; Prerequisites; Installing sparklyr; Installing Spark; Connecting; Using Spark; Web Interface; Analysis; Modeling; Data; Extensions; Distributed R; Streaming; Logs; Disconnecting; Using RStudio; Resources; Recap; Chapter 3. Analysis; Overview; Import; Wrangle; Built-in Functions; Correlations; Visualize
- Using ggplot2Using dbplot; Model; Caching; Communicate; Recap; Chapter 4. Modeling; Overview; Exploratory Data Analysis; Feature Engineering; Supervised Learning; Generalized Linear Regression; Other Models; Unsupervised Learning; Data Preparation; Topic Modeling; Recap; Chapter 5. Pipelines; Overview; Creation; Use Cases; Hyperparameter Tuning; Operating Modes; Interoperability; Deployment; Batch Scoring; Real-Time Scoring; Recap; Chapter 6. Clusters; Overview; On-Premises; Managers; Distributions; Cloud; Amazon; Databricks; Google; IBM; Microsoft; Qubole; Kubernetes; Tools; RStudio; Jupyter
- LivyRecap; Chapter 7. Connections; Overview; Edge Nodes; Spark Home; Local; Standalone; YARN; YARN Client; YARN Cluster; Livy; Mesos; Kubernetes; Cloud; Batches; Tools; Multiple Connections; Troubleshooting; Logging; Spark Submit; Windows; Recap; Chapter 8. Data; Overview; Reading Data; Paths; Schema; Memory; Columns; Writing Data; Copying Data; File Formats; CSV; JSON; Parquet; Others; File Systems; Storage Systems; Hive; Cassandra; JDBC; Recap; Chapter 9. Tuning; Overview; Graph; Timeline; Configuring; Connect Settings; Submit Settings; Runtime Settings; sparklyr Settings; Partitioning
- Implicit PartitionsExplicit Partitions; Caching; Checkpointing; Memory; Shuffling; Serialization; Configuration Files; Recap; Chapter 10. Extensions; Overview; H2O; Graphs; XGBoost; Deep Learning; Genomics; Spatial; Troubleshooting; Recap; Chapter 11. Distributed R; Overview; Use Cases; Custom Parsers; Partitioned Modeling; Grid Search; Web APIs; Simulations; Partitions; Grouping; Columns; Context; Functions; Packages; Cluster Requirements; Installing R; Apache Arrow; Troubleshooting; Worker Logs; Resolving Timeouts; Inspecting Partitions; Debugging Workers; Recap; Chapter 12. Streaming
- OverviewTransformations; Analysis; Modeling; Pipelines; Distributed R; Kafka; Shiny; Recap; Chapter 13. Contributing; Overview; The Spark API; Spark Extensions; Using Scala Code; Recap; Appendix A. Supplemental Code References; Preface; Formatting; Chapter 1; The World's Capacity to Store Information; Daily Downloads of CRAN Packages; Chapter 2; Prerequisites; Chapter 3; Hive Functions; Chapter 4; MLlib Functions; Chapter 6; Google Trends for On-Premises (Mainframes), Cloud Computing, and Kubernetes; Chapter 12; Stream Generator; Installing Kafka; Index; About the Authors; Colophon