
Modern Scala projects : leverage the power of Scala for building data-driven and high-performant projects /

Scala is a multipurpose programming language, especially for analyzing large datasets without impacting the application performance. Its functional libraries can interact with databases and build scalable frameworks that create robust data pipelines. This book showcases how you can use Scala and its...


Bibliographic Details
Classification: Electronic Book
Main author: Gurusamy, Ilango (Author)
Format: Electronic eBook
Language: English
Published: Birmingham, UK : Packt Publishing, 2018.
Subjects:
Online access: Full text
Table of Contents:
  • Cover; Title Page; Copyright and Credits; Packt Upsell; Contributors; Table of Contents; Preface
  • Chapter 1: Predict the Class of a Flower from the Iris Dataset; A multivariate classification problem; Understanding multivariate; Different kinds of variables; Categorical variables; Fischer's Iris dataset; The Iris dataset represents a multiclass, multidimensional classification task; The training dataset; The mapping function; An algorithm and its mapping function; Supervised learning - how it relates to the Iris classification task; Random Forest classification algorithm; Project overview - problem formulation; Getting started with Spark; Setting up prerequisite software; Installing Spark in standalone deploy mode; Developing a simple interactive data analysis utility; Reading a data file and deriving DataFrame out of it; Implementing the Iris pipeline; Iris pipeline implementation objectives; Step 1 - getting the Iris dataset from the UCI Machine Learning Repository; Step 2 - preliminary EDA; Firing up Spark shell; Loading the iris.csv file and building a DataFrame; Calculating statistics; Inspecting your SparkConf again; Calculating statistics again; Step 3 - creating an SBT project; Step 4 - creating Scala files in SBT project; Step 5 - preprocessing, data transformation, and DataFrame creation; DataFrame Creation; Step 6 - creating, training, and testing data; Step 7 - creating a Random Forest classifier; Step 8 - training the Random Forest classifier; Step 9 - applying the Random Forest classifier to test data; Step 10 - evaluate Random Forest classifier; Step 11 - running the pipeline as an SBT application; Step 12 - packaging the application; Step 13 - submitting the pipeline application to Spark local; Summary; Questions
  • Chapter 2: Build a Breast Cancer Prognosis Pipeline with the Power of Spark and Scala; Breast cancer classification problem; Breast cancer dataset at a glance; Logistic regression algorithm; Salient characteristics of LR; Binary logistic regression assumptions; A fictitious dataset and LR; LR as opposed to linear regression; Formulation of a linear regression classification model; Logit function as a mathematical equation; LR function; Getting started; Setting up prerequisite software; Implementation objectives; Implementation objective 1 - getting the breast cancer dataset; Implementation objective 2 - deriving a dataframe for EDA; Step 1 - conducting preliminary EDA; Step 2 - loading data and converting it to an RDD[String]; Step 3 - splitting the resilient distributed dataset and reorganizing individual rows into an array; Step 4 - purging the dataset of rows containing question mark characters; Step 5 - running a count after purging the dataset of rows with questionable characters; Step 6 - getting rid of header; Step 7 - creating a two-column DataFrame; Step 8 - creating the final DataFrame; Random Forest breast cancer pipeline