Applied text analysis with Python : enabling language-aware data products with machine learning
From news and speeches to informal chatter on social media, natural language is one of the richest and most underutilized sources of data. Not only does it come in a constant stream, always changing and adapting in context; it also contains information that is not conveyed by traditional data source...
Classification: | Electronic Book |
---|---|
Main authors: | , , |
Format: | Electronic eBook |
Language: | English |
Published: | Sebastopol, CA : O'Reilly Media, [2018] |
Edition: | First edition. |
Subjects: | |
Online access: | Full text (requires prior registration with institutional email) |
Table of Contents:
- 1. Language and computation
- 2. Building a custom corpus
- 3. Corpus preprocessing and wrangling
- 4. Text vectorization and transformation pipelines
- 5. Classification for text analysis
- 6. Clustering for text similarity
- 7. Context-aware text analysis
- 8. Text visualization
- 9. Graph analysis of text
- 10. Chatbots
- 11. Scaling text analytics with multiprocessing and Spark
- 12. Deep learning and beyond.
- Cover; Copyright; Table of Contents; Preface; Computational Challenges of Natural Language; Linguistic Data: Tokens and Words; Enter Machine Learning; Tools for Text Analysis; What to Expect from This Book; Who This Book Is For; Code Examples and GitHub Repository; Conventions Used in This Book; Using Code Examples; O'Reilly Safari; How to Contact Us; Acknowledgments; Chapter 1. Language and Computation; The Data Science Paradigm; Language-Aware Data Products; The Data Product Pipeline; Language as Data; A Computational Model of Language; Language Features; Contextual Features.
- Structural Features; Conclusion; Chapter 2. Building a Custom Corpus; What Is a Corpus?; Domain-Specific Corpora; The Baleen Ingestion Engine; Corpus Data Management; Corpus Disk Structure; Corpus Readers; Streaming Data Access with NLTK; Reading an HTML Corpus; Reading a Corpus from a Database; Conclusion; Chapter 3. Corpus Preprocessing and Wrangling; Breaking Down Documents; Identifying and Extracting Core Content; Deconstructing Documents into Paragraphs; Segmentation: Breaking Out Sentences; Tokenization: Identifying Individual Tokens; Part-of-Speech Tagging; Intermediate Corpus Analytics.
- Corpus Transformation; Intermediate Preprocessing and Storage; Reading the Processed Corpus; Conclusion; Chapter 4. Text Vectorization and Transformation Pipelines; Words in Space; Frequency Vectors; One-Hot Encoding; Term Frequency-Inverse Document Frequency; Distributed Representation; The Scikit-Learn API; The BaseEstimator Interface; Extending TransformerMixin; Pipelines; Pipeline Basics; Grid Search for Hyperparameter Optimization; Enriching Feature Extraction with Feature Unions; Conclusion; Chapter 5. Classification for Text Analysis; Text Classification.
- Identifying Classification Problems; Classifier Models; Building a Text Classification Application; Cross-Validation; Model Construction; Model Evaluation; Model Operationalization; Conclusion; Chapter 6. Clustering for Text Similarity; Unsupervised Learning on Text; Clustering by Document Similarity; Distance Metrics; Partitive Clustering; Hierarchical Clustering; Modeling Document Topics; Latent Dirichlet Allocation; Latent Semantic Analysis; Non-Negative Matrix Factorization; Conclusion; Chapter 7. Context-Aware Text Analysis; Grammar-Based Feature Extraction; Context-Free Grammars.
- Syntactic Parsers; Extracting Keyphrases; Extracting Entities; n-Gram Feature Extraction; An n-Gram-Aware CorpusReader; Choosing the Right n-Gram Window; Significant Collocations; n-Gram Language Models; Frequency and Conditional Frequency; Estimating Maximum Likelihood; Unknown Words: Back-off and Smoothing; Language Generation; Conclusion; Chapter 8. Text Visualization; Visualizing Feature Space; Visual Feature Analysis; Guided Feature Engineering; Model Diagnostics; Visualizing Clusters; Visualizing Classes; Diagnosing Classification Error; Visual Steering; Silhouette Scores and Elbow Curves.