
Exploring newspaper language : using the web to create and investigate a large corpus of modern Norwegian

This book describes new methodological and technological approaches to corpus building and presents recent research based on the Norwegian Newspaper Corpus. This is a large monitor corpus of contemporary Norwegian language, compiled through daily harvesting of web newspapers. The book gives an overv...


Bibliographic Details
Classification: Electronic Book
Other Authors: Andersen, Gisle
Format: Electronic eBook
Language: English
Published: Amsterdam ; Philadelphia : John Benjamins Pub. Co., 2012.
Series: Studies in corpus linguistics ; v. 49.
Subjects:
Online Access: Full text
Table of Contents:
  • Exploring Newspaper Language; Editorial page; Title page; LCC data; Table of contents; Building a large corpus based on newspapers from the web; 1. Introduction; 2. An overview of the Norwegian Newspaper Corpus and its system architecture; 2.1 Text harvesting; 2.2 Boilerplate and duplicate removal; 2.3 Language classification; 2.4 Text annotation; 2.4.1 Annotation of source, date and author information; 2.4.2 Topic classification; 2.4.3 Part-of-speech tagging; 2.5 Search system and user interface; 2.5.1 Corpus WorkBench; 2.5.2 Corpuscle; 2.6 Extraction of new words.
  • 2.7 Classification of new words; 2.7.1 Anglicism detection; 2.8 Frequency profiling and lexical database entry; 2.9 Identification of multiword expressions; 3. The content of the research contributions to this book; 4. Concluding remarks; References.
  • Part II. Exploiting the web as a corpus: Methods and tools; Corpuscle: a new corpus management platform for annotated corpora; 1. Introduction; 2. Design principles; 3. Querying the corpus; 4. API and Web interface; 4.1 The API; 4.2 The Web interface; 5. Editing and manual annotation; 6. Evaluation and concluding remarks; References; OBT+stat.
  • 1. Introduction; 2. Background; 2.1 The history of the Oslo-Bergen Tagger; 2.2 State of the art for Norwegian POS taggers; 3. The architecture of the Oslo-Bergen Constraint Grammar Tagger; 4. Methodology of improvements to the Oslo-Bergen Tagger; 5. Dealing with left-over ambiguities in the Oslo-Bergen Tagger; 5.1 Morphological ambiguities; 5.2 Lemma ambiguities; 6. Statistical disambiguation; 7. Modelling challenges and engineering concerns; 8. Evaluation of the statistical module; 8.1 How to evaluate; 8.2 Evaluation results; 9. Conclusion; References.
  • Exploring corpora through syntactic annotation; 1. Introduction; 2. Treebanking; 3. INESS: the Norwegian treebanking infrastructure; 4. Searching for complex syntactic constructions in a treebank; 4.1 Passive constructions; 4.2 Relative clauses; 5. Conclusion; References; Collocations and statistical analysis of n-grams; 1. Introduction; 2. Background; 2.1 Multiword Expressions (MWEs); 2.2 Collocations; 3. Methodology; 3.1 Data and n-gram extraction; 3.2 Post-processing of n-gram lists; 3.3 Contingency tables; 3.3.1 Bigram Contingency Tables; 3.3.2 Trigram Contingency Tables.
  • 3.4 Bigram Association Measures; 3.5 Trigram Association Measures; 4. Results; 4.1 Bigrams; 4.2 Trigrams; 5. Conclusion and Future Work; References; Automatic topic classification of a large newspaper corpus; 1. Introduction; 2. Background and related work; 2.1 The rule-based approach; 2.2 The pattern-matching approach; 2.3 Promising results; 3. Material; 3.1 Manual annotation; 3.2 Feature extraction; 3.3 Cleaning the text; 3.4 The gold standard; 4. Overview of our final approach; 5. Our approach in detail; 5.1 Hypothesis; 5.2 Defining categories; 5.3 Tools; 5.4 Programming and experimenting.