Exploring Textual Data
Springer Science & Business Media, 31 Dec. 1997 - 247 pages
Researchers in a number of disciplines deal with large text sets requiring both text management and text analysis. Faced with a large amount of textual data collected in marketing surveys, literary investigations, historical archives and documentary databases, these researchers require assistance with organizing, describing and comparing texts. Exploring Textual Data demonstrates how exploratory multivariate statistical methods such as correspondence analysis and cluster analysis can be used to help investigate, assimilate and evaluate textual data. The main text does not contain any strictly mathematical demonstrations, making it accessible to a large audience; proofs are collected in the appendices. Definitions of concepts, implementations of procedures, and rules for reading and interpreting results are presented in full. A succession of examples is intended to allow the reader to appreciate the variety of actual and potential applications and the complementary processing methods. A glossary of terms is provided.
Table of contents
TEXTUAL STATISTICS: SCOPE AND APPLICATIONS
1.1.1 The linguistic viewpoint
1.1.2 Content analysis
1.2.1 Pioneering works
A STATISTICIAN'S VIEWPOINT
1.3.2 Internal and external information metadata
RESPONSES TO OPEN QUESTIONS
a research tool
1.4.2 Manual postcoding of free responses
groups of responses
THE UNITS OF TEXTUAL STATISTICS
2.1.1 Computerized text
2.1.3 Lemmatized analyses
2.1.4 Semantically based approaches
2.1.5 Brief comparison with other languages
2.2 SEGMENTATION AND NUMERIC CODING OF TEXT
2.2.1 Numeric coding of Life corpus
2.2.2 Corpus P
2.3.2 Zipf's law
2.4 LEXICOMETRIC DOCUMENTS
2.4.1 Index of a corpus
2.4.3 Vocabulary growth
2.4.4 Lexical tables
2.5.1 Sentences sequences
2.5.2 Repeated segments table
2.6 FINDING CO-OCCURRENCES, QUASI-SEGMENTS
2.6.2 Finding multiple co-occurrences, quasi-segments
2.7.2 Comparison of main quantitative characteristics
CORRESPONDENCE ANALYSIS OF LEXICAL TABLES
3.1 BASIC PRINCIPLES OF MULTIVARIATE DESCRIPTIVE METHODS
3.2 CORRESPONDENCE ANALYSIS
3.2.3 Validity of the representation
3.2.4 Active and supplementary variables
3.2.5 A comparison with principal components analysis
3.3 MULTIPLE CORRESPONDENCE ANALYSIS
3.3.1 Basic structure of a survey sample
3.3.2 Validity of the representation
3.3.3 Positioning of supplementary variables
CLUSTER ANALYSIS OF WORDS AND TEXTS
4.1 REVIEW OF HIERARCHICAL CLUSTER ANALYSIS
4.1.1 The dendrogram
4.1.2 Cutting the dendrogram
4.1.3 Appending supplementary elements
4.1.4 Filtering on first principal axes
4.2.1 Cluster analysis of words
4.2.2 Cluster analysis of texts
4.2.3 Notes on cluster analysis of words
4.3 CLUSTER ANALYSIS OF SURVEY DATA SETS
4.3.1 Mixed clustering algorithms
4.3.2 Sequence of operations in survey analysis
working demographic partition
VISUALIZATION OF TEXTUAL DATA
5.1 CORRESPONDENCE ANALYSIS OF LEXICAL TABLES
5.1.2 Aggregated lexical tables
5.1.3 Frequency threshold for words
5.1.5 Construction of aggregated lexical and segmental table
5.1.6 Analysis and interpretation of lexical tables
5.1.7 Illustration of displays using repeated segments
5.2 WORKING DEMOGRAPHIC PARTITIONS
5.3 DIRECT ANALYSIS OF RESPONSES OR DOCUMENTS
5.3.1 How are distances interpreted?
5.3.2 Analysis of sparse matrix T
5.3.3 Application example
CHARACTERISTIC TEXTUAL UNITS, MODAL RESPONSES AND MODAL TEXTS
6.1 CHARACTERISTIC ELEMENTS
6.1.2 List of characteristic units
6.2 MODAL RESPONSES
6.2.1 Selection of modal responses using characteristic elements
6.2.2 Selection of modal responses using chi-square distances
6.2.3 Implementation and examples
LONGITUDINAL PARTITIONS, TEXTUAL TIME SERIES
7.1.1 Longitudinal partitioning example
7.1.2 Analysis of age category gradation
7.1.3 Adjacent characteristic elements
7.2 TEXTUAL TIME SERIES
7.2.2 Chronological characteristic elements
7.2.3 Characteristic increments
7.2.4 Parallel analysis of a lemmatized corpus
TEXTUAL DISCRIMINANT ANALYSIS
8.1 TWO MAJOR AREAS OF CONCERN IN TEXTUAL ANALYSIS
information retrieval coding validation
8.2 UNITS AND INDICES OF STYLOMETRY
8.2.1 Function words speech parts
8.2.2 Richness of vocabulary
AN EXAMPLE
8.3.2 Available data for attribution problems
8.3.3 Other approaches to the problem
8.4 GLOBAL DISCRIMINANT ANALYSIS
8.4.1 General principles
8.4.2 Units for global discriminant analysis
8.4.4 Discriminant analysis regularized through preliminary correspondence analysis
8.5 GLOBAL DISCRIMINATION AND VALIDATION
8.5.2 Vocabulary and analysis for Tokyo
8.5.3 Reality of patterns
8.5.4 Discriminant analysis and confusion matrices
8.5.5 Conclusions to section 8.5
Singular value decomposition and correspondence analysis
Clustering techniques
More details about the nonparametric estimation model
Search for repeated segments in a corpus
Glossary
References
Subject Index
Common terms and phrases
age categories agglomerated algorithm applications associated axes methods axis basic calculated chapter characteristic elements characteristic increments characteristic words chi-square distance closed-end question cluster analysis column-points contingency table coordinates corpus correspondence analysis counts criterion cross-tabulations defined dendrogram diagnostics displays distinct words educational level eigenvalues example figure frequency distribution function words gender graphical forms groups hierarchical cluster individuals information retrieval interpretation learning sample Lebart lemmatization lexical profiles lexical tables lexicometric linguistics matrix methods modal responses multiple correspondence analysis number of occurrences number of words numeric coding obtained open-ended question parameters partition percentages plane points position possible principal axes principal components analysis relative repeated segments representation row-point rows and columns selected singular value decomposition space stylometry supplementary elements survey techniques test-values textual data textual statistics textual units Tokyo total number variables variance vector vocabulary YAKIZAKANA