Workshops and Seminars

Historical Text Mining

Organized by Paul Rayson (Lancaster University) and Dawn Archer (University of Central Lancashire) (20-21 July 2006).

Programme and papers (pdf, html)
Participants (html)
Abstracts (html)
Workshop report (pdf, html)
Workshop resources (html)

One of the central intentions of the workshop was to establish a network of scholars from the fields of text mining and E-Science; corpus development and annotation; and historical linguistics, dialectology and computational linguistics. A discussion of the effective text mining of historical data was felt to be long overdue, especially in view of the rapid growth in (historical) digital resources (e.g. Open Content Alliance, Google Print, Early English Books Online). The workshop aimed to better define the relationship between the text mining/E-Science community, who are often involved in applying basic techniques to large-scale datasets, and the corpus linguistic community, who tend to apply data-driven linguistic analysis and annotation techniques to relatively small datasets.

The workshop's aims were:

to raise awareness of the various techniques utilized and/or tools developed by researchers working within the various fields
to make scholars who work with historical data aware of existing text mining techniques that are applicable to their research needs
to familiarize such scholars with the use of these techniques and tools, by means of a series of tutorial sessions (e.g. GATE, WordSmith, VARD, VIEW, Wmatrix)
to investigate the problems of applying some 'modern' large-scale corpus annotation and analysis techniques to historical data
to encourage and enable a roundtable discussion, with the ultimate aim of determining what needs to be done to improve historical text mining and (importantly) identifying possible future workshops and collaborative projects

One of the tools demonstrated, the VARD (Variant Detector), presently 'matches' spelling variants to their 'normalized' equivalents using a search and replace script and a list of terms. This is being extended so that variants may be detected and 'normalized' automatically, via fuzzy matching procedures. The VARD will enable historical linguists to undertake an empirical exploration of variation across four centuries (16th-19th), but its usefulness is not limited to the (historical) lexicographer. Indeed, the VARD will facilitate annotation of, and text retrieval from, previously unseen pre-20th century corpora, and thus is of potential benefit to the historian, the English scholar, and researchers interested in (historical) dialectology. VARD techniques will be applicable to detecting variants in, for example, the Scottish Corpus of Texts and Speech (SCOTS) and the Newcastle Electronic Corpus of Tyneside English (NECTE).
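The two-stage approach described above (a list-based search and replace, followed by fuzzy matching for unlisted variants) can be illustrated with a minimal Python sketch. This is not the VARD's actual implementation: the variant list, the modern lexicon, the function names and the similarity cutoff are all hypothetical, and the fuzzy step uses the standard library's difflib as a stand-in for whatever matching procedure the VARD adopts.

```python
from difflib import get_close_matches

# Hypothetical hand-built list of historical variants and their
# normalized equivalents (stage 1: plain search and replace).
KNOWN_VARIANTS = {
    "haue": "have",
    "loue": "love",
    "vertue": "virtue",
}

# Hypothetical modern word list used as the target of fuzzy matching.
MODERN_LEXICON = ["always", "have", "king", "love", "question", "virtue"]


def normalize_token(token, cutoff=0.8):
    """Return a normalized form of token, or token itself if none is found."""
    lower = token.lower()
    # Stage 1: exact lookup in the variant list.
    if lower in KNOWN_VARIANTS:
        return KNOWN_VARIANTS[lower]
    # Words already in the modern lexicon need no change.
    if lower in MODERN_LEXICON:
        return token
    # Stage 2: fuzzy match against the lexicon (assumed similarity cutoff).
    candidates = get_close_matches(lower, MODERN_LEXICON, n=1, cutoff=cutoff)
    return candidates[0] if candidates else token


def normalize_text(text, cutoff=0.8):
    """Normalize each whitespace-separated token in text."""
    return " ".join(normalize_token(t, cutoff) for t in text.split())


if __name__ == "__main__":
    print(normalize_text("haue alwayes vertue"))
```

Here 'haue' and 'vertue' are resolved by the list, while 'alwayes' is not listed and is resolved by the fuzzy step. A lower cutoff catches more variants at the cost of more false matches, which is one reason an approach of this kind would need checking against real historical data.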

The tutorial sessions made use of licensed and freely available material, including: the Lancaster Newsbook Corpus (1640-1661); Nameless Shakespeare; the Lampeter Corpus of English Tracts (1640-1760); Corpus del Español (1200s-1900s); and the EEBO-TCP collection, which contains structured SGML/XML text editions for a significant portion of the Short Title Catalogue of Early English books published between 1473 and 1700.

AHDS Methods Taxonomy Terms

This item has been catalogued using a discipline and methods taxonomy.

Disciplines

Linguistics
History
English Literature and Languages
European Literature and Languages
Non-European Literature and Languages

Methods

Data Analysis - Collating
Data Analysis - Collocating
Data Analysis - Concording/Indexing
Data Analysis - Content analysis
Data Analysis - Data mining
Data Analysis - Searching/querying
Data Analysis - Parsing
Data Analysis - Stemmatics/cladistics
Data Analysis - Stylometrics
Data publishing and dissemination - Textual collaborative publishing
Data publishing and dissemination - Textual resource sharing
Data Structuring and enhancement - Coding/standardisation
Data Structuring and enhancement - Lemmatisation
Data Structuring and enhancement - Markup/text encoding - descriptive - conceptual
Data Structuring and enhancement - Markup/text encoding - descriptive - document structure
Data Structuring and enhancement - Markup/text encoding - descriptive - linguistic structure
Data Structuring and enhancement - Markup/text encoding - descriptive - nominal
Data Structuring and enhancement - Markup/text encoding - presentational
Data Structuring and enhancement - Markup/text encoding - referential

AHRC Methods Network
