The workshop is most suitable for staff and postgraduate students, but all are welcome.
An Eventbrite page to sign up is available here.
The purpose of this workshop is to introduce researchers and interested parties to two key aspects of data preparation. A common problem when starting work on large-scale text processing is that the text can be noisy, hard to analyse, and difficult to structure in a machine-readable way.
In this workshop we will cover two common examples of problematic texts: crawled or downloaded web documents composed in HTML, and (poorly) OCR'd texts taken from a historical corpus. These examples will introduce participants to the tools and methods used in web scraping and data wrangling.
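As a flavour of the kind of cleaning involved, the sketch below shows two illustrative steps using only the Python standard library: stripping markup from an HTML document, and repairing a couple of common OCR artefacts. The specific fixes (the long s "ſ" and hyphenation across line breaks) are chosen as examples and are not necessarily what the workshop will use.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML document, ignoring script/style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Strip tags and collapse the whitespace left behind by markup."""
    parser = TextExtractor()
    parser.feed(html)
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def clean_ocr(text: str) -> str:
    """Apply two example OCR fixes: the long s, and words split across lines."""
    text = text.replace("ſ", "s")          # long s misread in older typefaces
    text = re.sub(r"-\s*\n\s*", "", text)  # re-join hyphenated line breaks
    return re.sub(r"[ \t]+", " ", text)


page = "<html><body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"
print(html_to_text(page))            # → Hello world
print(clean_ocr("claſ-\nſic text"))  # → classic text
```

Real projects typically reach for dedicated libraries rather than hand-rolled parsers, but the underlying steps (extract text, then normalise it) are the same.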
The workshop will comprise a presentation and a semi-practical session: the presentation will introduce the key problems and approaches, and the practical session will walk through an illustrative example solution.
This workshop is not intended as a complete tutorial on how to prepare data; rather, it serves as an introduction, giving participants the knowledge of the available tools needed to begin working on these problems themselves.
If you have any questions about this workshop, please contact the convenor, Ben Roberts.