Retrieving Data from Text Archives
Text archives are repositories of literary and non-literary texts that have usually been scanned from the original sources and can be retrieved from the internet in a variety of formats or encodings. Issues related to these different formats will be discussed a little later on in this course. Many, if not most, of the texts available there are free of copyright or available under academic licences, but before you publish anything based on such a text, it is advisable to check for any potential copyright or licensing issues.
Perhaps the two most important text archives are the Project Gutenberg website and the Oxford Text Archive, and we will concentrate here on downloading texts from them. If you want to learn about further archives or text repositories, you can consult this presentation page from an earlier course. For some initial practice, let’s download a text from the Project Gutenberg website and have a look at it.
- Open the Project Gutenberg site by clicking on the link given above.
- Click the Search! button under the heading “Quick Search” on the right-hand side without typing in an author or title. The online book catalogue should now open.
- Take a look at the different options available for finding particular books. Then think of an author whose book(s) you may want to analyse and click on the initial letter of that author’s name.
- Scroll down the list until you find the name of the author, or press Ctrl+F on your keyboard to use the browser’s find function and search for the name. Once you’ve found the author you were looking for, select one of the titles and click on the appropriate link. You will now be redirected to a new page with a listing for the book you selected.
- Take some time to look at the information provided on this page, especially with regard to copyright, and then look at the table at the bottom listing all the different formats available for download. You’ll notice that there may be a variety of formats available, but the most useful for our purposes will usually be “Plain Text” with “none” under the heading “Compression”.
- Find the link to “main site” under “Download Links” and click on it with the right mouse button. From the context menu that opens, select “Save Link As ...” and save the file to your home folder, possibly changing the name to something more descriptive than the suggested file name.
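If you later want to fetch several texts without clicking through the site each time, the final "Save Link As ..." step can also be scripted. The following is only a minimal sketch in Python, using the standard library: the function name `save_text` is made up for illustration, and the address you pass in is assumed to be one you copied yourself from the “Plain Text” download link of a book page.

```python
import shutil
import urllib.request
from pathlib import Path


def save_text(url, target):
    """Download whatever `url` points to and store it in the file `target`.

    `url` is assumed to be the address copied from a download link;
    `target` is the file name to save under. Returns the target path.
    """
    target = Path(target)
    # Copy the raw bytes unchanged, so the original encoding is preserved.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Hypothetical usage (replace the placeholder with the copied link address):
# save_text("<address copied from the download link>", Path.home() / "my_book.txt")
```

Saving the bytes exactly as they arrive, rather than decoding them first, mirrors what the browser does and leaves any decisions about encodings for later.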
Once you have downloaded a text file from the website, you can open it in a suitable text editor and have a look at the contents. If you’re working in a Windows environment, the editor associated with plain text (.txt) files will usually be Notepad. Under Linux or any other operating system, one editor will usually be associated with such files by default, too, and will open if you double-click on the file name in your file browser. We will explore different ways of preparing or using these files for our analysis later on.
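As an alternative to opening the whole file in an editor, you can take a quick look at just its first few lines. The sketch below is one possible way of doing this in Python; the function name `preview` is hypothetical, and the file is assumed to be encoded as UTF-8, which may not hold for every download (bytes that cannot be decoded are shown as replacement characters rather than stopping the program).

```python
from itertools import islice


def preview(path, n_lines=20, encoding="utf-8"):
    """Return the first `n_lines` lines of a text file as a list of strings.

    The encoding is an assumption; pass a different one if the text
    looks garbled.
    """
    with open(path, encoding=encoding, errors="replace") as handle:
        # islice stops after n_lines, so large files are not read completely.
        return [line.rstrip("\n") for line in islice(handle, n_lines)]


# Hypothetical usage with a file saved earlier:
# for line in preview("my_book.txt", n_lines=10):
#     print(line)
```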