Issues in Using E-Texts
- the digitization of the text itself. This can be simply scanned page images, which are not searchable. It can also include running the scanned pages through Optical Character Recognition (OCR) software, which converts the words on each page into machine-readable text so that they are searchable.
- the building of structural systems to allow retrieval and navigation. Metadata is information about the content and nature of the electronic document; it is added to a database and is invisible to the end user. Tagging is also often performed on the data. Tagging is structural metadata: it defines the structure of a document, or the pieces of a document (such as paragraphs, images, chapters, notes, table of contents, tables, title page, index, etc.).
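As a rough illustration of why tagging helps retrieval, the sketch below searches a small tagged document one structural piece at a time. The element names (`text`, `titlePage`, `chapter`, `p`, `note`), the sample content, and the helper `search_by_tag` are all invented for this example and do not follow any particular markup standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element names and content are illustrative only.
SAMPLE = """<text>
  <titlePage>An Example Novel</titlePage>
  <chapter n="1">
    <p>It was a dark night.</p>
    <note>Compare the opening of the first edition.</note>
  </chapter>
  <chapter n="2">
    <p>The night had passed.</p>
  </chapter>
</text>"""


def search_by_tag(xml_source, tag, query):
    """Return the text of every <tag> element containing the query string.

    Matching is a case-insensitive substring test, not a whole-word match.
    """
    root = ET.fromstring(xml_source)
    hits = []
    for element in root.iter(tag):
        content = "".join(element.itertext()).strip()
        if query.lower() in content.lower():
            hits.append(content)
    return hits


if __name__ == "__main__":
    # Restricting the search to paragraphs skips the editorial note.
    print(search_by_tag(SAMPLE, "p", "night"))
    print(search_by_tag(SAMPLE, "note", "edition"))
```

Because the structure is explicit, a search can be limited to the body text, the notes, or the title page, which is not possible with untagged OCR output.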
- A database of page images, plus a search interface. Some metadata will be present, so that searching the citation, but not the full text, is possible.
- The above content, plus metadata, plus "dirty ASCII". Basically what is included is page images plus searchable text. OCR has been performed on the page images, so that full-text searching is available.
- A database of texts that have been keyed in by hand, resulting in a high degree of accuracy. Often, these keyed-in texts have also been marked up ("tagged") with SGML or XML, so that searching is quite efficient.
- Keyed-in texts that are also critical editions, and that include some critical apparatus. These have also been marked up, for better searching.
- The above database, plus page images. Example: Le Roman de la Rose.
- A database of primary sources plus secondary literature. Literary texts are included, in the context of critical material. Sometimes, the secondary material can be quite extensive, as in the online archive projects. Examples: The Rossetti Archive, Princeton Dante Project, Uncle Tom's Cabin.
Depending on the nature of the e-text collection, you will be able to search on different things:
- only the metadata, if only page images are present. That is, the record or citation for the text, including title, author, publication information, and maybe subjects.
- usually the full text of the works, if OCR has been performed on the page images
- the critical context as well as the texts themselves, if the database also includes secondary literature
- a related issue is the ability to search across texts. Sometimes search capability is limited to one text at a time; sometimes the ability to search across multiple texts is present.
Search interfaces vary widely, and may not be present at all.
- in the absence of a search interface, the browser's Find in Page function can be used. Depending on the display, this could limit searching to one page at a time, with no ability to search an entire work.
- search engines may consist of only a single search box, with no field searching capability
- advanced search options will often provide a means to search individual fields and/or to combine fields
- other limits can sometimes be present in sophisticated search engines: the ability to limit by date, format, language, etc.
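The difference between a single search box and field searching with limits can be sketched as follows. This is a minimal illustration rather than a description of any real product: the `Record` class, the three catalogue records, and both search functions are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Record:
    title: str
    author: str
    year: int
    language: str


# Hypothetical catalogue records, invented for illustration.
RECORDS = [
    Record("The Rose Garden", "Ann Lee", 1850, "English"),
    Record("Letters from Florence", "Ben Rose", 1900, "Italian"),
    Record("Rose and Thorn", "Cara Lind", 1920, "English"),
]


def simple_search(records, query):
    """Single search box: the query may match any text field."""
    q = query.lower()
    return [
        r for r in records
        if any(q in field.lower() for field in (r.title, r.author, r.language))
    ]


def advanced_search(records, title=None, author=None, language=None,
                    year_from=None, year_to=None):
    """Field search: every criterion that is supplied must match (logical AND)."""
    results = []
    for r in records:
        if title is not None and title.lower() not in r.title.lower():
            continue
        if author is not None and author.lower() not in r.author.lower():
            continue
        if language is not None and language.lower() != r.language.lower():
            continue
        if year_from is not None and r.year < year_from:
            continue
        if year_to is not None and r.year > year_to:
            continue
        results.append(r)
    return results


if __name__ == "__main__":
    # The single box cannot tell a title word from an author's name.
    print([r.title for r in simple_search(RECORDS, "rose")])
    # Field search plus a date limit narrows the result.
    print([r.title for r in advanced_search(RECORDS, title="rose", year_from=1900)])
```

With the single box, a search for "rose" also returns the work whose author happens to be named Rose; the field search restricts the match to titles and can combine it with a date or language limit.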
- word searches, to find occurrences, frequency, and contexts of individual words or phrases
- create concordances
- trace a theme or motif across a large corpus of texts, or an individual writer's works
- find particular passages or quotations in large bodies of texts
- make connections between two or more words, concepts, or themes
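Two of the uses listed above, counting word occurrences and building a concordance, can be sketched in a few lines once a plain-text version of a work is available. The sample sentence, the function names, and the bracketed keyword-in-context layout are assumptions made for this example; the tokenizer is deliberately crude (it splits on anything that is not a letter, so "don't" becomes two tokens).

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase runs of letters (Unicode letters included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def word_frequencies(text):
    """Count how often each word occurs in the text."""
    return Counter(tokenize(text))


def concordance(text, keyword, width=3):
    """Return keyword-in-context lines.

    Each line shows up to `width` words on either side of one occurrence
    of the keyword, with the keyword itself in square brackets.
    """
    tokens = tokenize(text)
    keyword = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


if __name__ == "__main__":
    # Invented sample text, not taken from any real work.
    sample = "The rose is red. A rose by the gate grew tall, and the gate was old."
    print(word_frequencies(sample).most_common(3))
    for line in concordance(sample, "rose", width=2):
        print(line)
```

The same approach extends to a whole corpus by running it over each text in turn, provided the collection allows the full text to be exported or searched across works.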
- changes in the format and materiality of texts result in changes in their appropriation. How e-texts are read and used may be quite different from how their original texts, in codex form, are read and used.
- e-texts can be reconfigured, reformed, copied, moved, rewritten, and manipulated in ways no written text ever could be.
- there are many interfaces to contend with in viewing and searching e-texts, and these interfaces change often.
- e-texts on the open Internet come and go, appear and disappear, or at the very least, change their "location".
- which editions are used as the basis of an e-text. Are the best critical editions chosen? Are the easiest editions (those out of copyright) used? Are only translations used, without access to the original language? Which translations are used?
- e-texts are most often created by commercial publishers, who think in terms of a product to market, not in terms of research needs or methodologies.
- perhaps the essential problem with current access to e-texts is their disparate nature. They exist freely all over the Internet, or as very expensive commercial products, or as individual hobby-type projects, or as semi-professional online archives, or as highly sophisticated digital archive projects undertaken by research institutions with major grant funding. Quality control and consistency are almost non-existent.