ELMCIP Tags
I can see why Controlled Vocabularies are so popular…
This week, I looked at the tags used within the Electronic Literature Knowledge Base (ELMCIP) in the hopes of getting a baseline understanding of how individual e-lit works are categorized within a digital collection. Visitors to ELMCIP can view tags by their frequency (in a list of 20 terms at a time), alphabetically (in an HTML table), or as part of a work’s record (as comma-separated strings within a table of contents). To see the entire body of tags would require clicking through hundreds of pages, so I decided to put my meager web-scraping skills to the test (despite hating the term “web-scraping” - it sounds deeply unpleasant!).
Cleaning
I made CSV spreadsheets of each version of the tag list (frequency, alphabetically, and record tags) - my goal was to have a list of tag names, tag ID numbers (which could be gleaned by the URL of each tag), and frequency of use. However, I noticed a lot of duplicate values showing up: tags that had the exact same name and ID number. At first, I thought it was something I was doing wrong (this was my first time web-scraping - hooray for internet tutorials!), or but after manually looking through the first ten pages of tags on the website, I saw that some tags were listed multiple times. I also noticed that the list of alphabetical tags didn’t match up entirely with the frequency tags: my guess is that frequency tags have to have been used at least once to be listed, and so the extra tags on the alphabetical list were those that had not been used before.
In any case, after several hours spent consolidating the alphabetical list with the frequency list and removing the duplications, I was able to narrow down the single list to tags from over 11,000 terms to over 8,000 terms.*
* (It occurs to me that I could also turn the tags listed with the works records into a usable dataset, and determine the actual frequency with which each tag is used… but that would take much, much more time.)
Observations
Just by looking at the frequency list, the most commonly used tags appear to be those that describe what the resource is: hypertext, poetry, Flash, narrative, audio, etc. This makes sense - users want to know what to expect before they click on something. The next most frequently used terms are still concerned with the form of the work, but these include adjectives that create their categories: digital poetry, visual poetry, e-poetry. It’s hard to tell if these are distinct categories or synonyms. The tags gradually become more specific, alluding to the content’s creator, country of origin, language, and other metadata attributes. Interestingly, there doesn’t seem to be many as tags describing what the resource is about.
Along with the tags, each work entry on ELMCIP has the following metadata: Author, Year, Appears in (Collections), Publisher, URL(s), Language, Publication Type, Platform/Software, Record Status (Internal descriptor), a Description of the resource, Technical Notes, and Critical Writing that references the work. These fields make some of the tags redundant, which begs the question: who creates the tags in the first place? (The author? ELMCIP? Viewers?)
While tags are useful for informally describing things, they don’t work so well in terms of gathering together similar materials. Misspellings, translations, and variations on existing terms scatter related works among dozens of tags; to cover all the bases, one would have to add all possible variations to a work’s description, making for a bloated list of tags that ultimately doesn’t tell the user anything. Instead, it would be useful to establish a Controlled Vocabulary so that the super specific tags and numerous variations would relate to a broader category, making similar works easier to find. (But that is a story for another day!)