The JSTOR Understanding series is a research tool from JSTOR Labs that connects primary texts with journal articles and book chapters on JSTOR that cite those texts. Building on a previous project, Understanding Shakespeare, this beta release of the Understanding series expands the scope of the tool to include ten key works of British literature, the King James Bible, and all Shakespeare plays and sonnets.
Pick a work by browsing to it or searching for it by author or title.
Read the work. As you do, you’ll notice a number next to each passage. That number is the count of articles and chapters in JSTOR that quote that specific passage.
Click on the passage you want to study. You’ll see a new window with relevant articles and chapters.
Research. Look through the articles and chapters quoting your passage.
You can filter the results you’re seeing by date, content type or whether or not you have access to the documents.
If you can’t tell from an article’s title and metadata whether it is relevant, you can preview how it discusses the passage by looking at the snippets or page previews.
When you find one you want to read, click the title to go to the article page (where the quoted passages will be highlighted in the text), download the PDF, or save it to MyWorkspace.
Who can use the JSTOR Understanding series?
The JSTOR Understanding series is a free resource that is open to the public. All the works are free to read. Passages in the works are matched with scholarly articles and academic book chapters that quote the passage. Each matched article or chapter title is linked to the full-text version, which is available to people at JSTOR participating institutions or with individual access to JSTOR. Many articles in the Understanding series are also available for online reading with a free MyJSTOR account. The site is expected to be especially useful to novice researchers performing their first academic research on literary works. It may also be useful to upper-level undergraduate and graduate students studying these works, as well as educators teaching them, scholars writing about them, and even actors preparing their line readings.
How did you choose texts to include? Where did they come from?
Currently, the works included in the Understanding Series are limited to texts in the public domain. Because we are using public domain texts, we are able to make them freely available within the JSTOR platform.
The Understanding Series is an expansion of an earlier site, Understanding Shakespeare, developed in collaboration with the Folger Shakespeare Library. All the texts for the Shakespeare Collection are Folger Digital Texts, which are freely available for all non-commercial uses. One of the most frequent responses we heard to Understanding Shakespeare was, “That’s wonderful! Can you do it for this text?” Of the many requests we heard, the King James Bible rose to the top. For our King James Bible Collection, we chose to work with texts from Wikisource. Last, we worked with the journal Studies in English Literature to identify the texts to include in the British Literature Collection, the texts for which also come either from Wikisource or from Project Gutenberg.
What does it mean that the Understanding series is BETA?
The JSTOR Understanding series is fully functional and ready for use. The “beta” label is intended to convey that the tool is still evolving. Adding a beta label means we can incorporate feedback from teachers, students and researchers as we develop it. Based on that feedback, the tool is likely to change before we remove the “beta” label. If you have any recommendations for how to make this series more valuable to you or your students, or if you find an issue or bug for us to address, please let us know at firstname.lastname@example.org
How did you create the links between the works and the scholarly articles? Is it possible that this methodology might miss some quoted passages?
The links between the works and the articles and chapters on JSTOR were created with a two-step process. First, for each text, a candidate set of articles and chapters was selected, usually by performing a full-text search on JSTOR for the work’s title, author, and main characters. Second, we performed fuzzy text matching between the text and any words appearing either in block quotes or between single or double quotation marks. Fuzzy text matching is a technique that looks not only for exact matches but also for near-matches; this allows us to overcome spelling variants as well as OCR mistakes in JSTOR’s page-scan content. Either of these steps may lead to missed quoted passages. The first step could cause us to miss an article that quotes a text without mentioning the work’s title, author or characters. The second step could cause us to miss an article that quotes an alternate version of the text (for example, a Shakespeare passage that does not appear in the Folger digital text). If you find an article or chapter that should be included but isn’t, or one that is included but shouldn’t be, please let us know so we can keep refining our approach!
I want to geek out. Can you go into more detail about how you created the links?
You asked for it. We began by creating a database of all the quotations within JSTOR: we extracted every section of text that appears within single quotes, within double quotes, or in a block quote. At last count, we have over 340 million quotations in JSTOR.
To find which of these quotations come from a specific work, we first create a subset of JSTOR articles to select from (we do this to conserve processor time). The candidate articles were selected from the JSTOR corpus using full-text queries on the JSTOR search index. The queries included keywords designed to find articles with references to specific texts, and the keywords were purposely broad in order to identify as many candidate articles as possible. For example, the full-text query used for Macbeth was:
shakespeare* AND macbeth
The query is based on the premise that an article quoting a Shakespeare play will include the word “Shakespeare” (including the possessive form) and the primary word(s) from the play title. Any article that quotes the play without satisfying this condition is not included in the text analysis.
Once we have a selection of candidate articles for each work, we compare the text of the work to the quotations in those articles and chapters. Both block and inline quotes are considered in the matching process. Block quotes are identified by using OCR word/line coordinates to find text passages indented from the surrounding text. Inline quotes are identified as text bounded by quotation-mark characters (“ or ‘). The identified quotes and the work text are normalized by removing all punctuation (including line breaks) and rejoining any words split by line-break hyphenation. All text is then converted to lowercase. Once the text is normalized, candidate matches are found using a fuzzy text matching process based on the Levenshtein distance measure.
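The inline-quote extraction and normalization steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production code: the function names are hypothetical, only double quotation marks are handled (single quotes are ambiguous with apostrophes), line breaks are treated as spaces, and block-quote detection from OCR coordinates is left out.

```python
import re
import string

# Assumption: inline quotes are delimited by straight or curly double quotes.
INLINE_QUOTE = re.compile('[\u201c"]([^\u201c\u201d"]+)[\u201d"]')

# ASCII punctuation plus the common typographic quotation marks.
PUNCTUATION = string.punctuation + '\u201c\u201d\u2018\u2019'


def extract_inline_quotes(text):
    """Return the text found between pairs of double quotation marks."""
    return [m.group(1) for m in INLINE_QUOTE.finditer(text)]


def normalize(text):
    """Rejoin hyphenated line breaks, strip punctuation, lowercase."""
    # Rejoin words split across lines: "can-\ndle" becomes "candle".
    text = re.sub(r'(\w)-\s*\n\s*(\w)', r'\1\2', text)
    # Assumption: remaining line breaks separate words, so use a space.
    text = text.replace('\n', ' ')
    # Remove all punctuation characters.
    text = text.translate(str.maketrans('', '', PUNCTUATION))
    # Lowercase and collapse runs of whitespace.
    return ' '.join(text.lower().split())
```

With these helpers, a quote such as "Out, out, brief can-" followed by "dle!" on the next line normalizes to the same string as the unbroken line in the work text, which is what makes the later comparison meaningful.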
Levenshtein distance is a measure of how different two texts are: it counts the minimum number of operations (the removal or insertion of a single character, or the substitution of one character for another) required to transform one text into the other. Using this approach we find, for each quote, the substring of the work text with the smallest Levenshtein edit distance. For the best match, a similarity score is computed from the ratio of the Levenshtein distance to the length of the quoted text.
In the case of Macbeth, for example, a total of 14,581 articles were analyzed. A total of 88,987 separate quotes were matched in these articles. After applying filters to the match candidates, we ended up with a total of 6,071 matches that met our filtering thresholds. These matches were found in 1,155 articles.
The thresholds for filtering candidate matches include a similarity score (calculated from the Levenshtein edit distance and the match length) and the match size. This approach works well for larger quotes but tends to include a good number of false hits on smaller passages (generally of 15-20 characters or fewer) when the quote consists of words or phrases in common use today. In the JSTOR Understanding series, we’re using fairly conservative values of 0.8 and 15 for the similarity score and minimum match length, which minimizes the number of false hits but has the unfortunate consequence of filtering out some good matches too. A future refinement of the filtering process will likely include a measurement of how common a phrase is in modern usage. This would enable us to keep an 11-character quote like “hurly burly” but exclude something like “is not true”.
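For readers who want to see the matching and filtering in code, here is a minimal sketch. It rests on assumptions: the function names are hypothetical, the exact similarity formula is not given above, so the sketch takes it to be 1 minus (edit distance / quote length), and the match size is taken to be the length of the normalized quote. The substring search uses the standard dynamic-programming variant of Levenshtein distance in which a match may start and end anywhere in the work text.

```python
def best_substring_distance(quote, text):
    """Smallest Levenshtein distance between quote and any substring of text."""
    # Row for the empty quote prefix: zero cost, since a match may start anywhere.
    prev = [0] * (len(text) + 1)
    for i, q_char in enumerate(quote, start=1):
        # Column 0: the first i quote characters matched against nothing.
        curr = [i] + [0] * len(text)
        for j, t_char in enumerate(text, start=1):
            substitution = prev[j - 1] + (0 if q_char == t_char else 1)
            curr[j] = min(prev[j] + 1,      # quote character with no counterpart
                          curr[j - 1] + 1,  # text character with no counterpart
                          substitution)
        prev = curr
    # The best match may end anywhere in the text.
    return min(prev)


def similarity(quote, text):
    """Assumed score: 1 - distance / len(quote); 1.0 means an exact match."""
    if not quote:
        return 0.0
    return 1.0 - best_substring_distance(quote, text) / len(quote)


def is_match(quote, text, min_similarity=0.8, min_length=15):
    """Apply the two thresholds mentioned above (0.8 and 15) as defaults."""
    return len(quote) >= min_length and similarity(quote, text) >= min_similarity
```

Both arguments are expected to be normalized already. Under these assumptions, a misspelled quote such as "when the batles lost and won" still passes against a work text containing "when the battles lost and won", while "hurly burly" is rejected purely on length, which is the trade-off the thresholds make. The running time grows with the product of the quote and text lengths, which is one reason to limit the search to candidate articles first.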
Where did all the images come from?
The majority of images used for each text and collection come from the Artstor Digital Library.
Is there an API to the underlying data?
At this point we do not have a publicly available API. If you’re interested, please let us know and we’ll be sure to contact you when one is made available.
The literary texts in the JSTOR Understanding series are open access on JSTOR, but an institutional or individual access account may be required to view the full-text of the linked journal articles and book chapters.
You may be able to access these through your institution, or by logging in with your MyJSTOR or JPASS account.
JSTOR is part of ITHAKA, a not-for-profit organization helping the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways.
©2000-2019 ITHAKA. All Rights Reserved. JSTOR®, the JSTOR logo, JPASS®, Artstor® and ITHAKA® are registered trademarks of ITHAKA.