A New York Times Daily COVID-19 Word Cloud
drowning in words
Last night I really hit my wall for taking in any information about the COVID-19 pandemic. I think that most of us try to stay educated and informed, but the barrage of news - and dire news at that - at some point affects your physical and mental state, and I’ve hit survival mode. I tried limiting my media intake to one hour in the morning and one hour at night, but the social isolation has me reaching out to social media channels that are themselves another onslaught of information. Twitter, FB, and IG are no longer just the cat videos or bike races or daydreams of that next holiday; scrolling through social media, every single post is another story of the virus. Obviously, stepping away from the computer or mobile device [gasp] is the best way to deescalate rising anxiety, but even outdoors we are witnessing the impact in empty storefronts and roads. There really is no escaping this event.
Today in particular I felt struck in the chest by the barrage of words used by the media as they cover the COVID-19 pandemic. Every day - and for good reason - we are pummeled by weaponized words of dread, death, and doom. Having been an emergency responder for almost 20 years, I know that there is a huge difference between being in the hot zone and being on the public side where we are fed highly filtered information through the media distilled to the most dramatic. Part of that is to get the information out and part of that is to sell a story regardless of what reality looks like. But at what cost to the public? I’m at the point where I don’t even read the news articles because I’m so overwhelmed with the words of hopelessness written in the headlines.
who’s the boss now, words
Only three days after returning from spring break, our classes for the rest of spring semester at Duquesne were cancelled on campus and slowly moved online. Adding to the feeling of helplessness with the pandemic is the lack of daily structure to juggle multiple goals and to challenge my brain that was being ripped out of its comfort zone on a regular basis. Well, Words, I’ve decided that it’s time for me to take my power back. This evening I harnessed some of my curiosity and creativity (and found another reason to procrastinate from the two papers that I do have to write) by making daily word clouds and bar charts to analyze what words the media are using in their coverage. I’m in control now!
The idea of a word cloud is to give weight, or a larger font size, to those words that are used most often in a document. This is a well-known way to demonstrate the power of words as a simple infographic. In the example to the right, the word Pandemic is used on the webpage 18 times so its font size is 18px. The word China is used 9 times on the webpage and has a font size of 9px.
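For anyone curious how that count-to-size idea works under the hood, here is a minimal Python sketch. It is not how WordArt does it internally (I don't know its code); the function names and the one-pixel-per-use scale are just assumptions that match the example above.

```python
import re
from collections import Counter


def word_counts(text):
    """Count how often each word appears, ignoring case."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)


def font_sizes(counts, px_per_use=1):
    """Give each word a font size (in px) proportional to its count.

    px_per_use=1 is an assumed scale: 18 uses -> 18px, 9 uses -> 9px.
    """
    return {word: n * px_per_use for word, n in counts.items()}


sample = "pandemic china pandemic virus pandemic china"
sizes = font_sizes(word_counts(sample))
print(sizes)  # {'pandemic': 3, 'china': 2, 'virus': 1}
```

A real word cloud tool would also place and color the words, but the sizing step is really just this: count, then scale.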
I tried a few free word cloud apps, such as WordArt and the Google Docs Add-On, Word Cloud Generator. I settled on WordArt because of its many useful data editing features: Filter to find words you’re looking for; delete words that you do not find to be relevant (ex. Through, During, Day, Much, etc); edit words to make ones you need (ex. New + York = New York); and delete large groups (ex. Count = 1). Google Docs did not have this flexibility, and when dealing with a New York Times webpage that is almost 3,500 words, removing some items makes a big difference in the word cloud’s visual impact.
Goal
My goal is to examine both the New York Times Homepage and special Coronavirus webpage, which is free access to the daily compilation of their articles on the event. I chose both because the homepage is… well, the homepage, and the first interaction that a user has with the site. I was curious what kind of words were used as a whole there as well as on the Coronavirus-specific page. I will create a word cloud each day through March 31, 2020, to see if there is any change in the words used by the media. If coverage of the pandemic is extended into April or god forbid May (because we’re finally handling it right??), then hopefully that will result in a more positive choice of words and I will start reading the news again. To procrastinate on those two papers a little more, I will also create bar charts in Squarespace showing the Top 20 words used each day on both pages.
Methods
At 7 p.m. EDT I save the webpages as TXT files and name them with the date (ex. NYT Homepage 2020_0319.txt). For the NYT Coronavirus webpage, I make sure to capture all the articles for the day by clicking the Show More button at the bottom of the webpage.
For each txt file, I open it in WordPad and delete all the HTML code, journalist names (sorry but yeah), deadline dates, special characters not read from HTML, NYT section names, duplicative photo descriptions, and references to advertisements. The goal is to analyze text that is only headlines and leads.
I copy/paste the final text into a new page on this website and plug the URL into WordArt, which imports all the page’s words into a list. (The app will only take active URLs, CSV, or Excel files.)
I go through the list of words and delete any that may be irrelevant to the word cloud, and I keep track of deleted items in a separate Notepad document. (Yes, this is highly subjective, but what are you doing in your free time? :) With this project I also delete words that are only used once, which eliminates nearly half of the list and makes for a cleaner word cloud. So far the final list includes only words used at least twice.
In WordArt, I download the list of final words as a CSV, which will be used later.
The default settings in WordArt make the graphic easy to use as is, but there are several great features that can be edited such as the fonts, font colors, and image shape. Then it’s ready to save as a PNG or JPG, or as I’ve done, copy the code to share it as an embedded interactive image.
In Squarespace, I add the embedded code from WordArt to show the final word cloud.
In Excel, I open the CSV file to grab the top 20 words that were used in WordArt, and save that as a new CSV. I open that in Notepad to use as a bar graph in Squarespace.
Finally, in Squarespace I add a bar graph using the top 20 CSV data. I thought this would help visualize the popularity of words through the study period.
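The filtering and CSV steps above are done by hand in WordArt and Excel, but they could also be scripted. Here is a rough Python sketch, assuming the cleaned headline text is already in a string. The IRRELEVANT set only holds the few example words mentioned earlier (the real deleted list is longer and subjective), and the function names and the "word,count" column headers are made up for illustration.

```python
import csv
import io
import re
from collections import Counter

# Example words only; the real "irrelevant" list is a judgment call.
IRRELEVANT = {"through", "during", "day", "much"}


def final_word_list(text, irrelevant=IRRELEVANT, min_count=2):
    """Count words, drop irrelevant ones and any used fewer than min_count times."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    kept = [
        (word, n)
        for word, n in counts.items()
        if n >= min_count and word not in irrelevant
    ]
    # Most-used words first; ties are broken alphabetically.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


def to_csv(rows):
    """Turn (word, count) rows into CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["word", "count"])
    writer.writerows(rows)
    return buf.getvalue()


sample = "Virus spreads during day. Virus cases rise during day. Cases rise."
rows = final_word_list(sample)
print(to_csv(rows))
```

In this sample, "spreads" is dropped for appearing once and "during" and "day" are dropped as irrelevant, leaving cases, rise, and virus with a count of 2 each.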
*Notes
Where you see “Home” in the results, it does not refer to “homepage” or “home” in reference to a webpage. It refers to a domicile.
Where you see “New” that is used in the text as an independent adjective, not in reference to “New York,” “New Jersey,” or “new cases.”
The word “Coronavirus” was removed from the Coronavirus webpage analysis because that word would always be at the top.
The bar charts show the Top 20-ish words. I did not want to cap it at 20 and leave off a few other words that also have the same “count” as number 20.
The “Powered by WordArt” text is the price we pay for cool free stuff :)
You should be able to click on the bar chart results to see the actual count for each word. It has been acting fickle, though.
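The "Top 20-ish" rule from the notes above (keep the top 20, plus any other words tied with number 20) can be sketched in Python like this. The function name and the (word, count) row format are assumptions for illustration, not something exported by WordArt or Squarespace.

```python
def top_n_with_ties(rows, n=20):
    """Return the n most-used words, plus any words tied with the nth one.

    rows is a list of (word, count) pairs in any order.
    """
    ranked = sorted(rows, key=lambda pair: (-pair[1], pair[0]))
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1][1]  # count of the word sitting in nth place
    return [pair for pair in ranked if pair[1] >= cutoff]


example = [("a", 5), ("b", 4), ("c", 3), ("d", 3), ("e", 2)]
top = top_n_with_ties(example, n=3)
print(top)  # [('a', 5), ('b', 4), ('c', 3), ('d', 3)]
```

With n=3, "d" sneaks in as a fourth word because it has the same count as "c" in third place, which is exactly why the charts sometimes show a few more than 20 bars.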