Assignment 3 – text mining
Purpose: Some of the most important data we have is text. Text data includes books, articles, blog posts, social media posts, emails, reports, journals and diaries, shopping lists, and more. This data is unstructured and can be massive. Software tools can help us make sense of large quantities of text, and using software to analyze text in this way is called text mining. A practical example is chat bots, which use algorithms to interpret what the user types so that they can select an appropriate response. In this assignment we will explore the text mining strategies of looking at word frequency and occurrence. Other methods, beyond the scope of this class, include part-of-speech tagging and machine learning.
In this assignment you will need two text files containing text data. You can copy articles from the web, download free ebooks as text files (https://www.gutenberg.org/), use a paper that you've written, etc. Each file must be a plain text file (.txt). You can save text as a .txt file from any word processor or text editor, including Microsoft Word, WordPad, Notepad, or Notepad++. You will use two free browser-based text mining tools, one for each document.
Task 1. Visualizing Word Frequency. Looking at the frequency of occurrence of each word can give you an overall sense of a document. You will use your first text file to create a word cloud to visualize frequency and understand how stop words affect word frequency analysis. We will use Lexos for this: http://lexos.wheatoncollege.edu/
Task 2. Taking a deeper look at word frequency and occurrence. For the second text document we will use Voyant Tools to explore other visualization methods as well as look at word correlations.
Download the Word document template, paste your screenshots into it, and answer the questions for each task. This website can help you find out how to take a screenshot on whatever device you are using: https://www.take-a-screenshot.org/
Steps for Task 1:
1. Open the following website: http://lexos.wheatoncollege.edu/upload
2. Click browse and upload your text file (or drag the file). [Important note: Do NOT use the scrape URL tool! You will end up with tags in your text document that would need to be cleaned.]
3. Click the ‘Manage’ tab and make sure you have your txt document (and only your text document) selected. You should see the blue bubble filled on the left.
4. Click the ‘Visualize’ tab and select Word Cloud. The word cloud shows the most frequent words larger than the other words. You should notice that the largest words are common words like and, the, and I. These are called stop words and don’t convey much meaning. Take a screenshot and add it to your Word document.
5. The next step is to remove the stop words from our word cloud. Attached to this assignment is a text file of stop words. Download this to your device. In Lexos, click the ‘Prepare’ tab. You will see a section called Stop/Keep words. Click the ‘stop’ button. Click upload and upload the stopwords.txt file from your device. Click the apply button in the Previews window to the right. You should notice the preview of the text change.
6. After applying, click the ‘Visualize’ tab again. Your word cloud should be very different now. Take a screenshot and add it to your Word document.
7. Answer the questions in the Word document for Task 1.
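For anyone curious about what happens behind the word cloud, the idea in Task 1 can be sketched in a few lines of Python. This is only an illustration, not how Lexos itself is implemented: the sample sentence and the short stop word set are made-up stand-ins for your text file and the attached stopwords.txt, and the function names are hypothetical.

```python
import re
from collections import Counter

# Made-up stand-ins for your .txt file and the stopwords.txt attachment.
SAMPLE_TEXT = "The cat and the dog ran. The dog barked, and I laughed at the cat."
STOP_WORDS = {"the", "and", "i", "at", "a", "of", "to"}


def tokenize(text):
    """Lowercase the text and split it into words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text, stop_words=frozenset()):
    """Count how often each word occurs, skipping any word in stop_words."""
    return Counter(word for word in tokenize(text) if word not in stop_words)


if __name__ == "__main__":
    # Before removing stop words, common words such as "the" dominate the counts.
    print(word_frequencies(SAMPLE_TEXT).most_common(3))
    # After removing stop words, the more meaningful words rise to the top.
    print(word_frequencies(SAMPLE_TEXT, STOP_WORDS).most_common(3))
```

A word cloud simply draws each word at a size based on counts like these, which is why removing stop words changes the picture so much.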
Steps for Task 2:
1. Open the following website: https://voyant-tools.org/
2. Upload your text file.
3. You will see data in five different windows and each window has a tab.
a. In the upper left window click ‘Links’
b. In the upper middle window click ‘TermsBerry’
c. In the upper right window, leave it on ‘Trends’ (or click it if it’s not there)
d. In the bottom left window, leave it on ‘Summary’ (or click it if not there)
e. In the bottom right window click ‘Correlations’
4. Take a screenshot and add it to your Word document.
5. Answer the questions in the Word document for Task 2.
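To get a feel for what a word correlation means, here is a rough Python sketch. It assumes that a correlation compares how often two words appear in each of several equal-sized segments of the document, using Pearson's correlation coefficient. Voyant's exact calculation may differ, and the sample text, the segment count, and the function names here are hypothetical.

```python
import math
import re

# Made-up sample: the first half is about the sea, the second half about land.
SAMPLE = ("sea ship sea ship storm sea ship calm "
          "land farm land farm rain land farm sun")


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def segment_counts(tokens, term, segments):
    """Split tokens into consecutive, roughly equal segments and count term in each."""
    size = math.ceil(len(tokens) / segments)
    return [tokens[i:i + size].count(term) for i in range(0, len(tokens), size)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either never varies)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def term_correlation(text, term_a, term_b, segments=4):
    """Correlate the per-segment counts of two words across the document."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return pearson(segment_counts(tokens, term_a, segments),
                   segment_counts(tokens, term_b, segments))


if __name__ == "__main__":
    print(term_correlation(SAMPLE, "sea", "ship"))  # words that rise and fall together
    print(term_correlation(SAMPLE, "sea", "land"))  # words that appear in different parts
```

A value near 1 means the two words tend to rise and fall together across the document, and a value near -1 means one tends to appear where the other does not.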
Upload the following:
1. Word document with screenshots and answers to the questions for Tasks 1 and 2
2. The text files that you used in Tasks 1 and 2.