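Yesterday we featured "After Nobel laureate Romer, another World Bank chief economist has abruptly resigned! The reasons behind it are thought-provoking!", which looked at the tensions between the World Bank and two of its chief economists. Today, the ML Econometrics Research Group introduces "Text Analysis: Steps, Tools, Approaches, and Visualization" by Angela Zoss of Duke University.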





  • Ted Underwood – Where to start with text mining
  • Tooling Up for Digital Humanities – Text Analysis
  • Ryan Shaw – Text Mining
  • John Laudun – Text Analytics 101
  • O'Connor, Bamman, & Smith (2011) – Computational Text Analysis for Social Science
  • Ben Schmidt – Comparing Corpuses by Word Use


  • Native digital text
    • Email
      • (Thunderbird extension, MUSE*)
    • HTML
    • RSS feeds
    • Sample specific services:
      • Twitter
      • Wikipedia
      • Data Liberation Front
      • New York Times API
      • CMU Movie Summary Corpus
      • Corpus of Global Web-Based English (GloWbE)
      • PLOS Text Mining Collection
    • Tutorials for data collection from various services
  • Digitized
    • Internet Archive
    • Project Gutenberg
    • Google Books
    • Hathi Trust (Hathi Download Helper)
    • JSTOR Data for Research* (with Early Journal Content bundle, also from archive.org)
    • PubMed Open Access Subset
    • Monk Workbench*
    • Document Cloud*
    • Open American National Corpus (collection of American English from various sources)
    • WordHoard* (tagged literary texts)
    • Corpus of Contemporary American English
* - also has some processing/analysis capabilities




  • Removing stop words (deleting very common words like "a", "the", "and", etc.)
  • Stemming or lemmatization (ways of combining words that have the same linguistic root or stem)

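Tip: Tools like Wordle may remove stop words (very common words), but they will probably count a word and its plural separately, and they may treat differently capitalized forms of a word as distinct. Before loading text into a word cloud generator, try converting everything to lowercase and running it through a quick stemming tool.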
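The tip above can be sketched in a few lines of Python. This is only a minimal illustration: the stop-word list is a tiny made-up sample, and `crude_stem` is a hypothetical plural-stripping helper, not the Porter algorithm or any tool's actual behavior. It will mangle some words (for example "this"), so treat it as a starting point.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; real lists are far longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "are"}

def crude_stem(word):
    """Very rough plural stripping (hypothetical helper, NOT Porter stemming)."""
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"          # cities -> city
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                # cats -> cat
    return word

def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

counts = Counter(preprocess("The Cat and the cats are in the gardens of the city. Cities!"))
print(counts.most_common())
```

With this preprocessing, "Cat" and "cats" are counted together, as are "city" and "Cities", which is what a word cloud usually needs.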


  • Extracting from PDF:
    • More timesavers to unlock public records data (PDFs into spreadsheets)
    • Tabula (Java program for all platforms)
    • gImageReader (OCR for images, PDFs)
  • Cleaning HTML / XML:
    • Beautiful Soup
    • scrubber (also lemmatizes, removes stop words with prepared lists)
    • HTML to Text (or Story) from Data Science Toolkit
  • Changing tabs to commas, removing line breaks, etc.:
    • Sort My List (also changes case, removes punctuation)
    • TextFixer
    • Transformer (rescue texts from old file formats)
    • Text Mechanic
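For readers who prefer scripting to web tools, two of the cleanup tasks above (stripping HTML tags and turning tab-separated text into comma-separated text) can be approximated with the Python standard library alone. The function names `strip_tags` and `tabs_to_commas` are made up for this sketch. Note that the tag stripper keeps the contents of script and style elements, so a dedicated library such as Beautiful Soup is likely a better fit for real pages.

```python
import csv
import io
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html):
    """Return the visible text with runs of whitespace and line breaks collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()

def tabs_to_commas(tsv_text):
    """Convert tab-separated text to CSV, quoting fields that contain commas."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(reader)
    return out.getvalue()

print(strip_tags("<p>Hello,\n <b>world</b>!</p>"))
print(tabs_to_commas("name\tnote\nAda\tlikes a, b\n"))
```

Going through the `csv` module rather than a plain string replace matters when a field already contains a comma: the writer quotes that field instead of silently splitting it.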


  • Google Refine for entity normalization
  • Vard 2 for cleaning historical text
  • TextFixer for changing case, removing whitespace, sorting
  • Porter stemmer online for stemming text
  • Microsoft Word to convert formatting to structure
    • Finding and replacing formatting and special characters in Word
    • Using regular expressions in Word
    • Convert text to table and back
  • Microsoft Excel to split, concatenate, filter data
    • Excel Text to Columns tool
    • Excel Concatenate function
    • Word Frequency in Excel with Filters, COUNTIF



  • Windows
    (See also Top 10 Cheap Windows Text Editors with Regular Expressions)
    • Notepad++
    • GNU Emacs
    • Vim
    • Kate
    • jEdit (instructions)
    • NoteTab Light
    • Microsoft Word (Extended Instructions)
    • Notepad RE
    • Zeus Lite Editor
    • Programmer's Notepad
    • EditPad Lite
    • PSPad
    • SciTE
    • Crimson Editor
    • Sublime
  • Mac
    (See also Top 10 Cheap Mac OS X Text Editors with Regular Expressions)
    • GNU Emacs
    • Vim
    • jEdit (instructions)
    • Kate
    • Aquamacs
    • TextWrangler
    • Sublime
    • Microsoft Word (Extended Instructions)
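To give a feel for what regular-expression find-and-replace does in editors like these, here is a small sketch using Python's `re` module. The sample text is invented, and regex dialects differ between editors (some write backreferences as `$1` rather than `\1`), so the patterns may need adjusting for a given tool.

```python
import re

# Invented sample data: "Last, First (birth-death)" with inconsistent spacing.
text = "Austen, Jane  (1775-1817)\nDickens, Charles \t(1812-1870)"

# 1. Collapse runs of spaces and tabs into a single space.
tidy = re.sub(r"[ \t]+", " ", text)

# 2. Swap "Last, First" to "First Last" at the start of each line,
#    using capture groups and backreferences.
swapped = re.sub(r"^(\w+), (\w+)", r"\2 \1", tidy, flags=re.MULTILINE)

# 3. Pull out every four-digit number (here, the years).
years = re.findall(r"\b\d{4}\b", tidy)

print(swapped)
print(years)
```

The same three patterns can be pasted into the find/replace dialog of most of the editors listed above once "regular expression" mode is switched on.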



  • Word frequency (lists of words and their frequencies)
    (See also: Word counts are amazing, Ted Underwood)
  • Collocation (words commonly appearing near each other)
  • Concordance (the contexts of a given word or set of words)
  • N-grams (common two-, three-, etc.- word phrases)
  • Entity recognition (identifying names, places, time periods, etc.)
  • Dictionary tagging (locating a specific set of words in the texts)


(From Underwood, T. (2012). Where to start with text mining.)
  • Document categorization
    • Information retrieval (e.g., search engines)
    • Supervised classification (e.g., guessing genres)
    • Unsupervised clustering (e.g., alternative “genres”)
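  • Corpus comparison (e.g., political speeches)
  • Language use over time (e.g., Google ngram viewer)
  • Detecting clusters of document features (i.e., topic modeling)
  • Entity recognition/extraction (e.g., geoparsing)
  • Visualization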
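Several of the approaches above (search, clustering, corpus comparison) rest on some measure of how similar two texts are. One common choice, used here purely as an illustration and not attributed to Underwood's post, is cosine similarity between raw term-frequency vectors. The document names and texts below are hypothetical.

```python
import math
import re
from collections import Counter

def tf_vector(text):
    """Raw term-frequency vector stored as a Counter."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical mini-collection.
docs = {
    "speech_a": "taxes jobs and the economy",
    "speech_b": "the economy needs jobs",
    "novel": "the whale and the sea",
}

# Rank documents against a query, as a toy search engine would.
query = tf_vector("jobs economy")
ranked = sorted(docs, key=lambda name: cosine(query, tf_vector(docs[name])), reverse=True)
print(ranked)
```

In practice the raw counts are usually reweighted (for example with TF-IDF) so that very common words such as "the" do not dominate the comparison.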



  • Voyant Tools – word frequencies, concordance, word clouds, visualizations
  • TAPorWare – various data cleaning, annotating, and summarizing tools in a web interface
  • Netlytic – word frequencies, concordance, dictionary tagging, network analysis
  • Wmatrix – frequency profiles, concordances, compare frequency lists, n-grams and c-grams, collocations
  • Natural Language Processor & Analyzer - word frequencies, collocations, concordance, tokenizer, etc.
  • ManyEyes – interactive text visualizations (network diagram, word tree, phrase net, tag cloud, word cloud)
  • Overview – Automatic topic tagging and visualization
  • Monk Workbench – Corpus selection from library holdings, frequencies and corpora comparisons, supervised classification
  • LIWC - Web version will output a few linguistic dimensions; full version can be licensed for ~$100


  • AntWord – word frequencies
  • AntConc – frequency lists, concordances, collocations, keywords, n-grams
  • TextSTAT – word frequencies, concordances
  • Concordance – word frequencies, concordances, indexes
  • Cowo - semantic network
  • WordHoard - word frequencies, concordances, collocations, scripting (includes tagged literary corpora)
  • CasualConc - kwic concordance lines, word clusters, collocation analysis, and word count
  • NVivo (Duke info) - can cluster sources based on text, also produces phrase nets and tag clouds
  • Tableau (LibGuide) - word clouds


  • TAPoR 2
    • TAPoRware recipes (tutorials)
  • DiRT - digital research tools



  • NVivo
  • brat rapid annotation tool


  • GATE
  • nltk
  • Stanford NLP Group Software
  • National Centre for Text Mining (includes some tools for medical texts)
  • Reporters' Lab Reviews: Entity Extraction
  • Michael Collins' notes on NLP
  • Natural (natural language facilities for Node.js)


  • Most powerful open source sentiment analysis tools
  • Bing Liu's Resources on Opinion Mining (including a sentiment lexicon)
  • NaCTeM Sentiment Analysis Test Site (web form)
  • pattern web mining module (python)
  • SentiWordNet
  • Umigon (for tweets, etc.)
  • List of sentiment analysis tools for Twitter
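The simplest form of sentiment analysis is a dictionary lookup: count the words that appear in a positive list and subtract those in a negative list. The sketch below uses a toy lexicon invented for this example. It is not Bing Liu's lexicon, SentiWordNet, or the scoring of any tool listed above.

```python
import re

# Assumption: a toy lexicon made up for illustration only.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def sentiment_score(text):
    """Number of positive words minus number of negative words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return pos - neg

print(sentiment_score("I love this great phone"))
print(sentiment_score("Terrible battery, bad screen"))
print(sentiment_score("Not good"))  # negation is ignored, so this still scores as positive
```

The last call shows the main weakness of pure word counting: it has no notion of negation or context, which is part of what the more elaborate tools try to handle.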


  • The Programming Historian - Lessons
  • Basic Unix workflow for Text Processing
  • Helpful Unix commands
  • Similarity and Dissimilarity Measures
  • An introduction to text analysis with python
  • Basic Text Analysis in Mathematica
  • Zend Framework - PHP framework for collecting data
  • Text Analysis with R for Students of Literature
  • Python Programming for the Humanities
  • Document Similarity with R



  • With Criminal Intent
  • Various artistic analyses/interpretations of texts by Stefanie Posavec
  • The state of our union is... dumber
  • wordcollider
  • Popcornjs sentiment tracker
  • Metropho.rs
  • Novel Views: Les Miserables
  • A Christmas Carol (TULP interactive)
  • Tolkien's Books Analyzed


  • Google n-gram viewer - word frequencies over time
  • bookworm Open Library - word frequencies over time
  • Historical culturomics of pronoun frequencies - pronoun frequencies by gender over time
  • The Words They Used - bubble cloud of words from national convention speeches, with size and color coding
  • Bib.ly - word frequencies throughout the Bible
  • Ye Shall Know Them By Their Words - word frequencies by topic for presidential nomination speeches (additional description)
  • FACTA+ Visualizer - tree map of term frequency
  • Inaugural language (Boston Globe) - radial scatterplots
  • Mining Books to Map Emotions - frequencies of sentiment terms over time
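Most of the examples above plot how often a word occurs in each time period, relative to all the words from that period. A minimal sketch of that calculation follows. The dated documents are hypothetical, and real projects such as the n-gram viewers involve far more careful corpus preparation.

```python
import re
from collections import Counter

# Hypothetical (year, text) pairs standing in for a dated corpus.
documents = [
    (1900, "the telegraph and the railway"),
    (1900, "the railway strike"),
    (2000, "the internet and the web"),
]

def relative_frequency_by_year(docs, word):
    """Share of all tokens in each year that are equal to `word`."""
    hits = Counter()
    totals = Counter()
    for year, text in docs:
        tokens = re.findall(r"[a-z]+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(word)
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}

print(relative_frequency_by_year(documents, "railway"))
```

Dividing by the yearly token total is what makes years with different amounts of text comparable; the resulting dictionary can be handed to any plotting tool.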


  • Termite - tabular, proportional symbol visualization of words and topics
  • PMLA topic network - a network view of the topics from a topic model of PMLA, where links are created for shared words between topics (additional description)
  • Using Word Clouds for Topic Modeling Results - visualizing the distribution of words for each topic as separate word clouds




