articles/Mimno2013.Rmd

---
title: "Introduction to R mallet"
author: "David Mimno"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{mallet}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Installation

The ```mallet``` R package is available on CRAN. To install, simply use ```install.packages()```

```{r, eval=FALSE}
install.packages("mallet")
```

To load the package, simply use ```library()```.

```{r}
library(mallet)
```


## Usage

We start out by using the example data from the ```tm``` package.

```{r}
library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
reuters <- VCorpus(DirSource(reut21578), readerControl = list(reader = readReut21578XMLasPlain))
reuters_text_vector <- unlist(lapply(reuters, as.character))
```

We can also use the stopword file from the ```tm``` package.

```{r}
stopwords_en <- system.file("stopwords/english.dat", package = "tm")
```

Create a mallet instance list object. Right now I have to specify the stoplist as a file, I can't pass in a list from R.
This function has a few hidden options (whether to lowercase, how we define a token). See ```?mallet.import``` for details.

```{r}
mallet.instances <- mallet.import(id.array = as.character(1:length(reuters_text_vector)), 
                                  text.array = reuters_text_vector, 
                                  stoplist.file = stopwords_en,
                                  token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")
```

Create a topic trainer object.

```{r}
topic.model <- MalletLDA(num.topics=5, alpha.sum = 1, beta = 0.1)
```

Load our documents. We could also pass in the filename of a saved instance list file that we build from the command-line tools.

```{r}
topic.model$loadDocuments(mallet.instances)
```

Get the vocabulary, and some statistics about word frequencies. These may be useful in further curating the stopword list.

```{r}
vocabulary <- topic.model$getVocabulary()
head(vocabulary)

word.freqs <- mallet.word.freqs(topic.model)
head(word.freqs)
```

Get the vocabulary, and some statistics about word frequencies. These may be useful in further curating the stopword list.

```{r}
vocabulary <- topic.model$getVocabulary()
head(vocabulary)

word.freqs <- mallet.word.freqs(topic.model)
head(word.freqs)
```


Optimize hyperparameters every 20 iterations, after 50 burn-in iterations.

```{r}
topic.model$setAlphaOptimization(20, 50)
```

Now train a model. Note that hyperparameter optimization is on, by default. We can specify the number of iterations. Here we'll use a large-ish round number.

```{r}
topic.model$train(200)
```

**NEW** Run through a few iterations where we pick the best topic for each token, rather than sampling from the posterior distribution.

```{r}
topic.model$maximize(10)
```

Get the probability of topics in documents and the probability of words in topics. By default, these functions return raw word counts. Here we want probabilities,so we normalize, and add "smoothing" so that nothing has exactly 0 probability.

```{r}
doc.topics <- mallet.doc.topics(topic.model, smoothed=TRUE, normalized=TRUE)
topic.words <- mallet.topic.words(topic.model, smoothed=TRUE, normalized=TRUE)
```

What are the top words in topic 2? Notice that R indexes from 1 and Java from 0, so this will be the topic that mallet called topic 1.

```{r}
mallet.top.words(topic.model, word.weights = topic.words[2,], num.top.words = 5)
```

Show the first document with at least 5% tokens belonging to topic 1.

```{r}
inspect(reuters[doc.topics[,1] > 0.05][1])
```

How do topics differ across different sub-corpora?

```{r}
usa_articles <- unlist(meta(reuters, "places")) == "usa"

usa.topic.words <- mallet.subset.topic.words(topic.model, 
                                              subset.docs = usa_articles,
                                              smoothed=TRUE, 
                                              normalized=TRUE)
other.topic.words <- mallet.subset.topic.words(topic.model, 
                                               subset.docs = !usa_articles,
                                               smoothed=TRUE, 
                                               normalized=TRUE)
```

How do they compare?

```{r}
head(mallet.top.words(topic.model, usa.topic.words[1,]))
head(mallet.top.words(topic.model, other.topic.words[1,]))
```