This week saw a brief explosion of a Markov Chainer toy for Twitter. It even got mentioned on CNN! The internet seems to rediscover Markov Chainers every few years. It's easy to understand why. There's the vanity aspect of using your own words as a training corpus, mixed with some healthy self-mockery when you see your writing style stereotyped so blatantly. For being so simple, the output is surprisingly convincing if you don't look too carefully at it. And they are simple, dead simple, absolutely trivial to implement. So of course they get implemented a lot, for every platform and content channel.
It strikes me as an oddity, though, that an obscure statistical model is so commonly known, if not understood. I first ran into them on MOOs back in the mid-90s, where they were usually trained on ambient conversation instead of a single person's output. I was vaguely aware they were a Thing beyond a silly toy, but it wasn't until machine learning and computational linguistics classes in grad school that I really learned what was going on. String generation is just a fun side effect of why they're actually interesting and useful.
So what are they, really? Start with a sequence of symbols. This could be anything, but human language is the obvious application. Even within that context there are a lot of options for symbols. They could be phonemes from spoken speech, letters of the alphabet within words, entire words within a sentence or even parts of speech. We'll stick with words, as that's the clearest example.
Now collect a bunch of these sequences -- the more the better. We're engaging in corpus linguistics here, where you extract statistical patterns from samples of natural language. The larger the corpus, the more accurate the stats will be. So my entire Twitter feed is a good start, but my Livejournal backlog would be better. Or maybe the archives of the New York Times, or all of Google Books, if you want the stats to reflect something broader than my personal writing style. Whatever.
In addition to all the symbols seen, add two more: START and FINISH at the beginning and end of each sequence. Now break your corpus down into individual symbols. For every symbol, see which symbol follows it. Create a giant lookup table that records, for every symbol, how often each other symbol follows it. What we're doing is simply calculating the statistics of how often one symbol follows another. For instance, if the first word is "robot", then statistically we know that the second word is a lot more likely to be "army" or "uprising" than, say, "parakeet". And you can go deeper, if you like: for every PAIR of symbols, keep track of what symbol follows them. That's called a trigram. In general, the higher your n-grams, the better the results will be, but the storage size explodes exponentially, as does the required size of your training corpus. After all, how often will you see "Seven crimson rockets" in order to learn its probability? Markov Chains over words are almost always only bigrams, but phonemes or parts of speech can go higher.
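To make that concrete, here's a minimal sketch of building the bigram lookup table in Python. This is my own illustration, not code from any particular library, and it assumes the corpus has already been split into tokenized sentences:

```python
from collections import defaultdict

START, FINISH = "<START>", "<FINISH>"

def build_bigram_counts(sentences):
    """Count how often each symbol is followed by each other symbol.

    `sentences` is assumed to be an iterable of already-tokenized
    sentences (lists of words); a real system would need a tokenizer.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for words in sentences:
        symbols = [START] + list(words) + [FINISH]
        for prev, curr in zip(symbols, symbols[1:]):
            counts[prev][curr] += 1
    return counts

# Tiny toy corpus, purely illustrative.
corpus = [
    "the robot army marched".split(),
    "the robot uprising began".split(),
]
table = build_bigram_counts(corpus)
print(dict(table["robot"]))  # {'army': 1, 'uprising': 1}
```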
That's it, you have a Markov Chain. Congratulations! What can you do with it, you ask? Well, we can play the string generation game which is so popular. Look at the entry for the START symbol and all the symbols which follow it. This shows how likely each word is to start a sentence. Choose one based on these probabilities. Now look that word up and find the probabilities of the words which follow it. Choose one based on these probabilities, and repeat until you hit a FINISH symbol. You've generated a sentence which contains many of the statistical properties of English! It won't mean anything, but it will (usually) be fairly convincing at first glance. If you play this trick with phonemes, you'll get a word which sounds like it could be English. If you do it with letters, you'll get a word which looks like English. ("Fomash" as compared to "qxamt", for instance.) Cute trick, super easy, not very useful.
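Here's the generation game as a short sketch, reusing `table` and the START/FINISH markers from the sketch above -- again, just an illustration under those assumptions:

```python
import random

def generate(counts):
    """Walk the chain from START to FINISH, sampling each next word in
    proportion to how often it followed the previous one in the corpus."""
    word = START
    output = []
    while True:
        followers = counts[word]
        word = random.choices(list(followers), weights=list(followers.values()), k=1)[0]
        if word == FINISH:
            break
        output.append(word)
    return " ".join(output)

print(generate(table))  # e.g. "the robot uprising began"
```

With a corpus of two sentences it can only spit the originals back out, but with thousands of sentences the chain starts splicing them together at every shared word.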
There are some really good uses for Markov Chains, however. One that I've implemented myself is part of speech prediction. Here you take a corpus which has been tagged with the part of speech of every word. (These corpora tend to be smaller, as tagging is very labor intensive!) You can generate a Markov Chain over these tag sequences, same as any other. Now, take a new, untagged sentence. For each word, you can say how likely it is to be a given part of speech, based on the corpus. Some words only have one or two possibilities, but particularly when dealing with serious linguistic models that have 40+ parts of speech, most words have several. Just looking at individual word probabilities, it's a pretty hopeless problem. But we have more information available -- we can look at the n-gram probabilities of the tags as well! (This is a case where we might be dealing with trigrams or even higher.)
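As a sketch of the training side -- still my own illustrative code, building on the imports and the START/FINISH constants from the first sketch -- counting tag-to-tag transitions and tag-to-word emissions from a tagged corpus might look like this. The (word, tag) pair format is an assumption made for illustration, not any standard corpus format:

```python
def train_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions from a
    corpus of sentences given as lists of (word, tag) pairs."""
    transitions = defaultdict(lambda: defaultdict(int))  # tag -> next tag -> count
    emissions = defaultdict(lambda: defaultdict(int))    # tag -> word -> count
    for sentence in tagged_sentences:
        prev_tag = START
        for word, tag in sentence:
            transitions[prev_tag][tag] += 1
            emissions[tag][word] += 1
            prev_tag = tag
        transitions[prev_tag][FINISH] += 1
    return transitions, emissions

# A one-sentence toy "corpus" with made-up tags, purely illustrative.
tagged = [[("the", "DET"), ("robot", "NOUN"), ("marched", "VERB")]]
transitions, emissions = train_hmm(tagged)
```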
You can conceive of this model as a sequence of hidden symbols (parts of speech) which are responsible for a sequence of visible symbols (words). We know how likely each hidden symbol is to follow another, and we know how likely each hidden symbol is to generate a given visible symbol. This combination is called a Hidden Markov Model, and there are well-developed algorithms for taking advantage of the extra information it provides; for this particular problem, the Viterbi algorithm. Given a sequence of visible symbols (a sentence), it predicts the most likely sequence of hidden symbols (parts of speech) underlying it.
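Here's a bare-bones sketch of Viterbi decoding over the counts from `train_hmm()` above. It converts counts to probabilities on the fly and does no smoothing, so unseen words get probability zero; treat it as an illustration of the idea rather than a usable tagger:

```python
def viterbi(words, transitions, emissions):
    """Return the most likely tag sequence for `words` under the counted
    transition and emission frequencies (no smoothing)."""
    tags = list(emissions.keys())

    def prob(table, given, outcome):
        total = sum(table[given].values())
        return table[given][outcome] / total if total else 0.0

    # best[i][tag] = (probability of the best path ending in `tag` at
    # position i, back-pointer to the previous tag on that path)
    best = [{tag: (prob(transitions, START, tag) *
                   prob(emissions, tag, words[0]), None)
             for tag in tags}]
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            p, prev = max(
                (best[i - 1][pt][0] * prob(transitions, pt, tag) *
                 prob(emissions, tag, words[i]), pt)
                for pt in tags)
            best[i][tag] = (p, prev)

    # Trace the back-pointers from the most likely final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))

print(viterbi("the robot marched".split(), transitions, emissions))
# ['DET', 'NOUN', 'VERB']
```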
Applying this to the part of speech problem is really amazing, if you ask me. I wrote an implementation for a class I was TAing, and it took maybe an hour, at which point it was tagging parts of speech with >95% accuracy, given a fairly small training corpus. Serious implementations, using the largest corpora available, can get well over 99%. They're matching the accuracy of human taggers, in fact. Someone has to go through the training examples and label all the parts of speech to begin with, and they make mistakes sometimes. Also, even the best experts won't always quite agree on which part of speech is correct in tricky examples. The automatic systems are now just as good, and probably would be even better if they weren't dependent on training input made by us puny humans!
And that's why Markov Chains are actually cool. :)