nightshade has written a random headline generator based on data from delfi.lt, a Lithuanian news portal. All code is here. The scraped headlines are saved to an SQLite database and serve as the basis for the generated ones. The frequencies of adjacent word pairs in the original headlines are counted and used as the transition probabilities of a Markov chain, i.e. the probability of choosing a word ‘B’ after the current word ‘A’ is proportional to how often ‘B’ follows ‘A’ in the delfi.lt headline dataset.
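
The counting step described above might look roughly like the following. This is only a minimal sketch, not nightshade's actual code: the function name `count_pairs` is hypothetical, the sample strings are made-up toy data, and it assumes plain whitespace tokenisation with `None` standing in for "end of headline".

```python
from collections import Counter

def count_pairs(headlines):
    """Count start words and adjacent word pairs in an iterable of strings."""
    start_counts = Counter()
    pair_counts = Counter()
    for headline in headlines:
        words = headline.split()  # assumption: plain whitespace tokenisation
        if not words:
            continue
        start_counts[words[0]] += 1
        # Pair each word with its successor; None marks the end of the headline.
        for a, b in zip(words, words[1:] + [None]):
            pair_counts[(a, b)] += 1
    return start_counts, pair_counts

# Toy data, not real delfi.lt headlines.
starts, pairs = count_pairs(["a b c", "a b d", "b c"])

# Probability of 'c' following 'b': its pair count over all pairs starting with 'b'.
total_after_b = sum(n for (first, _), n in pairs.items() if first == "b")
p_c_after_b = pairs[("b", "c")] / total_after_b
```

In this toy dataset ‘b’ is followed by ‘c’ twice and by ‘d’ once, so ‘c’ would be chosen after ‘b’ two times out of three on average.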

Here’s the core of the generator script (in Python for clarity; the page itself uses an optimised C version):

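import numpy

def generate_title_words(db):
    c = db.cursor()
    # Choose the first word in proportion to how often it starts a title.
    c.execute('SELECT word, frequency FROM startwords;')
    startwords, startword_freqs = zip(*c.fetchall())
    s = sum(startword_freqs)
    startword_probabilities = [x / s for x in startword_freqs]
    title_words = []
    word = startwords[numpy.random.choice(len(startwords), p=startword_probabilities)]
    # The loop ends once the sampled successor is None, i.e. a NULL word2
    # (presumably how the end of a title is stored in the pairs table).
    while word is not None:
        title_words.append(word)
        # Fetch every word seen after the current one, with its pair frequency.
        c.execute('SELECT word2, frequency FROM pairs WHERE word1 = ?;', (word,))
        words, word_freqs = zip(*c.fetchall())
        s = sum(word_freqs)
        word_probabilities = [x / s for x in word_freqs]
        # Sample the next word with probability proportional to that frequency.
        word = words[numpy.random.choice(len(words), p=word_probabilities)]
    return title_words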
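
To try the same walk without the real database, something like the sketch below could be used. The table and column names come from the queries above, but the column types, the toy rows, and the helper names (`build_demo_db`, `weighted_pick`, `generate_title_words_stdlib`) are assumptions made for illustration. It also swaps `numpy.random.choice` for the standard library's `random.choices`, and treats a word with no recorded successors as the end of the title, which the original code does not do.

```python
import random
import sqlite3

def build_demo_db():
    """Create an in-memory database with the two tables the queries expect."""
    db = sqlite3.connect(":memory:")
    # Assumption: column types are guesses; only the names appear in the queries.
    db.execute("CREATE TABLE startwords (word TEXT, frequency INTEGER)")
    db.execute("CREATE TABLE pairs (word1 TEXT, word2 TEXT, frequency INTEGER)")
    db.executemany("INSERT INTO startwords VALUES (?, ?)", [("a", 2), ("b", 1)])
    db.executemany(
        "INSERT INTO pairs VALUES (?, ?, ?)",
        # Toy rows; a NULL word2 marks the end of a title.
        [("a", "b", 2), ("b", "c", 2), ("b", "d", 1), ("c", None, 2), ("d", None, 1)],
    )
    return db

def weighted_pick(rows, rng):
    """Pick one value from (value, frequency) rows, weighted by frequency."""
    values, freqs = zip(*rows)
    return rng.choices(values, weights=freqs, k=1)[0]

def generate_title_words_stdlib(db, rng=random):
    """Walk the word-pair chain using only the standard library."""
    c = db.cursor()
    c.execute("SELECT word, frequency FROM startwords")
    word = weighted_pick(c.fetchall(), rng)
    title_words = []
    while word is not None:
        title_words.append(word)
        c.execute("SELECT word2, frequency FROM pairs WHERE word1 = ?", (word,))
        rows = c.fetchall()
        # Assumption: a word with no recorded successors simply ends the title.
        word = weighted_pick(rows, rng) if rows else None
    return title_words

db = build_demo_db()
title = generate_title_words_stdlib(db, random.Random(0))
print(" ".join(title))
```

Passing a seeded `random.Random` instance makes a run repeatable, which is handy when comparing the output against another implementation.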