
Articles in the category Data

  1. The Dirichlet-Multinomial in PyMC3

    Modeling Overdispersion in Compositional Count Data

    Having just spent a few too many hours working on the Dirichlet-multinomial distribution in PyMC3, I thought I'd convert the demo notebook I also contributed into a blog post.

    This example (exported and minimally edited from a Jupyter Notebook) demonstrates the use of a Dirichlet mixture of multinomials (a.k.a. Dirichlet-multinomial or DM) to model categorical count data. Models like this one are important in a variety of areas, including natural language processing, ecology, and bioinformatics.

    The Dirichlet-multinomial can be understood as draws from a Multinomial distribution where each sample has a slightly different probability vector, which is itself drawn from a common Dirichlet distribution. This contrasts with the Multinomial distribution, which assumes that all observations arise from a single fixed probability vector. The extra layer of variation enables the Dirichlet-multinomial to accommodate more variable (a.k.a. over-dispersed) count data than the Multinomial.
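
    The two-stage sampling described above can be sketched directly in NumPy. This is a minimal illustration, not the PyMC3 model from the notebook: the proportions in `frac`, the concentration `conc`, and the sample sizes are all made-up values chosen only to make the extra variance visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values only; none of these come from the original post.
n_samples, n_trials = 2000, 50
frac = np.array([0.5, 0.3, 0.2])  # shared expected proportions
conc = 5.0                        # Dirichlet concentration; smaller means more overdispersion

# Multinomial: every sample is drawn from the same fixed probability vector.
mn_counts = rng.multinomial(n_trials, frac, size=n_samples)

# Dirichlet-multinomial: each sample first draws its own probability
# vector from a common Dirichlet, then draws counts from a Multinomial.
p = rng.dirichlet(conc * frac, size=n_samples)
dm_counts = np.array([rng.multinomial(n_trials, p_i) for p_i in p])

# Both have the same expected counts, but the DM counts vary far more.
mn_var = mn_counts.var(axis=0)
dm_var = dm_counts.var(axis=0)
print("multinomial variance:          ", mn_var)
print("Dirichlet-multinomial variance:", dm_var)
```

    With these settings both sets of counts average close to `n_trials * frac`, while the per-category variance of the Dirichlet-multinomial draws is several times larger, which is the overdispersion the model is meant to capture.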

    Other examples of over-dispersed count distributions are the …

  2. Software Carpentry instructor training

    A survival analysis in Python

    Edit (2016-05-31): Added a hypothesis for why my results differ somewhat from Erin Becker's. Briefly: I removed individuals who taught before they were officially certified.

    A couple of weeks ago, Greg Wilson asked the Software Carpentry community for feedback on a collection of data about the organization's instructors: when they were certified and when they taught. Having dabbled in survival analysis, I was excited to explore the data within that context.

    Survival analysis is focused on time-to-event data, for example time from birth until death, but also time to failure of engineered systems, or in this case, time from instructor certification to first teaching a workshop. The language is somewhat morbid, but helps with talking precisely about models that can easily be applied to a variety of data, only sometimes involving death or failure. The power of modern survival analysis is the ability to include results from subjects who have not …
