I am fascinated by developing simple yet novel mathematical models to address interesting and challenging practical problems. During my PhD I have focused on text datasets: they are ubiquitous and pose interesting challenges that demand novel techniques. Broadly, I am interested in probabilistic topic models and Bayesian inference.
I am proficient and experienced in both variational approximation and MCMC-based inference for topic models. I like to explore different areas of application, and in the recent past I have worked with various types of text datasets, including software projects, speech transcripts, multilingual corpora, news/blogs, and comments. I have observed challenging problems in these applications and developed novel mathematical models to solve them.
Overview of my recent work
Topic models are popular mathematical tools for analysing text datasets, where a corpus is a collection of documents. The state-of-the-art approach was to use a single topic vector per document.
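To make the single-topic-vector assumption concrete, here is a minimal generative sketch in the style of standard topic models such as LDA. All names and parameter values are illustrative, not taken from any specific paper:

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics, vocab_size, doc_len = 3, 8, 20

# Corpus-level: each topic is a distribution over the vocabulary.
topic_word = rng.dirichlet(np.ones(vocab_size) * 0.5, size=n_topics)

def generate_document(alpha=0.3):
    """Generate one document under the single-topic-vector assumption."""
    # Document-level: ONE topic proportion vector theta per document.
    theta = rng.dirichlet(np.ones(n_topics) * alpha)
    words = []
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=theta)            # topic assignment
        w = rng.choice(vocab_size, p=topic_word[z])  # word from that topic
        words.append(w)
    return theta, words

theta, words = generate_document()
```

Every word in the document is drawn through the same vector `theta`, which is what ties all of a document's content to a single mixture over topics.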
I came up with the simple yet novel idea of using multiple topic vectors (MTV) per document. We observed that MTV is remarkably effective at (i) discovering subtle topics and (ii) modeling specific correspondence. These two capabilities led to two novel models: subtle topic models (STM, ICML 2013) and specific correspondence topic models (SCTM, WSDM 2014).
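One plausible reading of the MTV idea, sketched for illustration only: each document carries several topic vectors, and each word first selects one of them before selecting a topic. The actual STM and SCTM constructions differ; every name and modeling choice below is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

n_topics, vocab_size, doc_len = 3, 8, 20
n_vectors = 2  # number of topic vectors per document (assumed)

# Corpus-level: each topic is a distribution over the vocabulary.
topic_word = rng.dirichlet(np.ones(vocab_size) * 0.5, size=n_topics)

def generate_document_mtv(alpha=0.3):
    """Hypothetical generative process with multiple topic vectors."""
    # Several topic proportion vectors per document instead of one.
    thetas = rng.dirichlet(np.ones(n_topics) * alpha, size=n_vectors)
    pi = rng.dirichlet(np.ones(n_vectors))  # mixing weights over vectors
    words = []
    for _ in range(doc_len):
        v = rng.choice(n_vectors, p=pi)          # pick a topic vector
        z = rng.choice(n_topics, p=thetas[v])    # pick a topic from it
        w = rng.choice(vocab_size, p=topic_word[z])
        words.append(w)
    return thetas, words

thetas, words = generate_document_mtv()
```

Under this reading, different parts of a document can follow different topic mixtures, which suggests intuitively how finer-grained (subtle) topics could surface.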
Currently I am working on nonparametric Bayesian models for very large-scale datasets (more than 700 million tokens). To my knowledge, no existing MCMC method handles this scale without expensive parallel hardware. I have invented a novel Bayesian nonparametric prior and used the concept of MTV across documents to make MCMC feasible. We have observed significant results.