Wise Technology

Reflecting on 8 years of hcluster, an open source clustering package

As I wrote programs I needed for my own research, I strove to open source them so that other researchers might benefit. Over the holidays, I took some time to reflect on hcluster, a Python package I wrote back in 2007 during my PhD studies.

Wise.io advisor and Berkeley Professor Michael Franklin gave an insightful interview with Andreessen Horowitz on the multiplying effect of open source software in research. Over my career, I have been privileged to see this effect firsthand. Professor Franklin and I share the view that open source compounds the rate at which researchers can build on recent advances. Only time tells whether the programs we write for our own research are useful to other researchers. To me, building software that facilitates scientific discovery is as important as making a discovery itself, if not more so.

The Python hcluster (not to be confused with R's hcluster) began as a standalone project for performing hierarchical clustering in Python. It groups objects into “clusters” based on their similarity. The objects can be sentences, chemical compounds, viruses, galaxies, time series, bacteria, or just about anything.
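To make the idea of grouping objects by similarity concrete, here is a minimal sketch of bottom-up (agglomerative) clustering with single linkage, written in plain Python. It is an illustration only, not hcluster's code or API: the function name `single_linkage`, the `num_clusters` parameter, and the sample points are all made up for this example, and the naive loop is far slower than a real library.

```python
from itertools import combinations
from math import dist


def single_linkage(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain.

    Illustrative sketch only; a hypothetical helper, not hcluster's implementation.
    """
    # Start with every point in its own cluster (clusters hold point indices).
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        # Merge cluster b into cluster a; b > a, so deleting b keeps a's index valid.
        clusters[a].extend(clusters[b])
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical 2-D points: two tight pairs and one point far from both.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (9.0, 0.0)]
groups = single_linkage(points, num_clusters=3)
print(groups)
```

With these sample points, the two pairs of nearby points end up together and the distant point stays in a cluster of its own. Swapping `dist` for another similarity measure is what lets the same procedure apply to sentences, compounds, or time series.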

While I was a postdoc and later busy launching and building wise.io, my open source project hcluster grew up and took on a life of its own. hcluster has played a role in several advances, many published in Science and Nature. Here are just a few:

  • Alzheimer’s Disease, Parkinson’s, and Cystic Fibrosis research [1,2],
  • computational tools for drug discovery [3],
  • quantifying resistance of bacteria to antibiotics [4],
  • mapping bacterial species [5,6,7,8], viruses [9], and cancers [10,11],
  • geophysics and seismological analysis [12,13],
  • neuroscientific discovery with fMRI and EEG data [14],
  • natural language sentence understanding [15],
  • removing noisy artifacts from astrophysical surveys [16].

hcluster has succeeded well beyond my own field of research. Sometimes the best way to have impact in science is to write software that isn’t novel in its own right but enables others. Seeing its impact feels, I imagine, much like being a proud parent.

I am now focused on building scalable, robust machine learning frameworks at wise.io, parts of which we plan to open source in the coming year. I can only hope that some of what we’re building at wise.io will have as broad an impact as hcluster.

- Damian

[1] “Atomic View of a Toxic Amyloid Small Oligomer.” Science. Volume 335. No. 6073. Page 1228-1231. http://www.sciencemag.org/content/335/6073/1228.short

[2] “Mechanism-based corrector combination restores ΔF508-CFTR folding and function.” Nature Chemical Biology. September 2012. Page 444-454. http://www.nature.com/nchembio/journal/v9/n7/abs/nchembio.1253.html

[3] “IVSPlat 1.0: an integrated virtual screening platform with a molecular graphical interface.” Chemistry Central Journal. http://journal.chemistrycentral.com/content/6/1/2

[4] “Pervasive genetic hitchhiking and clonal interference in forty evolving yeast populations.” Nature. January 2013. Page 571-574. http://www.nature.com/nature/journal/v500/n7464/full/nature12344.html

[5] “Accurate and universal delineation of prokaryotic species.” Nature Methods. Volume 10. July 2013. Page 881-884. http://www.nature.com/nmeth/journal/v10/n9/abs/nmeth.2575.html

[6] “Use of whole genome sequences to develop a molecular phylogenetic framework for Rhodococcus fascians and the Rhodococcus genus.” Frontiers in Plant Science. August 2014. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4154481/

[7] “Defining bacterial species in the genomic era...” BMC Microbiology. Volume 12. Page 1-11. http://www.biomedcentral.com/content/pdf/1471-2180-12-302.pdf

[8] “Diet rapidly and reproducibly alters the human gut microbiome.” Nature. Volume 505. Page 559-563. http://www.nature.com/nature/journal/v505/n7484/full/nature12820.html?WT.ec_id=NATURE-20140123

[9] “The diversity of zinc-finger genes on human chromosome 19 provides an evolutionary mechanism for defense against inherited endogenous retroviruses.” Cell Death & Differentiation. Volume 21. 2014. Page 381-387. http://www.nature.com/cdd/journal/v21/n3/full/cdd2013150a.html

[10] “Gene signatures ESC, MYC and ERG-fusion are early markers of a potentially dangerous subtype of prostate cancer.” BMC Medical Genomics. Volume 7. Page 1-13. http://www.biomedcentral.com/1755-8794/7/50

[11] “Full-length mRNA-Seq from single-cell levels of RNA and individual circulating tumor cells.” Nature Biotechnology. Volume 30. No. 8. August 2012. Page 777-782. http://www.nature.com/nbt/journal/v30/n8/full/nbt.2282.html

[12] “ObsPy – What can it do for data centers and observatories?” Annals of Geophysics. Volume 54. No. 1. Page 47-58. http://www.annalsofgeophysics.eu/index.php/annals/article/view/4838

[13] “Using cluster analysis to organize and explore regional GPS velocities.” Geophysical Research Letters. Volume 39. No. 18. Page 1-5. http://onlinelibrary.wiley.com/doi/10.1029/2012GL052755/full

[14] “PyMVPA: A Unifying Approach to the Analysis of Neuroscientific Data.” Neuroinformatics. Volume 3. No. 3. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2638552/

[15] “Latent semantic sentence clustering for multi-document summarization.” Johanna Geiß. PhD Thesis. Computer Science. University of Cambridge. http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-802.pdf

[16] “Detrending time series for astronomical variability surveys.” Monthly Notices of the Royal Astronomical Society. 2009. Volume 397. Issue 1. Page 558-568. http://mnras.oxfordjournals.org/content/397/1/558.short


Dr. Damian Eads is a founder of wise.io and creator of its core machine learning technology.

P.S. We’re looking for amazing engineers to help us build out our novel infrastructure for orchestrating massive machine learning pipelines. If you’re the one, get in touch!
