Wise Technology

ParaText: CSV parsing at 2.5 GB per second

Despite extensive use of distributed databases and filesystems in data-driven workflows, there remains a persistent need to rapidly read text files on single machines. Surprisingly, most modern text file readers fail to take advantage of multi-core architectures, leaving much of the I/O bandwidth unused on high-performance storage systems. ParaText, introduced here, reads text files in parallel on a single multi-core machine to consume more of that bandwidth. The alpha release includes a parallel Comma Separated Values (CSV) reader with Python bindings.
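To make the idea of reading a text file in parallel concrete, here is a minimal Python sketch. It is not ParaText's implementation or API: the function names (`chunk_boundaries`, `parse_chunk`, `read_csv_parallel`) are hypothetical, and the sketch assumes a UTF-8 file with no newlines embedded inside quoted fields, so that any line break is a safe place to split.

```python
import csv
import io
import os
from concurrent.futures import ThreadPoolExecutor


def chunk_boundaries(path, num_chunks):
    """Split a file into (start, end) byte ranges that each end on a line break."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, num_chunks):
            target = size * i // num_chunks
            if target <= offsets[-1]:
                continue
            f.seek(target)
            f.readline()  # advance to the start of the next full line
            pos = f.tell()
            if offsets[-1] < pos < size:
                offsets.append(pos)
    offsets.append(size)
    return list(zip(offsets[:-1], offsets[1:]))


def parse_chunk(path, start, end):
    """Parse one byte range of the file as CSV rows."""
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(end - start)
    text = io.StringIO(data.decode("utf-8"), newline="")
    return list(csv.reader(text))


def read_csv_parallel(path, num_workers=4):
    """Parse all chunks concurrently and return the rows in file order."""
    ranges = chunk_boundaries(path, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        parts = pool.map(lambda r: parse_chunk(path, *r), ranges)
        return [row for part in parts for row in part]
```

A thread pool keeps the sketch short, but in CPython the global interpreter lock limits how much the parsing step itself can speed up; a process pool or native code would be needed to use the cores fully. The chunking step is the part that carries over: each worker gets an independent byte range aligned to line boundaries.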

Read More

Topics: Machine Learning, Data Science, Software Engineering

Towards Cost-Optimized Artificial Intelligence

If accuracy improves with more computation, why not throw in more time, people, hardware, and the concomitant energy costs? That seems reasonable, but this approach misses the fundamental point of doing machine learning (and, more broadly, AI): it is a means to an end. And so we need to have a little talk about cost-optimization, encompassing a much wider set of cost-assignable components than is usually discussed in academia, industry, and the press. Viewing AI as a global optimization over cost (i.e., dollars) puts the work throughout all parts of the value chain in perspective (including the driving origins of new specialized chips, like IBM's TrueNorth and Google's Tensor Processing Unit). Done right, it will lead, by definition, to better outcomes.

Read More

Topics: Machine Learning, Data Science, Software Engineering

Cache Ugly Reporting Queries With Materialized Views and Docker

Confidence and trust in your SaaS product depend, in part, on the continual conveyance of the value of the solution you provide. The reporting vectors (web-based dashboards, daily emails, etc.) obviously depend upon the specifics of your product and your engagement plan with your customers. But underlying all sorts of reporting is the need to derive hard metrics from databases: What's the usage of your application by seat? How has that driven value/efficiency for them? What are the trends and anomalies worth calling out?

The bad news is that many of the most insightful metrics require complex joins across tables; and as you scale out to more and more customers, queries across multitenant databases will take longer and longer. The good news is that, unlike for interactive exploration and real-time monitoring and alerting use cases, many of the queries against your production databases can be lazy and done periodically.

At Wise.io, we needed a way to cache and periodically update long-running, expensive queries so that we could have more responsive dashboards for our customers and our implementation engineers. After some research, including exploration with third-party vendors, we settled on materialized views. This is a brief primer on a lightweight caching/update solution that uses materialized views coupled with Docker.
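As a rough illustration of the caching pattern described above, the following Python sketch builds the SQL for creating and refreshing a materialized view and runs the refresh on a fixed interval. It assumes PostgreSQL syntax, which the teaser does not specify; the function names and the `execute` callable (standing in for a database cursor's execute method) are hypothetical, and view names are assumed to be trusted identifiers rather than user input.

```python
import time


def create_view_sql(view_name, query):
    """Build a statement that stores the result of an expensive query."""
    return f"CREATE MATERIALIZED VIEW IF NOT EXISTS {view_name} AS {query};"


def refresh_view_sql(view_name, concurrently=False):
    """Build a statement that recomputes the stored result.

    In PostgreSQL, CONCURRENTLY avoids blocking readers during the refresh
    but requires a unique index on the materialized view.
    """
    keyword = "CONCURRENTLY " if concurrently else ""
    return f"REFRESH MATERIALIZED VIEW {keyword}{view_name};"


def refresh_periodically(execute, view_names, interval_seconds, cycles):
    """Refresh each view once per cycle, sleeping between cycles.

    `execute` is any callable that accepts a SQL string (hypothetical here).
    """
    for cycle in range(cycles):
        for name in view_names:
            execute(refresh_view_sql(name))
        if cycle < cycles - 1:
            time.sleep(interval_seconds)
```

Dashboards would then query the view by name instead of rerunning the joins, and a small container could run a loop like `refresh_periodically` on whatever schedule the staleness tolerance allows.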

Read More

Topics: Software Engineering

Make Docker Images Smaller with This Trick

The architectural and organizational/process advantages of containerization (e.g., via Docker) are commonly known. However, in constructing images, especially those that serve as the base for other images, adding functionality via package installation is a double-edged sword. On one hand, we want our images to be as useful as possible for the purposes for which they are built. On the other, as images are downloaded, moved around our networks, and live in our production environments, we pay a real price in speed and cost for bloated image sizes. The obvious onus on image creators is to make them as small as is practical without sacrificing efficacy and extensibility. This post shows how we shrank our images with a pretty simple trick...

Read More

Topics: Machine Learning, Data Science, Software Engineering

Asking RNNs+LSTMs: What Would Mozart Write?

Preamble: A natural progression beyond artificial intelligence is artificial creativity (AC). I've been interested in AC for a while and started learning about the various criteria that the scholarly community has devised to test AC in art, music, writing, etc. (I think crosswords might present an interesting Turing-like test for AC). In music, a machine-generated score that is deemed interesting, challenging, and unique (and indistinguishable from the real work of a great master) would be a major accomplishment. Machine-generated music has a long history (cf. "Computer Models of Musical Creativity" by D. Cope; Cambridge, MA: MIT Press, 2006).

Deep Learning at the character level: With the resurgence of interest in Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM), I thought it would be interesting to see how far we could go in autogenerating music. RNNs have actually been around in music generation for a while (even with LSTM; see this site and this 2014 paper from Liu & Ramakrishnan and references therein), but we're now getting into an era where we can train on a big corpus and thus train a big, complex model. Andrej Karpathy's recent blog showed how training a character-level model on Shakespeare and Paul Graham essays could yield interesting, albeit fairly garbled, text that seems to mimic the flow and usage of

Read More

Topics: Machine Learning, Data Science

Five Takeaways on the State of Natural Language Processing

Thoughts following the 2015 "Text By The Bay" Conference

The first "Text By The Bay" conference, a new natural language processing (NLP) event from the "Scala bythebay" organizers, just wrapped up tonight. In bringing together practitioners and theorists from academia and industry, I'd call it a success, save one significant and glaring problem.

Read More

Topics: Machine Learning, Data Science, Conferences and Workshops, Predictive Analytics

Reflecting on 8 years of hcluster, an open source clustering package

As I wrote programs I needed for my own research, I strove to open source them so that other researchers might benefit. Over the holidays, I took some time to reflect on hcluster, a Python package I wrote back in 2007 during my PhD studies.

Read More

Topics: Data Science, Software Engineering

Containerized Data Science Hackday in Berkeley

Containerization, especially with Docker, has become a central paradigm across the modern engineering stack. And as productionized data science should be an instantiation of best engineering practices, it's no wonder that building, deploying, and maintaining data science workflows can benefit immensely from containerization.

Read More

Topics: Data Science, Software Engineering

Two Unicorns of Tech: Full-Stack Engineers and General Data Scientists

On hiring those with deep knowledge and specialized talents

Read More

Topics: Careers, Data Science, Software Engineering

Welcome to the New Wise.io Technology Blog

Wise.io’s machine-learning applications are built with the non-technical user in mind. But the farther removed our products are from technical consumers within businesses, the greater the challenges we face in building our technology and data science stacks. Generalizing and automating machine learning workflows in production for specific use cases is hard, and it is the central focus of our engineering and data science teams. We’ve already demonstrated some success here: Wise Support™ for Zendesk is our first GA application that requires little more than an OAuth from a Zendesk admin user to get amazing predictive insights for customer support.

Wise.io was founded by deep technologists with PhDs in statistics, machine learning, and astrophysics. Staying true to our roots, while we learn and grow, means keeping up with the rapidly evolving landscape of technology and methodologies. We’ve also been innovating within Wise, driven by the daily pressures that arise from exposure to real-world data. It’s a privilege to get to work with the sort of raw, noisy, big, dirty, and streaming datasets we see. This is not the sort of data that exists in the UCI repository or in Kaggle competitions and, as such, innovation happens out of necessity. Our work in what we call Wise Labs is not algorithmic research and development in search of a problem, but is instead focused on innovative and novel approaches to real problems arising in our customer applications. Necessity is the mother of invention and it is alive and well here at Wise.

Read More

Topics: Machine Learning, Data Science, Software Engineering