Oriol.Codes · Data Science and Coding

Data Science for the Rest of Us

Posted on February 22, 2019 · 8 min read

A few days ago, I read Randy Au’s Medium story “Succeeding as a data scientist in small companies/startups” and, as I wrote in a comment to Randy’s article, I immediately felt that he was touching on a subject that requires much more attention from the data science community than it has received so far. Let me put it in all caps so that I can impress the importance of this point: MOST OF US DO NOT WORK AT GOOGLE.

Now, of course, I don’t mean that there’s anything wrong with working at Google (not that I’m resentful or anything), and when I say Google I really mean “company with tons of data, massive infrastructure, and a well-established data-driven culture.” I imagine that working at one of these companies is like a kind of data science nirvana, where when you turn on your workstation in the morning you jump on a soft blanket and are smoothly carried over a collection of well-defined problems, servers with endless memory and GPUs, and supportive bosses who can’t wait to congratulate you on what an amazing job you did running that BERT network that made the company another million dollars by lunch time. OK, I might be getting a little carried away about what life is like on the other side of this particular digital divide, but the point that I’m trying to make is that all the talk about “data scientist” being the sexiest job of the 21st century and all the hype about the latest machine learning techniques project an image of what being a data scientist is that doesn’t correspond to life in the trenches.

Sure, we repeat ad nauseam that data scientists spend most of their time cleaning data, and we’re starting to hear some voices saying that not everything is green in the data science job landscape (I heard that Vicki Boykis’s recent article has given minor heart attacks to more than one about-to-graduate bootcamp student). Yet when we look at the resources available to those who are considering a career in data science, and particularly the marketing put forth by the many education startups that offer job guarantees, it is hard not to think of data science as a glamorous career path that will allow you to ride your well-honed math skills into a future of success and riches.

Now, don’t get me wrong: data science DOES present amazing opportunities and, if the n=1 of my humble experience can be used as evidence, it IS a wonderful job full of excitement and intellectual stimulation (I will take it any day over my old gig as a university professor). Yet I believe that most people getting into the field have little idea of what doing data science involves in the thousands of small and new organizations that want to jump onto the data science bandwagon and are pulling the trigger on hiring data scientists without having a clear idea of what they want us for. Thus the importance of Randy’s article and why I decided to piggyback on his efforts and share my views on the matter: instead of writing yet another tutorial on logistic regression, let’s tell people who are considering a job in data science about the non-glamorous parts of our days so that they don’t feel cheated and know what they’re getting into.

I am planning on writing a whole series on the challenges of doing data science in small companies and startups in the future, but today I want to focus on what I see as a cornerstone of data science work: helping to build an organizational culture that puts rigorous data analysis at the very center of all decision-making. Now, reading this might put you in a catatonic state, but not all companies share this view of the role of data. Even more: many companies think that they are data-driven when in fact their use of data is spotty and inconsistent at best. And yes: they are hiring data scientists.

The reason I’m mentioning this is not to lambast and shame these companies, quite the contrary. The point I’m trying to make is that instead of dismissing these firms and looking for work elsewhere, data scientists should see doing what these companies need as a central part of our job. A well-established company with a healthy data culture will know exactly what it needs and hire someone to do a specific job (cue the growing literature on the 2, 3, 5, 7, or 10 different types of data scientists). But many organizations are not there yet, and a data scientist working for them will not be able to spend all her time building ultra-exciting billion-layer ANNs. Often, not only will the data and the infrastructure be missing, but so will the expectation or the perceived need. What will our brand new data scientist do when no one seems to really care about data in company X? Here’s a (surely incomplete) list of what I think she should do:

  1. Ask for data all the time, like, constantly: there are probably plenty of talented people in the organization who know their job inside and out but who, at best, have never thought about the systematic use of data to inform their decisions, or, at worst, are actively hostile to anyone else telling them how to do their job. Please don’t get cocky and act all high and mighty here because you know what a gradient boosting machine is. Collect any data you can get your hands on, ask people to document what they do even if it’s in random spreadsheets, put the data in a form that is conducive to analysis, and try to gather insights that will improve the way the firm works. Present them in a humble, constructive, non-threatening way, and people will eventually start using them. Then you’ll be flooded by requests for help and you’ll never have lunch away from your desk again.
  2. Ask questions all the time: often, people don’t know what they don’t know, and it’s by asking questions that the need for new and/or systematic data will emerge. This will also provide you with the kind of domain knowledge that a data scientist needs to be truly effective. Your random forest will not be of much help if you don’t understand what kinds of features are relevant and what is important to predict and why. You need to understand the business and its needs in order to think about metrics and measurement, and the people whose work you are going to impact need to perceive you as someone who understands the challenges they face. When there are no clear answers to your questions, you will have an opportunity to push for data collection, systematization, and analysis.
  3. Ask why all the time: many times, things are done in a certain way because they’ve always been done like that. Some of those times, there is a good reason behind the current workflow. Others, the good reason is way way in the past and it’s not a relevant factor anymore. And still other times there is no reason at all, and the status quo is just the solidified outcome of ad hoc decisions and serendipity. Learn to identify where there is no good rationale for how things are done and where an in-depth exploration of the issue with data is warranted. If you can build a good case for a change of course (i.e., one that will bring tangible benefits to the organization), people will start to see the upside of your approach and are more likely to embrace it.
  4. Ask for help all the time: the goal of the previous points is to help your organization develop a data culture. If you do your work in isolation and only emerge from your cave once in a while to show everyone else how smart you are, you will not only get less done and have little actual impact, but people will regard you with overt animosity (and might steal your yoghurt from the fridge). For one, you are bound to miss crucial stuff, and that will make it easy to dismiss your work. But more importantly, people need to see you working with them on the issues that they think are relevant. That’s why you ask them for data and ask them questions. If those who will be affected by the product of your work have a sense of ownership over it, they are much more likely to engage with and adopt it.
  5. Ask for the chance to share what you found all the time: you’ve probably heard before that communication is central to data analysis (what’s the point of your cool new algorithm if nobody understands why you used it or what it does?), but I cannot overemphasize the importance of communicating what you learn. Notice that I said “what you learn,” not “what you did.” Nobody gives a damn about what a support vector machine does, and the communication I’m talking about has little to do with the technical aspects of your work. It’s great that you are so excited about the beauty of backpropagation, but the people in your firm want to know how to fix the problems that they deal with every day in sales, production, marketing, customer relations, or whatever other area of the organization you didn’t even know existed. Talk to people not only to learn from them, but also to figure out the best way to tell them about what you do and why it matters. Don’t take for granted that everyone wants to hear you pontificate because you have a Metis certificate of completion stapled to your cubicle wall. Engage in a two-way dialogue and show how what you do is useful to others in their own terms.

All these points might not sound very glamorous, and to some will even sound like mere common sense. But I have yet to see a school that teaches these concepts, or an interviewer who tries to assess the associated skills. Let me insist once more: I’m not saying that technical skills are not essential, and in many jobs they may be the most important asset for a data scientist. But in many contexts your non-technical skills will be a much better predictor of your success, both personally and in terms of your impact on the organization. Even more, your soft skills and your effect on the company’s culture are possibly the only way that you can get to do the fancy stuff down the road, once the data, the infrastructure, and the need for analysis are in place and well-established.

I hope that you also keenly perceived my subtle hints and see that all the ‘asking’ I advocated for above needs to be accompanied by a certain attitude that combines humility with genuine engagement. In a company, learning is a collective process, and the main role of the data scientist is to help the company learn and improve by using all the data that it is able to gather, process, and leverage. You cannot do this on your own, and knowing your place in the corporate culture and how you can have a meaningful impact on it is one of the most important soft skills a data scientist can bring to the table.