How to measure entrepreneurial ecosystems

I love reading data-driven articles on entrepreneurial ecosystems, and Nick Beim's new article "The Rise and Future of The New York Startup Ecosystem" is no exception. What's unique about it is that it uses two measures, total amount of VC investment and exits above $500 million, to compare ecosystems. Nick shows that while NYC's ecosystem might still be small by Silicon Valley terms, it's the same size as Boston (if not bigger) despite the fact that NYC lacks a high-profile research university like MIT (Columbia stands in the corner of the room and looks at the floor in embarrassment; no one even notices NYU).

Both these metrics are nice because they give us real, comparable numbers. Comparing regions is always a difficult process because of a lack of good, comparable data that actually talks about entrepreneurship specifically.

However, there are some problems with these measures. VC isn't geographically neutral: venture capitalists tend to invest in firms near them (I could give you so many citations for this but just throw 'venture capital geography' into Google Scholar and go wild). So, places with more VC firms are likely to have more VC investments. This is called a 'Matthew Effect,' meaning that places with an already existing advantage continue to get that advantage at the expense of worse-off places. So, places with lots of existing VC investment will attract more VC firms, leading to higher levels of investment. Now, this isn't deterministic: New York-based venture capital firms are increasingly investing in firms in Toronto and Ottawa. Two venture funds put over $100 million into Ottawa's Shopify last year. The exit measure has a similar problem: without this kind of financial backing, it's hard to get a $500 million+ exit.

Because of this, there are maybe 5-10 cities in America where we'd expect to see enough venture capital invested to actually put the data in a spreadsheet and make a pretty graph without resorting to logging everything. But I think that there are more than 10 entrepreneurial ecosystems in America. We need to find better metrics that allow us to identify them in ways that go beyond VC investments and exits.

However, this means a lot more work on the part of researchers. It's easy to get VC data if you're willing to pay; it's much harder to figure out the contours of a regional culture or count how many mentors there are within a community. This requires a more in-depth, case study approach. I'm just beginning to think about how we can measure ecosystems in a way that gets at all these hidden factors but at the same time allows us to systematically compare different regions.

One last note: Nick's article is particularly nice because it actually mentions cost of living. Major ecosystems like NYC, Boston, and San Fran are having a cost-of-living crisis. It's going to become increasingly hard for entrepreneurs, especially young ones, to actually live and work in these places. How entrepreneurs support themselves prior to being bought by Facebook for 19 billion dollars is going to become an increasingly important question as time goes on.

Big Data and Deep Data

I'm officially done with my dissertation. It's been handed in to my committee and I couldn't change anything, even if I wanted to. This puts me in an odd position: for the past 24 months most of my days were spent working on my dissertation, either analyzing my interviews, outlining my ideas, writing or editing. Being done with this has left a pretty big hole in my daily schedule. I've started work on a few other projects to fill this gap, projects that have me working with entirely different types of data than in my dissertation.

My dissertation research was interview based. I conducted 110 interviews which produced something like 70 hours of tape and over 3000 pages of transcripts. I have lots of detail on the 80 entrepreneurs I talked to. I know how and why they started their company, how they raised money from investors or why they've avoided it, the challenges they've faced and what they did to overcome them, if they've networked with other entrepreneurs and what they talked about.

This data is amazingly deep, but in the grand scheme of things it's very small. I talked to about 1/3 of the high-tech entrepreneurs in each city who happened to be on a business directory I used. So, when I found really cool things in my interviews, like the fact that most entrepreneurs in Waterloo actively searched through their own social networks to find mentors but those in Ottawa mostly relied on their parents or former business partners to provide business advice, it's hard to say if this is something true for everyone in the city or if it was just a coincidence. There are a few statistical tests to try to figure out what's real and what's an illusion, but they can only go so far.
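To give a sense of what one of those tests looks like, here is a minimal sketch of a two-sided Fisher exact test on a 2x2 table, which suits small samples like mine. The counts in the usage lines are made up for illustration; they are not numbers from my interviews, and the function name is my own.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]].

    Rows could be cities and columns could be "searched own network for
    mentors" versus "relied on parents or former partners".
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        # when the row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Add up every table that is at least as unlikely as the observed one.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts, for illustration only:
# city A: 20 searched their network, 7 relied on family or partners
# city B: 8 searched their network, 18 relied on family or partners
p_value = fisher_exact_two_sided(20, 7, 8, 18)
print(f"p = {p_value:.4f}")
```

A small p-value only says the split between the two cities is unlikely to be chance among the people interviewed. It says nothing about whether the business directory missed a different kind of entrepreneur altogether.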

The new project I'm working on gives me access to fantastic datasets about innovation and economic development in Canada. This includes the famous Dun and Bradstreet directory, which is the biggest dataset I've ever played with. Clocking in at 1.5 gigabytes, it contains information on more than 1.5 million Canadian firms. I would consider this to be on the very small end of 'big data.' For someone studying entrepreneurship, this is a godsend. I can now tell you, for instance, that between 2001 and 2006 there were 669 new high-tech firms founded in Toronto* and that the average sales of these firms are around $360,000. I can also make really cool pictures like this, which shows that there is a positive relationship between the proportion of immigrants in a region and the proportion of high-tech firms in every province except Saskatchewan and New Brunswick.

But as I work more and more with this data, I'm beginning to see its limitations. I know things about a whole lot of firms, but I don't know much about them. With the D&B data, I essentially know a firm's name, its address, what year it was founded, what industry it's in, how many employees it has and a guess about its sales number. In aggregate, these data can tell me many things: which regions have the most startups, which industries seem to grow the fastest, what the relationship is between workers and sales across the entire country. But it also raises lots of questions that the data can never answer.
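The aggregate questions above boil down to simple counting and averaging. This is a rough sketch, assuming each firm is a record with the handful of fields just listed; the field names and the sample records are invented for illustration and are not the actual D&B layout or data.

```python
from collections import Counter, defaultdict

# Hypothetical records; field names are assumptions, not the D&B schema.
firms = [
    {"region": "Toronto", "founded": 2003, "industry": "software", "employees": 4, "sales": 400_000},
    {"region": "Toronto", "founded": 2005, "industry": "consulting", "employees": 1, "sales": 120_000},
    {"region": "Ottawa", "founded": 2002, "industry": "software", "employees": 10, "sales": 900_000},
    {"region": "Ottawa", "founded": 1995, "industry": "hardware", "employees": 50, "sales": 6_000_000},
]


def startups_by_region(records, first_year, last_year):
    """Count firms founded between first_year and last_year, per region."""
    return Counter(
        r["region"] for r in records if first_year <= r["founded"] <= last_year
    )


def sales_per_employee_by_industry(records):
    """Total sales divided by total employees, per industry."""
    sales = defaultdict(int)
    workers = defaultdict(int)
    for r in records:
        sales[r["industry"]] += r["sales"]
        workers[r["industry"]] += r["employees"]
    return {k: sales[k] / workers[k] for k in sales if workers[k] > 0}


print(startups_by_region(firms, 2001, 2006))
print(sales_per_employee_by_industry(firms))
```

Every question this can answer is some combination of those few columns. Anything about motives, networks or decisions simply isn't in the records to be counted.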

Looking at one record at random, I know that Bait Consulting Inc. of Thornhill is a consulting company that was formed in 2001 and which has one employee and an estimated $120,000 in sales. But unlike in my dissertation research, I don't know anything more. I don't know why the company was founded, and I don't know why it was founded in Thornhill instead of Toronto or Mississauga or Cambridge. I don't know how its founder learns about the market or finds new customers. It's difficult to figure out if a government policy is working from this data, or how an entrepreneur is affected by where they live.

That's the big difference between big data and what I'd call deep data. Big data can tell you a small number of things about a whole lot of things. You can do a whole lot with this, but you always need to be aware of what it's not telling you. Only so many different questions can be asked on surveys: the more you ask, the fewer people will respond.

Qualitative data collected through long, semi-structured interviews is deep data. I know a lot about the people I talked to. Not everything, and many of the responses are biased by the respondent wanting me to think they are really skilled entrepreneurs. But I know more than a binary variable: I know what they did, why they did it, and what that has caused. I can understand what practices they took to start and grow their firm and relate those back to their larger cultural context. But again, there's that tradeoff: I know a lot about a very small number of people. And I have it easy. People doing ethnography or observational research will have hundreds and hundreds of hours of recordings or notes about an even smaller range of people.

It would be nice to think that we can meet in the middle, but working with big qualitative datasets requires a totally different set of skills than working with big quantitative datasets. Very few people are equally able to produce a grounded analysis of a collection of interviews and a Bayesian analysis of a census dataset. But there is value in each, and the challenge is being able to figure out the right way to collect data to solve a problem. The platonic ideal is for quantitative and qualitative data to be used together to prove a larger point, but this kind of research is expensive and rare. Still, it might be the only way to get a real sense of what's going on in the world around us.

*This seems really low to me, and I'm already working with librarians and others to figure out what proportion of all firms the D&B directory accounts for.