
This Startup Is Turning Farmers’ Weather Intuition Into A Big (Data) Business


Meteorologist Steven Bennett used to predict the weather for hedge funds. Now his startup EarthRisk forecasts extreme cold and warm events for energy companies and utilities up to four weeks ahead, much further out than traditional forecasts. The company has compiled 30 years of temperature data for nine regions and discovered patterns which predict extreme heat or cold. If the temperature falls in the hottest or coldest 25% of the historical temperature distribution, EarthRisk defines it as an extreme event, and the company's energy customers can trade power or set their pricing based on its predictions. The company's next challenge is researching how to extend hurricane forecasts from the current 48 hours to up to 10 days.
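As a back-of-the-envelope illustration of that definition (with invented temperature data; EarthRisk's nine regional datasets are not public), the extreme-event thresholds are just the 25th and 75th percentiles of the historical distribution:

```python
import numpy as np

# Invented daily temperatures (deg F) for one region over ~30 years.
rng = np.random.default_rng(seed=0)
temps = rng.normal(loc=55, scale=15, size=30 * 365)

# "Extreme" = hottest or coldest 25% of the historical distribution.
cold_threshold = np.percentile(temps, 25)
hot_threshold = np.percentile(temps, 75)

def is_extreme(temp):
    """Flag a temperature as an extreme event under the 25% rule."""
    return temp <= cold_threshold or temp >= hot_threshold

print(f"cold below {cold_threshold:.1f}F, hot above {hot_threshold:.1f}F")
```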

How is your approach to weather forecasting different from traditional techniques?

Meteorology has traditionally been pursued along two lines. One line has a modeling focus and has been pursued by large government or quasi-government agencies. It puts the Earth into a computer-based simulation and that simulation predicts the weather. That pursuit has been ongoing since the 1950s. It requires supercomputers, a lot of resources (the National Oceanic and Atmospheric Administration in the U.S. has spent billions of dollars on its simulation), and a tremendous amount of data to input to the model. The second line of forecasting is the observational approach. Farmers were predicting the weather long before there were professional meteorologists, and the way they did it was through observation. They would observe, for example, that if the wind blew from a particular direction, fair weather followed for several days. We take the observational approach--the database which was in the farmer's head--but we quantify all the observations in a strictly statistical computer model rather than a dynamic model of the type the government uses. We quantify, we catalog, and we build statistical models around these observations. We have created a catalog of thousands of weather patterns which have been observed since the 1940s and how those patterns tend to link to extreme weather events one to four weeks after the pattern is observed.
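A toy version of that statistical cataloging, with invented data, might simply count how often an extreme event follows each observed pattern at a given lead time:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
days = 20_000  # roughly 55 years of daily observations

# Invented boolean series: was pattern X active on day t, and did
# an extreme cold event occur 21 days (three weeks) later?
pattern_active = rng.random(days) < 0.10
extreme_event = rng.random(days) < 0.25

lead = 21
active_today = pattern_active[:-lead]
event_later = extreme_event[lead:]

# Conditional frequency: P(extreme event in 3 weeks | pattern today),
# compared against the unconditional base rate.
p_given_pattern = event_later[active_today].mean()
p_base = event_later.mean()
print(f"base rate {p_base:.2f}, after pattern {p_given_pattern:.2f}")
```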

Which approach is more accurate?

The model-based approach will result in a more accurate forecast, but because of chaos in the system it breaks down one to two weeks into the forecast. For a computer simulation to be perfect we would need to observe every air parcel on the Earth to use as input to the model. In fact, there are huge swathes of the planet--over the Pacific Ocean, for example--where we don't have any weather observations at all except from satellites. So in the long range our forecasts are more accurate, but not in the short range.

What data analysis techniques do you use?

We are using machine learning to link weather patterns together, to say that when these kinds of weather patterns occur historically they lead to these sorts of events. Our operational system uses a genetic algorithm for combining the patterns in a simple way and determining which patterns are the most important. We use naive Bayes to make the forecast. We forecast, for example, that there is a 60% chance that there will be an extreme cold event in the northwestern United States three weeks from today. If the temperature comes in a quarter of a degree short of that cold-event threshold, then it's not a hit. We are in the process of researching a neural network, which we believe will give us a richer set of outputs. With the neural network we believe that instead of giving the percentage chance of crossing some threshold, we will be able to develop a full distribution of temperature output, e.g., that it will be 1 degree colder than normal.
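Here is a minimal sketch of a naive Bayes forecast of that kind, assuming binary indicators for cataloged patterns as features; the features, training data, and model details are invented for illustration:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(seed=2)

# Invented training set: each row marks which of 50 cataloged weather
# patterns were active, and the label says whether an extreme cold
# event followed three weeks later.
X = rng.integers(0, 2, size=(5000, 50))
y = rng.integers(0, 2, size=5000)

model = BernoulliNB()
model.fit(X, y)

# Probability of an extreme cold event given today's active patterns.
today = rng.integers(0, 2, size=(1, 50))
prob = model.predict_proba(today)[0, 1]
print(f"P(extreme cold event in 3 weeks) = {prob:.0%}")
```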

How do you run these simulations?

We update the data every day. We have a MATLAB-based modeling infrastructure. When we do our heavy processing, we use hundreds of cores in the Amazon cloud. We do those big runs a couple of dozen times a year.

How do you measure forecast accuracy?

Since we forecast extreme events, we use a few different metrics. If we forecast an extreme event and it occurs, that's a hit. If we forecast an extreme event and it does not occur, that's a false alarm. Those numbers alone can be misleading. If I have made one forecast and that forecast was correct, then I have a hit rate of 100% and a false alarm rate of 0%. But if there were 100 events and I only forecast one of them and missed the other 99, that's not useful. The detection rate is the fraction of events that occur which we actually forecast. We try to get both a high hit rate and a high detection rate, but in a long-range forecast a high detection rate is very, very difficult. Our detection rate tends to be around 30% in a three-week forecast. Our hit rate stays roughly the same at one, two, and three weeks. In traditional weather forecasting the accuracy gets dramatically lower the further out you get.
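In a quick sketch (standard forecast-verification counting; EarthRisk's exact definitions may differ), those metrics fall out of a simple tally of hits, false alarms, and misses:

```python
def verification_scores(forecasts, events):
    """forecasts, events: parallel lists of booleans, one per period."""
    hits = sum(f and e for f, e in zip(forecasts, events))
    false_alarms = sum(f and not e for f, e in zip(forecasts, events))
    misses = sum(e and not f for f, e in zip(forecasts, events))

    hit_rate = hits / (hits + false_alarms)    # forecasts that verified
    detection_rate = hits / (hits + misses)    # events that were forecast
    return hit_rate, detection_rate

# Bennett's cautionary case: one correct forecast, 99 missed events.
forecasts = [True] + [False] * 99
events = [True] * 100
print(verification_scores(forecasts, events))  # (1.0, 0.01)
```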

Why do you want to forecast hurricanes further ahead?

The primary application for longer-lead forecasts of hurricane landfall would be in the business community rather than in public safety. For public safety you need to make sure that you give people enough time to evacuate but also have the most accurate forecast. That lead time is typically two to three days right now. If people evacuate and the storm does no damage in that area, or never hits that area, people won't listen the next time a forecast is issued. Businesses understand probability, so you can present a risk assessment to a corporation which has a large footprint in a particular geography. They may have to change their operations significantly in advance of a hurricane, so even a 30% or 40% probability means they need extra lead time.

What data can you look at to provide an advance forecast?

We are investigating whether building a catalog of synoptic (large-scale) weather patterns like the North Atlantic Oscillation will work for predicting hurricanes, especially hurricane tracks--where a hurricane will move. We have quantified hundreds of weather patterns of similar amplitude, hundreds of miles across. For heat and cold risks we develop an index of extreme temperature. For hurricanes the primary input is an index of historic hurricane activity rather than temperature. Then you would use machine learning to link the weather patterns to the hurricane activity. All of this is a hypothesis right now. It's not tested yet.

What’s the next problem you want to tackle?

We worked with a consortium of energy companies to develop this product. It was specifically developed for their use. Right now the problems we are trying to solve are weather related, but that's not where we see ourselves in two to five years. The weather data we have is only an input to a much bigger business problem, and that problem will vary from industry to industry. What we are really interested in is helping our customers solve their business problems. In New York City there's a juice bar called Jamba Juice. Jamba Juice knows that if the temperature gets higher than 95 degrees on a summer afternoon they need extra staff, since more people will buy smoothies. They have quantified the staff increase required (but they schedule their staff one week in advance and they only get a forecast one day in advance). They use a software package with weather as an input. We believe that many businesses are right on the cusp of implementing that kind of intelligence. That's where we expect our business to grow.
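A hypothetical sketch of the kind of rule Bennett describes (all numbers invented): with a week-ahead probabilistic heat forecast, the staffing schedule could be driven directly from the forecast probabilities:

```python
# Add staff on days where the week-ahead forecast gives a high enough
# probability of a 95F+ afternoon. Thresholds and counts are invented.
def staff_for_day(base_staff, p_heat, threshold=0.4, extra=2):
    """Add `extra` staff if the heat probability clears the bar."""
    return base_staff + (extra if p_heat >= threshold else 0)

week_ahead_heat_probs = [0.1, 0.2, 0.55, 0.7, 0.3, 0.45, 0.05]
schedule = [staff_for_day(base_staff=4, p_heat=p) for p in week_ahead_heat_probs]
print(schedule)  # [4, 4, 6, 6, 4, 6, 4]
```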


Keep reading: Who's Afraid of Data Science? What Your Company Should Know About Data

This story tracks the cult of Big Data: The hype and the reality. It’s everything you ever wanted to know about data science but were afraid to ask. Read on to learn why we’re covering this story, or skip ahead to read previous updates.

Take lots of data and analyze it: That’s what data scientists do and it’s yielding all sorts of conclusions that weren’t previously attainable. We can discover how our cities are run, disasters are tackled, workers are hired, crimes are committed, or even how Cupid's arrows find their mark. Conclusions derived from data are affecting our lives and are likely to shape much of the future.


Previous Updates

A roomful of confused-looking journalists is trying to visualize a Twitter network. Their teacher is School of Data “data wrangler” Michael Bauer, whose organization teaches journalists and non-profits basic data skills. At the recent International Journalism Festival, Bauer showed journalists how to analyze Twitter networks using OpenRefine, Gephi, and the Twitter API.

Bauer's route into teaching hacks how to hack data was a circuitous one. He studied medicine and did postdoctoral research on the cardiovascular system, where he discovered his flair for data. Disillusioned with health care, Bauer dropped out to become an activist and hacker and eventually found his way to the School of Data. I asked him about the potential and pitfalls of data analysis for everyone.

Why do you teach data analysis skills to “amateurs”?

We often talk about how the digitization of society allows us to increase participation, but actually it creates new kinds of elites who are able to participate. It opens up the existing elites so you don't have to be an expensive lobbyist or be born in the right family to be involved, but you have to be part of this digital elite which has access to these tools and knows how to use them effectively. It's the same thing with data. If you want to use data effectively to communicate stories or issues, you need to understand the tools. How can we help amateurs to use these tools? Because these are powerful tools.

If you teach basic data skills, is there a danger that people will use them naively?

There is a sort of professional elitism which raises the fear that people might misuse the information. You see this very often if you talk to national bureaus of statistics, for example, who say “We don't give out our data since it might be misused.” When the Open Data movement started in the U.K. there was a clause in the agreement to use government data which said that you were not allowed to do anything with it which might criticize the government. When we train people to work with data, we also have to train them how to draw the right conclusions, how to integrate the results. To turn data into information you have to put it into context. So we break it down to the simplest level. What does it mean when you talk about the mean? What does it mean if you talk about average income? Or does it even make sense to talk about the average in this context?
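A concrete version of that "average income" question, with invented numbers: in a skewed distribution the mean and the median answer different questions, and only one of them describes a typical earner:

```python
import statistics

# Nine modest incomes plus one very large one (invented numbers).
incomes = [28_000, 30_000, 32_000, 33_000, 35_000,
           36_000, 38_000, 40_000, 42_000, 900_000]

print(statistics.mean(incomes))    # 121400 -- dragged up by one earner
print(statistics.median(incomes))  # 35500  -- the "typical" income
```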

Are there common pitfalls you teach people to avoid?

We frequently talk about confusing correlation with causation. We have this problem in scientific disciplines as well. In Freakonomics, economist Steven D. Levitt talks about how crime rates went down when more police were employed, but what people didn't look at was that this all happened in times of economic growth. We see this in medical science too. There was this idea that because women have estrogen they are protected from heart attacks, so you should give estrogen to women after menopause. This was all based on retrospective correlation studies. In the 1990s someone finally did a placebo-controlled randomized trial and discovered that hormone replacement therapy doesn't help at all. In fact it harms the people receiving it by increasing the risk of heart attacks.

How do you avoid this pitfall?

If you don't know and understand the assumptions that your experiment is making, you may end up with something completely wrong. If you leave certain factors out of your model and look at one specific thing, that's the only specific thing you can say something about. There was this wonderful example that came out about how wives of rich men have more orgasms. A university in China got hold of the data for their statistics class and found that the original analysis didn't use the women's education as a parameter. It turns out that women who are more educated have more orgasms. It had nothing to do with the men.
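A sketch of that omitted-variable trap with simulated data (the "education" confounder here is generated, not drawn from the study): the raw correlation looks real, then vanishes once the confounder is controlled for:

```python
import numpy as np

rng = np.random.default_rng(seed=3)
n = 10_000

# Simulated confounder: education drives both partner wealth and the
# outcome; wealth itself has no direct effect at all.
education = rng.normal(size=n)
partner_wealth = education + rng.normal(size=n)
outcome = education + rng.normal(size=n)

# The raw correlation makes wealth look like it matters...
print(np.corrcoef(partner_wealth, outcome)[0, 1])  # ~0.5

# ...but it vanishes once education is controlled for. (Here the true
# effect of education on wealth is 1, so subtracting it residualizes.)
wealth_residual = partner_wealth - education
print(np.corrcoef(wealth_residual, outcome)[0, 1])  # ~0.0
```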

What are the limitations of using a single form of data?

That's one of the dangers of looking at Twitter data. This is the danger of saying that Twitter is democratizing because everyone has a voice--but not everyone has a voice. Only a small percentage of the population uses the service, and a far smaller proportion does most of the talking. A lot of them are just reading or retweeting. So we only see a tiny snapshot of what is going on. You don't get a representative population; you get a skew in your population. There was an interesting study on Twitter and politics in Austria which showed that a lot of the people on there are professionals and they are there to engage. So it's not a political forum. It's a medium for politicians and people who are around politics to talk about what they are doing.

Any final advice?

Integrate multiple data sources, check your facts, and understand your assumptions.


Charts can help us understand the aggregate but they can also be deeply misleading. Here's how to stop lying with charts without even knowing it. While it's counterintuitive, charts can actually obscure our understanding of data--a trick Steve Jobs exploited on stage at least once. Of course, you don't have to be a cunning CEO to misuse charts; in fact, if you have ever used one at all, you probably did so incorrectly, according to visualization architect and interactive news developer Gregor Aisch. Aisch gave a series of workshops at the International Journalism Festival in Italy, which I attended last weekend, including one on basic data visualization guidelines.

“I would distinguish between misuse by accident and on purpose,” Aisch says. “Misuse on purpose is rare. In the famous 2008 Apple keynote, Steve Jobs showed the market share of different smartphone vendors in a 3D pie chart. The Apple slice of the smartphone market, which was one of the smallest, was in front so it appeared bigger.”

Aisch explained in his presentation that 3D pie charts should be avoided at all costs since the perspective distorts the data. What is displayed in front is perceived as more important than what is shown in the background. The 19.5% of market share belonging to Apple takes up 31.5% of the entire area of the pie chart, and the angles are also distorted. The data looks completely different when presented in a different order.

In fact, the humble pie chart turns out to be an unexpected minefield:

“Use pie charts with care, and only to show part of whole relationships. Two is the ideal number of slices, but never show more than five. Don’t use pie charts if you want to compare values. Use bar charts instead.”

For example, Aisch advises that you don't use pie charts to compare sales from different years, but do use them to show sales per product line in the current year. You should also ensure that you don't leave out data on part of the whole:

“Use line charts to show time series data. That’s simply the best way to show how a variable changes over time. Avoid stacked area charts; they are easily misinterpreted.”

The “I hate stacked area charts” post cited in Aisch’s talk explains why:

“Orange started out dominating the market, but Blue expanded rapidly and took over. To the unwary, it looks like Green lost a bit of market share. Not nearly as much as Orange, of course, but the green swath certainly gets thinner as we move to the right end of the chart.”

In fact the underlying data shows that Green’s market share has been increasing, not decreasing. The chart plots the market share vertically, but human beings perceive the thickness of a stream at right angles to its general direction.

Technology companies aren't the only offenders in chart misuse. “Half of the examples in the presentation are from news organizations. Fox News is famous for this,” Aisch explains. “The emergence of interactive online maps has made map misuse very popular, e.g., election results in the United States, where big states like Texas which have small populations are marked red. If you build a map with Google Maps there isn't really a way to get around this problem. But other tools aren't there yet in terms of user interface, and you need special skills to use them.”

Google Maps also uses the Mercator projection, a method of projecting the sphere of the Earth onto a flat surface, which distorts the size of areas closer to the polar regions so, for example, Greenland looks as large as Africa.
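The distortion is easy to quantify: Mercator stretches both east-west and north-south distances by a factor of 1/cos(latitude), so apparent area is inflated by the square of that factor (Greenland's latitude is approximated below):

```python
import math

def mercator_area_inflation(lat_degrees):
    """Factor by which Mercator inflates apparent area at a latitude."""
    return 1 / math.cos(math.radians(lat_degrees)) ** 2

print(mercator_area_inflation(0))   # 1.0 at the equator
print(mercator_area_inflation(72))  # ~10.5 around central Greenland
```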

The solution to these problems, according to Aisch, is to build visualization best practices directly into the tool, as he does in his own open source visualization tool Datawrapper. “In Datawrapper we set meaningful defaults but also allow you to switch between different rule systems. There's an example for labeling a line chart. There is some advice that Edward Tufte gave in one of his books and different advice from Dona Wong, so you can switch between them. We also look at the data, so if you visualize a data set which has many rows, the line chart will display in a different way than if there were just three rows.”


The rush to "simplify" big data is the source of a lot of reductive thinking about its utility. Data scientists have recently been lamenting how the data gold rush is leading naive practitioners to derive misleading or even downright dangerous conclusions from data.

The Register recently mentioned two trends that may reduce the role of the professional data scientist before the hype has even reached its peak. The first is the embedding of Big Data tech in applications. The other is increased training for existing employees who can benefit from data tools.

"Organizations already have people who know their own data better than mystical data scientists. Learning Hadoop is easier than learning the company’s business."

This trend has already taken hold in data visualization, where tools like infogr.am are making it easy for anyone to make a decent-looking infographic from a small data set. But this is exactly the type of thing that has some data scientists worried. Cathy O'Neil (aka MathBabe) has the following to say in a recent post:

"It’s tempting to bypass professional data scientists altogether and try to replace them with software. I’m here to say, it’s not clear that’s possible. Even the simplest algorithm, like k-Nearest Neighbor (k-NN), can be naively misused by someone who doesn’t understand it well."

K-nearest neighbors is a method for classifying objects--say, visitors to your website--by measuring how similar they are to other objects based on their attributes. A new visitor is assigned a class, e.g. "high spenders," based on the class of its k nearest neighbors: the previous visitors most similar to it. But while the algorithm is simple, selecting the correct settings and knowing that you need to scale feature values (or verifying that you don't have many redundant features) may be less obvious.
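A short sketch of the scaling pitfall with scikit-learn, using invented visitor features (pages viewed and annual income): left unscaled, the income column dominates the distance calculation almost entirely:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(seed=4)

# Invented visitor features: [pages viewed, annual income in dollars].
X = np.column_stack([rng.integers(1, 50, size=200),
                     rng.normal(60_000, 15_000, size=200)])
y = rng.integers(0, 2, size=200)  # 1 = "high spender"

new_visitor = [[25, 62_000]]

# Unscaled: the distance metric is dominated almost entirely by income.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(new_visitor))

# Scaled: both features contribute comparably to the distance.
scaler = StandardScaler().fit(X)
knn_scaled = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X), y)
print(knn_scaled.predict(scaler.transform(new_visitor)))
```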

You would not necessarily think about this problem if you were just pressing a big button on a dashboard called “k-NN me!”


Here are four problems that typically arise from a lack of scientific rigor in data projects. Anthony Chong, head of optimization at Adaptly, warns us to look out for "science" with no scientific integrity.

Through phony measurement and poor understandings of statistics, we risk creating an industry defined by dubious conclusions and myriad false alarms.... What distinguishes science from conjecture is the scientific method that accompanies it.

Given the extent to which conclusions derived from data will shape our future lives, this is an important issue. Chong gives us four problems that typically arise from a lack of scientific rigor in data projects, but are rarely acknowledged.

  1. Results not transferable
  2. Experiments not repeatable
  3. Not inferring causation: Chong insists that the only way to infer causation is randomized testing. It can't be done from observational data or by using machine learning tools, which predict correlations with no causal structure.
  4. Poor and statistically insignificant recommendations (a minimal significance check is sketched after this list).
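As a minimal illustration of points 3 and 4, with hypothetical counts: a randomized experiment plus a two-proportion z-test is about the simplest version of the rigor Chong is asking for:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical randomized experiment: conversions out of visitors in
# control (A) and treatment (B) groups.
conv_a, n_a = 120, 2400   # 5.00% conversion
conv_b, n_b = 138, 2400   # 5.75% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"lift {p_b - p_a:.2%}, z = {z:.2f}, p = {p_value:.2f}")
# p is well above 0.05 here: no basis for a recommendation.
```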

Even when properly rigorous, analysis often leads to nothing at all. From Jim Manzi's 2012 book, Uncontrolled: The Surprising Payoff of Trial-and-Error for Business:

"Google ran approximately 12,000 randomized experiments in 2009, with [only] about 10 percent of these leading to business changes.”


Understanding data isn't about your academic abilities—it's about experience. Beau Cronin has some words of encouragement for engineers who specialize in storage and machine learning. Despite all the backend-as-a-service companies sprouting up, it seems there will always be a place for someone who truly understands the underlying architecture. Via his post at O'Reilly Radar:

I find the database analogy useful here: Developers with only a foggy notion of database implementation routinely benefit from the expertise of the programmers who do understand these systems—i.e., the “professionals.” How? Well, decades of experience—and lots of trial and error—have yielded good abstractions in this area.... For ML (machine learning) to have a similarly broad impact, I think the tools need to follow a similar path.


Want to climb the mountain? Start learning about data science here. If you know next to nothing about Big Data tools, HP's Dr. Satwant Kaur's top 10 Big Data technologies is a good place to start. It contains short descriptions of Big Data infrastructure basics, from databases to machine learning tools.

This slide show explains one of the most common technologies in the Big Data world, MapReduce, using fruit, while Emcien CEO Radhika Subramanian tells you why not every problem is suitable for its most popular implementation, Hadoop.

"Rather than break the data into pieces and store-n-query, organizations need the ability to detect patterns and gain insights from their data. Hadoop destroys the naturally occurring patterns and connections because its functionality is based on breaking up data. The problem is that most organizations don’t know that their data can be represented as a graph nor the possibilities that come with leveraging connections within the data."

Efraim Moscovich's Big Data for conventional programmers goes into much more detail on many of the top 10, including code snippets and pros and cons. He also gives a nice summary of the Big Data problem from a developer's point of view.

We have lots of resources (thousands of cheap PCs), but they are very hard to utilize.
We have clusters with more than 10k cores, but it is hard to program 10k concurrent threads.
We have thousands of storage devices, but some may break daily.
We have petabytes of storage, but deployment and management is a big headache.
We have petabytes of data, but analyzing it is difficult.
We have a lot of programming skills, but using them for Big Data processing is not simple.

Infochimps has also created a nice overview of data tools (which features in TechCrunch's top five open-source projects article) and what they are used for.

Finally, GigaOm's programmer's guide to Big Data tools covers an entirely different set of tools, weighted towards application analytics and abstraction APIs for data infrastructure like Hadoop.


We're updating this story as news rolls in. Check back soon for more updates.


[Image: Flickr user Mahalie Stackpole]

