## For my new followers here

I noticed that this blog has attracted several followers in the last few days. I suspect this is because this (neglected) blog is under the same WordPress account as my main blog, dimitriosdiamantaras.me. If you are following me on cogiddo.wordpress.com (the blog I am writing this post on) because you want to see my new photos and other posts, I thank you and suggest that you follow dimitriosdiamantaras.me instead.

## Laurence J. Peter quotation on economists

### Quotation #1233 from Michael Moncur’s (Cynical) Quotations:

An economist is an expert who will know tomorrow why the things he predicted yesterday didn’t happen today.
Laurence J. Peter
US educator & writer (1919 – 1988)

As I say all too often, economists cannot predict. Beware of those economists who say they can. They have a bridge to sell you, a beautiful bridge in Brooklyn. Economics needs several more centuries of research before it can develop decent predictive capabilities, and I wouldn’t bet it will even then.

(Once in a while I am really glad I follow “quotes of the day”.)

Ever wanted to have a robot to do your research for you? If you are a scientist, you have almost certainly had this dream. Now it’s a real option: Eureqa, a program that distills scientific laws from raw data, is freely available to researchers.

The program was unveiled in April, when it used readouts of a double-pendulum to infer Newton’s second law of motion and the law of conservation of momentum. It could be an invaluable tool for revealing other, more complicated laws that have eluded humans. And scientists have been clamoring to get their hands on it.

“We tend to think of science as finding equations, like E=mc², that are simple and elegant. But maybe some theories are complicated, and we can only find the simple ones,” said Hod Lipson of Cornell University’s Computational Synthesis Lab. “Those are unreachable right now. But the algorithms we’ve developed could let us reach them.”

Eureqa is descended from Lipson’s work on self-contemplating robots that figure out how to repair themselves. The same algorithms that guide the robots’ solution-finding computations have been customized for analyzing any type of data.

The program starts by searching within a dataset for numbers that seem connected to each other, then proposing a series of simple equations to describe the links. Those initial equations invariably fail, but some are slightly less wrong than others. The best are selected, tweaked, and again tested against the data. Eureqa repeats the cycle over and over, until it finds equations that work.
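The select, tweak, and retest cycle described above can be illustrated with a toy evolutionary search. This is only a sketch under strong assumptions, not Eureqa's actual algorithm: the function name `evolve_power_law` is hypothetical, and the equation form is fixed to y = a·x^p so that only the two numbers a and p evolve, whereas the real program also searches over the form of the equation.

```python
import random


def evolve_power_law(xs, ys, generations=200, population=30, seed=0):
    """Toy select-tweak-test loop fitting y = a * x**p (not Eureqa itself)."""
    rng = random.Random(seed)

    def error(candidate):
        a, p = candidate
        return sum((a * x ** p - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

    # Start from random candidates; nearly all of them are badly wrong.
    pool = [(rng.uniform(-5, 5), rng.uniform(0, 4)) for _ in range(population)]
    for _ in range(generations):
        pool.sort(key=error)
        survivors = pool[: population // 3]  # keep the least-wrong third
        children = []
        while len(survivors) + len(children) < population:
            a, p = rng.choice(survivors)
            # Tweak a surviving candidate slightly and put it back in the pool.
            children.append((a + rng.gauss(0, 0.1), p + rng.gauss(0, 0.1)))
        pool = survivors + children
    best = min(pool, key=error)
    return best, error(best)


# Made-up "measurements": distance fallen after t seconds, d = 4.9 * t**2.
times = [0.5 * k for k in range(1, 11)]
distances = [4.9 * t ** 2 for t in times]
(a, p), err = evolve_power_law(times, distances)
print(f"best candidate: d = {a:.2f} * t^{p:.2f}, mean squared error {err:.4f}")
```

Because the best candidates always survive into the next round, the error of the best equation can never get worse from one generation to the next.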

What took Newton years to calculate, Eureqa returned in a few hours on a decent desktop computer. Lipson and other researchers hope Eureqa can perform the same wizardry with data that now defies scientists, especially those working at the frontiers of biology, where genomes, proteins and cell signals have proven fantastically difficult to analyze. Their interactions appear to follow rules that traditional analytical methods can’t easily reveal.

“There’s a famous quote by Emerson Pugh: ‘If the human brain were so simple that we could understand it, we would be so simple that we couldn’t.’ I think that applies to all of biology,” said John Wikswo, a Vanderbilt University biophysicist who’s using the Eureqa engine in his own lab. “Biology is complicated beyond belief, too complicated for people to comprehend the solutions to its complexity. And the solution to this problem is the Eureqa project.”

Lipson made Eureqa available for download early in November, after being overwhelmed by requests from scientists who wanted him to analyze their data. In the meantime, he and Michael Schmidt, a Cornell University computational biologist responsible for much of Eureqa’s programming, continue to develop it.

An ongoing challenge is the tendency of Eureqa to return equations that fit data, but refer to variables that are not yet understood. Lipson likened this to what would happen if time-traveling scientists presented the laws of energy conservation to medieval mathematicians.

“Algebra was known. You could plug in the variable, and it would work. But the concept of energy wasn’t there. They didn’t have the vocabulary to understand it,” he said. “We’ve seen this in the lab. Eureqa finds a new relationship. It’s predictive, it’s elegant, it has to be true. But we have no idea what it means.”

Lipson and Schmidt are now devising “algorithms to explain what our algorithm is finding,” perhaps by relating unknown concepts to simpler, more familiar terms. “How do you explain something complicated to a child? That’s what it involves,” said Lipson. “It’s machine teaching, rather than machine learning.”

One set of incomprehensibly meaningful discoveries comes from Eureqa’s analysis of cellular readouts gathered by Gurol Suel, a University of Texas Southwestern molecular microbiologist who studies how cells divide and grow. But even if Eureqa can’t yet explain what it found, it’s still useful, said Suel.

“You can use this as a starting point for further investigations. It lets you think about new ideas of what’s going on in the cell, and generate new hypotheses about the properties of biological systems,” said Suel.

Sometimes Eureqa will require more data than it’s given before finding answers. In those cases, the program may be able to identify information gaps, and recommend experiments to fill them.

That functionality is included in the latest build of the program, and is being taken even further in a new Lipson-Wikswo project. They’re hooking a version of Eureqa directly to Wikswo’s experimental gadgetry.

“The program is going to adjust the valves, feeding different nutrients and toxins to the cells,” and it does this faster than any researcher, said Wikswo. “It comes up with the equations, plus the experiments needed to come up with the equations. It’s Eureqa on steroids.”

According to Wikswo, who studies the effects of cocaine on white blood cells, Eureqa can propose experiments that researchers would have difficulty imagining.

“In most of science, you try to keep everything constant except for one variable. You turn one knob at a time, and see how the system responds. That’s wonderful for linear systems,” he said. “But most biology is complex and non-linear. Emergent behaviors are very hard to understand unless you turn many knobs at a time, and we can’t figure out which knobs to turn. So we’re going to let Eureqa pick them.”

The Cornell team hasn’t counted downloads of their program, but it’s likely being used by researchers outside biology. As long as data fits on a spreadsheet, Eureqa can analyze it.

“In the past year, people have contacted us with some wild application ideas,” said Schmidt. “Everything from predicting the stock market to modeling the herding of cows.”

Images: 1) Hod Lipson running Eureqa in his office. 2) Diagrams of information flow through one of Lipson’s self-repairing robots (left) and Eureqa (right).

Brandon Keim’s Twitter stream and reportorial outtakes; Wired Science on Twitter. Brandon is currently working on a book about ecosystem and planetary tipping points.

Well. How about we just give up on science as a human enterprise?

## More on Type M errors in statistical analyses

A bit earlier, I was intrigued by a blog post by Columbia Statistics and Political Science professor Andrew Gelman about “Type M” errors in statistical analyses. A Type M error is an overestimate of the strength of the relationship between two variables. Such an error is most likely when the sample is too small to measure a weak relationship reliably.

I can try to explain this to you now because I have read “Of Beauty, Sex and Power” by Andrew Gelman and David Weakliem (American Scientist, Volume 97, 310-316, 2009). I found the text of the article by following a link in the Gelman post I quoted earlier. I think I now understand a little of what’s going on here, and I really enjoyed reading the article.

Suppose there are two variables I care to study with an eye to whether they are related. Perhaps I have a theory, based on a hypothesis from evolutionary psychology, that “Beautiful parents have more daughters”. (In fact, Gelman and Weakliem wrote their article after being prompted by a paper with this very title, and some other papers by the same author, published in the prestigious Journal of Theoretical Biology.) Let’s call these variables X and Y (behold the poverty of my imagination).

Let’s also suppose that there is in fact a relationship between these variables, but very small in magnitude. As a researcher, I do not know this relationship but I want to discover it and make my name based on the discovery. What do I do then? I go after data sets that contain variables X and Y and try some statistical estimation techniques, looking for a number to indicate how strongly the variables are related. Classical statistical methodology tells me to estimate not only that number, but also an interval around my estimate that gives an idea of the error of my estimation. This is called a “confidence interval”. (Gelman and Weakliem also explain how this argument goes if I were to use Bayesian estimation techniques, for those in my vast* readership who know what these are.) Roughly speaking, if I have done my stats well, and do the same estimation work with 100 different data sets, then the true value of the number I am after will be in about 95 of the 100 confidence intervals that I will find.
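For readers who like to check such claims by brute force, here is a small simulation sketch of the "95 out of 100 intervals" statement. Everything in it is an assumption made for illustration: the function name `coverage`, the normally distributed data, the true mean of 0.3, and the sample size of 100 per study.

```python
import random
import statistics


def coverage(true_mean=0.3, sd=1.0, n=100, studies=2000, seed=1):
    """Fraction of simulated studies whose 95% interval contains the truth."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(studies):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        mean = statistics.fmean(sample)
        se = statistics.stdev(sample) / n ** 0.5  # estimated standard error
        # Normal-approximation 95% confidence interval around the estimate.
        if mean - 1.96 * se <= true_mean <= mean + 1.96 * se:
            hits += 1
    return hits / studies


print(f"share of intervals containing the true value: {coverage():.3f}")
```

The share should come out close to 0.95, with a little wobble from one seed to the next.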

But here’s the rub. What I really am testing, if I am doing classical statistics, is whether the number I want to estimate can be shown (with 95 percent confidence) to be different from some a priori estimate (the “null hypothesis”). For a relationship that is very small, presumably any previous evidence will have shown it is small, and perhaps would have shown conflicting results about the sign of the relationship: some studies would have found it negative, some positive. So I should have as my null hypothesis that X and Y are unrelated.

Now let’s say that the relationship coefficient I am trying to estimate is in fact equal to 0. I do not know this, of course. If I do 100 independent studies to estimate this coefficient, then I can expect about 5 of them to indicate to me that the coefficient is significantly different from zero; all of the 5 would be misleading. But concluding that the correlation I want to find is in fact not there is not exciting, and will get me no fame. If I find one of the erroneous “significant” results, on the other hand, I will send my study to a prestigious journal, talk to some reporters, and maybe even write a book about it. All of the noise thus generated would be good for my name recognition. But I would still be wrong, having infinitely overestimated the coefficient of interest.
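A quick sketch of this thought experiment follows, assuming each study's estimate is a normal draw centered on the true value 0. The function name `significant_by_chance` and the standard error of 4.3 are illustrative choices, not anything taken from Gelman and Weakliem's data.

```python
import random


def significant_by_chance(studies=100, se=4.3, seed=2):
    """Estimates that look 'significant' even though the true value is 0."""
    rng = random.Random(seed)
    estimates = [rng.gauss(0.0, se) for _ in range(studies)]
    # An estimate counts as significant when it lies more than 1.96 standard
    # errors away from zero (the usual two-sided 5 percent test).
    return [e for e in estimates if abs(e) > 1.96 * se]


lucky = significant_by_chance()
print(f"{len(lucky)} of 100 studies look significant:",
      [round(e, 1) for e in lucky])
```

Every estimate that passes the test is, by construction, at least 1.96 standard errors away from the true value of zero, so the studies most likely to be published are exactly the ones that are most wrong.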

The same kind of error could arise if the true relationship was in fact positive. Say the coefficient was not 0 but instead 0.3 percent, and my data allowed me an estimate with a standard error of 4.3 percent. Then I would have a 3 percent probability of estimating a positive coefficient that would appear statistically significant and, perhaps worse, a 2 percent probability of estimating a _negative_ coefficient that would appear statistically significant. I could even be strongly convinced, then, about the wrong sign of my coefficient! Whichever of these two errors I fall into, the estimated coefficient will be more than an order of magnitude larger in absolute value than the true coefficient, because an estimate must be at least 1.96 × 4.3 ≈ 8.4 percent in absolute value to appear significant. This is why we are talking about Type M errors; M stands for magnitude, indeed. (Well, we also saw a Type S error in this example, when the sign of the estimated coefficient was wrong.)
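The 3 percent and 2 percent figures can be reproduced directly from the normal distribution. The sketch below assumes the estimate is normally distributed around the true value and that significance means lying more than 1.96 standard errors from zero; the function name `significance_probabilities` is made up for this post.

```python
from statistics import NormalDist


def significance_probabilities(true_effect, se, z=1.96):
    """Chance of a significant positive or negative estimate, plus the
    smallest factor by which a significant estimate exaggerates the truth."""
    sampling = NormalDist(mu=true_effect, sigma=se)
    threshold = z * se  # smallest absolute estimate that looks significant
    p_positive = 1.0 - sampling.cdf(threshold)
    p_negative = sampling.cdf(-threshold)
    # Only meaningful when the true effect is not exactly zero.
    exaggeration = threshold / abs(true_effect)
    return p_positive, p_negative, exaggeration


p_pos, p_neg, factor = significance_probabilities(true_effect=0.3, se=4.3)
print(f"P(significant and positive) = {p_pos:.3f}")
print(f"P(significant and negative) = {p_neg:.3f}")
print(f"a significant estimate is at least {factor:.0f} times too large")
```

The two probabilities come out at roughly 0.03 and 0.02, matching the numbers in the text, and the exaggeration factor is well above ten.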

Is there an escape from this trap? More data would help expose my error. The more data I base my estimation on, the greater the so-called “statistical power” of my testing procedure, and the less likely I will be to fall into error. When variables have small but real correlations, as often happens in the medical literature, the data sets frequently contain millions of observations. It is understood by sophisticated scientists that you need a lot of power (a lot of data) to tease out small effects.
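To see how power grows with the amount of data, here is a rough sketch. It assumes the standard error shrinks like sd/√n and, purely for illustration, a per-observation standard deviation of 50 (in the same percent units as the 0.3 effect above); neither the function name `power` nor these numbers come from the article.

```python
from statistics import NormalDist


def power(true_effect, sd, n, z=1.96):
    """Probability that a study with n observations finds a significant result."""
    se = sd / n ** 0.5  # the standard error shrinks as the sample grows
    sampling = NormalDist(mu=true_effect, sigma=se)
    # Two-sided test: significant if the estimate lands in either tail.
    return (1.0 - sampling.cdf(z * se)) + sampling.cdf(-z * se)


for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: power = {power(0.3, 50.0, n):.3f}")
```

With a hundred observations the test detects the effect barely more often than the 5 percent false-alarm rate; with a million it almost never misses it.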

What can we conclude from this? Besides the obvious value of skepticism when assessing the value of any statistical finding, we should also realize that not all studies that use statistics are created equal. Some have more power than others, and we should trust their results more. And that’s why “more research is needed” is such a refrain in discussions of studies on medical or social questions. I know “more research is needed” is also a plea for funds, and should always be met with the aforementioned skepticism, but bigger data sets do give us the power of more secure conclusions.

—-
*This poor attempt at irony is also an example of a particular Type M error, this one about the correlation of the variable “the size of the set of readers of my blog” and “vast, for not ridiculously small values of ‘vast’”. I hope you’ve heard some variation of the joke that goes something like “It is true that I have made only two mistakes in my life, for very large values of ‘two’”.