Last Friday, I had the chance to attend a panel of career technologists and data scientists who all work in the
fascinating field known as “big data”. “Big data” is an umbrella term for data whose qualities are so atypical
that classical data analysis techniques cannot even begin to make sense of them. The data can be in any
conceivable format and in any field of study. For instance, news analysis asks, “How are news articles relevantly accessed
when more arrive in a single day than a person can read in a lifetime?” Internet security asks, “Can login attempt
information from internet users across the world predict when a potential login request is coming from a spam-bot?”
Perhaps most interestingly, “Can anonymous medical patient data be leveraged to provide advanced diagnostic
capabilities?” (These are all real examples of active research being conducted by members of the panel I attended.)
Clearly these questions come from completely different fields: media, information security, and medicine. Yet they
share a unifying theme: the data involved are quite large. “Big” here becomes a general term for any such data.
The rule of thumb is that data meeting two of the “three V’s” are Big: high Volume, high Velocity, and high Variety of
information.
When datasets are as large as in these examples, there is so much information that traditional computers using
classical techniques cannot plumb the troves of data which can be acquired. Formerly, it was
possible to set up a single computer to analyze information overnight and examine the results in the morning. For data
of this size, that is no longer possible. Instead, supercomputers essentially composed of many dozens or hundreds of
other computers distribute the workload within and across themselves and run for many thousands of compute-hours.
Knowing how to program these supercomputers is itself an art. The growing field of Machine Learning is the study of how
to program computers to recognize patterns in data and understand them.
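
To make the idea of distributing a workload a little more concrete, here is a minimal Python sketch. It is my own toy illustration and not anything the panelists showed: the login records are made up, the function names are hypothetical, and a pool of threads on one machine stands in for the many computers of a real cluster. The pattern is the point: split the data into chunks, let each worker summarize its own chunk, then merge the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Deal the records into n_chunks roughly equal lists."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def count_failures(chunk):
    """Work done by one worker: count failed logins per source address."""
    return Counter(source for source, succeeded in chunk if not succeeded)


def distributed_failure_counts(records, n_workers=4):
    """Split the records, count failures in parallel, then merge the counts."""
    chunks = split_into_chunks(records, n_workers)
    # Threads stand in for separate machines (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_failures, chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # add each worker's counts into the total
    return total


# Hypothetical login attempts: (source address, whether the login succeeded).
attempts = [
    ("10.0.0.1", False),
    ("10.0.0.2", True),
    ("10.0.0.1", False),
    ("10.0.0.3", False),
    ("10.0.0.1", True),
]
totals = distributed_failure_counts(attempts, n_workers=2)
```

Real systems also have to move the data between machines and cope with workers that fail partway through, which is a large part of why programming them is considered an art.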