April 10, 2014

Intro to Big Data


Last Friday, I had the chance to attend a panel of career technologists and data scientists who all work in the fascinating field known as “big data”. “Big data” is an umbrella term for data whose qualities are so atypical that classical data analysis techniques cannot even begin to make sense of them. The data can be in any conceivable format and come from any field of study. In news analysis, for instance: “How can readers find the relevant news articles when more arrive in a single day than a person can read in a lifetime?” Or in internet security: “Can login attempt information from internet users across the world predict when a login request is coming from a spam-bot?” Perhaps most interestingly: “Can anonymous medical patient data be leveraged to provide advanced diagnostic capabilities?” (These are all real examples of active research being conducted by members of the panel I attended.)

Clearly these questions come from completely different fields: media, information security, and medicine. Yet they share a unifying theme: the data involved are quite large. “Big” here becomes a general term for any such data. The rule of thumb is that data which meet at least two of the “three V’s” are Big: high Volume, high Velocity, and high Variety of information.

When datasets are as large as in these examples, the amount of information is so vast that traditional computers using classical techniques cannot plumb the troves of data that can be acquired. Formerly, it was possible to set up a single computer to analyze a dataset overnight and examine the results in the morning. For data at this scale, that is no longer possible. Instead, supercomputers, essentially composed of dozens or hundreds of other computers, distribute the workload within and across themselves and run for many thousands of compute-hours. Knowing how to program these supercomputers is itself an art. The growing field of Machine Learning is the study of how to program computers to recognize patterns in data and understand them.
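
To make the idea of distributing a workload a little more concrete, here is a toy sketch in Python. It is only an illustration, not how any of the panelists' systems work: the function names are made up for this example, the "documents" are three short strings, and a pool of threads on one machine stands in for the many separate computers a real cluster would use. The shape is the same, though: split the data into chunks, let each worker process its own chunk, then merge the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Worker step: count the words in one chunk of documents."""
    counts = Counter()
    for document in chunk:
        counts.update(document.lower().split())
    return counts


def split_into_chunks(items, n_chunks):
    """Deal the items into n_chunks roughly equal lists."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def distributed_word_count(documents, n_workers=4):
    """Count words by handing chunks to workers, then merging the results."""
    chunks = split_into_chunks(documents, n_workers)
    # Threads stand in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # merge step: add the partial counts together
    return total


# Hypothetical sample data, invented for this example.
documents = [
    "big data is big",
    "data arrives quickly",
    "data comes in many forms",
]
totals = distributed_word_count(documents, n_workers=2)
print(totals.most_common(2))
```

No single worker ever needs to see the whole dataset, which is what lets this approach scale past what one computer could finish overnight.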

© Jeff Rabinowitz, 2023