Why is Big Data so painful to deal with?

Big Data really is the latest hype on your managers’ lips.

“Why do we not use our vast amount of data in a more efficient way?
We need to be just as smart with our data as Google, Amazon, Facebook and Apple are!
When will we have a clear Big Data strategy?”

These are the kind of questions you have probably read from top management in the almighty monthly corporate newsletter.

But even if everyone clearly understands that Big Data is an observable phenomenon, namely the exponential growth of the data acquired by all organisations, that alone doesn’t explain what we should do with it.

This snake-biting-its-own-tail question has existed since the adolescence of computer science: after the Founding Fathers (Von Neumann, Turing, Codd, etc.), the field quickly took a sharp turn towards “answering” developers’ and end-users’ needs rather than proactively proposing meaningful concepts to deal with real-world challenges.

Starting in the 1980s, this pattern became very clear in programming-language engineering: object-oriented programming triumphed because it made everything simpler for developers, but no one ever really asked where you might find concepts such as classes, encapsulation, inheritance, overloading or polymorphism in the real world.

Analysts and (mostly) designers then had to fit these concepts into systems’ blueprints: this is how the Unified Modelling Language (UML), with its thousands of pages of confusing (at best) or outright contradictory instructions, finally came into being.

Comparing the inception of UML with Codd’s relational model is almost too good a case study to be true. On the one hand, you get around 2,000 pages of descriptions and recommendations on how to draw a given diagram for a possible “use case”, trying really hard to consistently keep up with the new techniques created almost yearly by software engineering.

On the other, you get a concise, mathematically proven model, described in an article of fewer than 20 pages written in the 1970s, on how to obtain the best possible data structures, along with relational algebra, a formal way of manipulating them.

Quite apart from the fact that relational database systems make the comforts of modern life possible (you know: your banking, most internet content, etc.), the model is completely agnostic to software-engineering techniques and can be designed using Entity/Relationship modelling, which takes only a few hours to understand.

Mathematics rules! It always has and always will. It makes a theory sound and its applications straightforward, as long as you can understand its equations.

And this brings us back to Big Data, which raises two challenges:

  1. Storing and processing unstructured data (natural language, sound, video, etc.) mostly triggered by “social” or web X.0 activities (with X ≥ 0);
  2. A combinatorics problem.

I shall leave the first for a future blog post. The second is rather simple to explain: take a local supermarket with 300 products on its shelves. It won’t really generate “Big Data” in the now-common sense of petabytes or more.

But if you try to analyse a simple cross-selling question (a.k.a. an “association rule”):

“if a customer buys product X, then they also buy product Y”

You need to store table records where each customer is a row and each product a column, with a binary outcome: 0 = no, 1 = yes.
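As a toy illustration (the products, customers and the support/confidence measures below are my own additions, not part of the original argument), such a table and a single rule check might look like this in Python:

```python
import pandas as pd

# Hypothetical customer x product purchase matrix: 1 = bought, 0 = not bought
purchases = pd.DataFrame(
    {"bread": [1, 0, 1, 1],
     "milk":  [1, 1, 0, 1],
     "beer":  [0, 1, 0, 1]},
    index=["cust_1", "cust_2", "cust_3", "cust_4"],
)

# Standard measures for the rule "bread => milk"
support = ((purchases["bread"] == 1) & (purchases["milk"] == 1)).mean()
confidence = support / (purchases["bread"] == 1).mean()
print(f"support = {support:.2f}, confidence = {confidence:.2f}")  # 0.50, 0.67
```

With only three products, enumerating every combination is trivial; the problem starts when you scale the number of columns up to 300.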

Well, that’s 2³⁰⁰ possible outcomes per row. Could a few Intel Core i7 CPUs deal with that by pure combinatorial exploration?

No, they wouldn’t: this number is bigger than the number of atoms in the currently known universe.
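The arithmetic is easy to check, taking the commonly quoted estimate of roughly 10^80 atoms in the observable universe:

```python
# 2^300 possible binary combinations over 300 products,
# versus ~10^80 atoms in the observable universe (common estimate)
combinations_count = 2 ** 300
atoms_estimate = 10 ** 80
print(len(str(combinations_count)) - 1)     # ~90, i.e. 2^300 is about 10^90
print(combinations_count > atoms_estimate)  # True
```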

 Now, imagine the combinatorial potential for Wal-Mart or Amazon …

That’s why we need specific algorithms for Big Data analytics: traditional statistics was not designed to cope with this combinatorial space, nor to search for “something” when we don’t know the question beforehand.

We need mathematics here to guide us in reducing the lattice of all possibilities to something computationally tractable.

That’s number and graph theory, and it’s not for the faint-hearted. It’s somewhere between top-class engineering and research.

The Apriori algorithm was a pioneer in the field of association-rule mining, but research is still very active: when Apriori was published, social networks didn’t exist, nor did local corner shops have computerised tills.
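For readers who want to see the lattice-pruning idea in action, here is a minimal, didactic sketch of the Apriori principle in Python. The key property it exploits is downward closure: an itemset can only be frequent if all of its subsets are frequent, which is what lets you avoid enumerating all 2^n combinations. The basket data and the minimum-support threshold are invented for illustration, and a real implementation would add further candidate pruning and far more efficient counting.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return frequent itemsets as {frozenset: support}, exploring the lattice level by level."""
    n = len(transactions)
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}      # level-1 candidates: single items
    frequent = {}
    while current:
        # Count each candidate's support in one pass over the transactions
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(survivors)
        # Downward closure: next-level candidates are built only from frequent itemsets
        keys = list(survivors)
        size = len(keys[0]) + 1 if keys else 0
        current = {a | b for a, b in combinations(keys, 2) if len(a | b) == size}
    return frequent

# Toy basket data (hypothetical)
baskets = [frozenset(t) for t in
           ({"bread", "milk"}, {"bread", "beer"}, {"bread", "milk", "beer"}, {"milk"})]
print(apriori(baskets, min_support=0.5))
```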

Big Data is not only big in storage space: combinatorially, it’s exponentially bigger than your <whatever>bytes of storage!
Don’t get lost in combinatorics: apply to Data ScienceTech Institute programmes!

About the Author

Sébastien Corniglion

CEO for Teaching and Research, Data ScienceTech Institute