Big Data is the latest buzzword on your managers’ lips.

*“Why don’t we use our vast amounts of data more efficiently?”*

*“We need to be just as smart with our data as Google, Amazon, Facebook and Apple!”*

*“When will we have a clear Big Data strategy?”*

These are the kinds of questions you may read from your top management in the almighty monthly corporate newsletter.

But even if everyone clearly understands that **Big Data** is an observable phenomenon, the **exponential growth** of the **data** acquired by all organisations, that still doesn’t explain **what we should do with it**.

This snake-biting-its-own-tail question has existed since the adolescence of computer science: after the Founding Fathers *(von Neumann, Turing, Codd, etc.)*, the field quickly took a sharp turn towards *“answering”* the needs of both developers and end users rather than *proactively proposing* **meaningful concepts** to deal with **real-world challenges**.

Starting in the 1980s, this pattern became very clear in programming-language engineering: *object-oriented programming* *triumphed* because it made everything *simpler* for *developers*, but no one ever really wondered **where you might find** concepts such as *classes, encapsulation, inheritance, overloading or polymorphism* in the **real world**.

Analysts and *(mostly)* designers then had to fit these concepts into systems’ blueprints: thus the *Unified Modeling Language (UML)*, with its thousands of pages of confusing *(at best)* or contradictory instructions, finally came into being.

**Comparing** the inception of **UML** with **Codd’s relational model** is almost **too good a case study to be true**. On the one hand, you get around 2,000 pages of descriptions and recommendations on how to draw a given diagram for a possible *“use case”*, trying really hard to keep up consistently with the new techniques software engineering creates almost yearly.

On the other, you get a **concise**, **mathematically proven** **model**, described in a less-than-**20-page article** written in 1970, on how to get the **best possible data structures**, along with *relational algebra*, a formal way of manipulating them.

Setting aside the fact that relational database systems make the comforts of modern life possible *(you know: your banking, most of the internet’s content, etc.)*, the model is completely *agnostic* to software-engineering techniques and can be designed using *Entity/Relationship* modelling, which takes only a *few hours to understand*.

**Mathematics rules**! It always has and always will. It makes a theory sound and its applications straightforward, as long as you can understand its equations.

And this brings us to Big Data. **Big Data raises two challenges**:

- Storing and processing unstructured data (natural language, sound, video, etc.), mostly generated by “social” or web X.0 activities *(with X ≥ 2)*;
- A combinatorics problem.

I shall leave the first as a discussion for a future blog post. But the second is rather simple to explain: take a local supermarket with 300 products on its shelves. It won’t really generate *“Big Data”* as currently understood, i.e. **petabytes** or above.

But if you try to analyse a simple cross-selling question *(a.k.a. an “association rule”)*:

*“if customer buys product X then he buys product Y”*

You need to store table records where each customer is a row and each product a column, with a binary outcome: 0 = no, 1 = yes.
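A toy version of this table makes the setup concrete. Below is a minimal Python sketch with made-up customers and three hypothetical products, including a naive check of one association rule:

```python
# A toy customer-by-product binary table (hypothetical data).
# Each row is one customer's basket; each column answers
# "did they buy it?" (1 = yes, 0 = no).
products = ["bread", "milk", "butter"]  # 3 of the supermarket's 300 products

baskets = {
    "customer_1": [1, 1, 0],  # bread and milk
    "customer_2": [1, 0, 1],  # bread and butter
    "customer_3": [1, 1, 1],  # all three
}

# Naive check of one rule: "if a customer buys bread, they buy milk".
bread, milk = products.index("bread"), products.index("milk")
bread_buyers = [row for row in baskets.values() if row[bread] == 1]
also_milk = sum(row[milk] for row in bread_buyers)
print(f"{also_milk}/{len(bread_buyers)} bread buyers also bought milk")  # 2/3
```

With 300 columns, though, there are far too many candidate rules to check one by one, which is exactly the problem the next paragraph quantifies.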

Well, that’s **2^300 possible outcomes per row**. And would a few Intel Core i7 CPUs deal with that through pure combinatorial exploration?

**No**, they wouldn’t: this number is **bigger** than the number of **atoms** in the currently **known universe**.
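The arithmetic behind this claim is easy to verify in a few lines (the atom count below is the commonly cited order-of-magnitude estimate, roughly 10^80):

```python
# Sanity check: 2^300 possible baskets versus ~10^80 atoms in the known universe.
possible_baskets = 2 ** 300
atoms_in_universe = 10 ** 80  # common order-of-magnitude estimate

print(f"2^300 is roughly 10^{len(str(possible_baskets)) - 1}")  # roughly 10^90
print(possible_baskets > atoms_in_universe)                     # True
```

So even enumerating the possible rows is hopeless, never mind scoring every candidate association rule over them.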

**Now, imagine the combinatorial potential for Wal-Mart or Amazon …**

That’s why we **need** **specific algorithms** to deal with **Big Data analytics**. **Traditional statistics** is designed **neither** to handle this **combinatorial space** **nor** to search for *“something”* when **we don’t know the question beforehand**.

We need mathematics here to guide us in **reducing the lattice** of all **possibilities** to something **computationally tractable**.

That’s number and graph theory, and it’s not for the faint-hearted. It’s somewhere between top-class engineering and research.

The **Apriori** algorithm was a pioneer in the field of *“association-rule mining”*, but research is still very active: at the time of Apriori’s release, social networks didn’t exist, nor did local corner shops have computerised tills.
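To make the pruning idea concrete, here is a deliberately simplified sketch in the spirit of Apriori, using made-up baskets. The real algorithm adds smarter candidate generation and counting, but the downward-closure principle, that no superset of an infrequent itemset can be frequent, is what collapses the 2^n lattice:

```python
# A toy sketch of Apriori's core idea: grow itemsets level by level,
# keeping only those whose support clears a threshold. Supersets of
# discarded itemsets are never generated, which prunes the lattice.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]
min_support = 2  # an itemset must appear in at least 2 baskets

def frequent_itemsets(baskets, min_support):
    items = {item for basket in baskets for item in basket}
    # Level 1: frequent single items.
    levels = [{frozenset([i]) for i in items
               if sum(i in b for b in baskets) >= min_support}]
    k = 2
    while levels[-1]:
        # Candidates of size k are built from frequent (k-1)-itemsets only.
        candidates = {a | b for a in levels[-1] for b in levels[-1]
                      if len(a | b) == k}
        levels.append({c for c in candidates
                       if sum(c <= b for b in baskets) >= min_support})
        k += 1
    return [itemset for level in levels for itemset in level]

for itemset in frequent_itemsets(baskets, min_support):
    print(sorted(itemset))
```

On this toy data every single item and every pair survives, while the triple {bread, milk, butter} appears in only one basket and is pruned, exactly the kind of cut that makes the search tractable.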

**Big Data is not only big in storage space: it’s exponentially bigger than <whatever>bytes of it!**

**Don’t get lost in combinatorics: apply to Data ScienceTech Institute programmes!**