August 2015

Why Big Data machine learning is solving a non-existent problem

Big Data is the new buzzword. Every data startup focuses on solving Big Data problems these days, and almost all machine-learning-as-a-service companies have solutions for Big Data. We at Hopdata think differently. Big Data is a problem only for a few very large companies with petabytes of data. Most (almost all) startups and companies haven’t reached a scale where they have millions of users and generate petabytes of data.

We at Hopdata focus on providing solutions for building machine learning models from small datasets (we call this “reasonable data”).

Let’s analyze what Big Data actually is. Wikipedia’s definition of Big Data:

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or other certain advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making. And better decisions can mean greater operational efficiency, cost reduction and reduced risk.

But how many companies actually possess large and complex datasets and are seeking solutions for large-scale data modeling?

The problem with modern machine-learning-as-a-service solutions is that they primarily focus on Big Data. With Big Data, systems have to be designed around machine learning models that are complex to train.

Machine learning on big datasets is done using distributed systems.

Using distributed systems restricts you to only a few machine learning models (like decision trees, random forests, etc.), because most machine learning algorithms are not designed to run on distributed systems.

Distributed implementations of machine learning algorithms are generally less accurate than the same algorithms run on a single machine, so a distributed system will typically yield lower accuracy than a non-distributed one.

So, if you train a model on a small dataset using a “distributed system”, you will see lower accuracy.

What is a small dataset?

We have established that Big Data means petabytes of data. But how much data can a non-distributed system handle? The answer: as much as the RAM and CPU allow. We at Hopdata use very large VMs with 128 or 256 GB of RAM to process datasets and train each dataset with eight or more different machine learning algorithms. We have designed our system to efficiently process the whole dataset at once, instead of breaking it into chunks.

We can easily process several gigabytes of text data with parallel processing and proper utilization of resources.
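As a rough sketch of what this looks like in practice (the models, column names, and file name below are illustrative assumptions, not our exact pipeline), the whole dataset is read into memory once and a set of candidate scikit-learn models is trained and scored in parallel on a single large machine:

import pandas as pd
from joblib import Parallel, delayed
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Hypothetical CSV with numeric feature columns and a "label" column,
# small enough to sit comfortably in 128-256 GB of RAM.
df = pd.read_csv("customers.csv")
X, y = df.drop(columns=["label"]), df["label"]

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svc": LinearSVC(),
    "random_forest": RandomForestClassifier(n_estimators=100),
    "gradient_boosting": GradientBoostingClassifier(),
}

def evaluate(name, model):
    # Cross-validate on the full in-memory dataset: no chunking, no cluster.
    return name, cross_val_score(model, X, y, cv=5).mean()

# Fit and score the candidate models in parallel on one large machine.
results = Parallel(n_jobs=-1)(delayed(evaluate)(n, m) for n, m in models.items())
for name, score in sorted(results, key=lambda r: r[1], reverse=True):
    print(f"{name}: mean CV accuracy {score:.3f}")

The point is simply that a few gigabytes of tabular data does not require a cluster; one well-provisioned machine training models in parallel is enough.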

How much text is actually in 1 GB?

1 GB = 1024 MB = 1,048,576 KB = 1,073,741,824 bytes, i.e. roughly 1,073,741,824 characters (at one byte per character).

Assume an average of 5 characters per word plus a space (6 bytes per word): that gives 178,956,970 words. At 200 words per page, that’s 894,784 pages, roughly 900,000 pages per 1 GB of memory.
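The same back-of-the-envelope arithmetic in a few lines of Python (assuming one byte per character, i.e. plain English text):

bytes_per_gb = 1024 * 1024 * 1024   # 1,073,741,824 bytes
chars = bytes_per_gb                # roughly one character per byte
words = chars // 6                  # 5 characters per word plus a space
pages = words // 200                # 200 words per page

print(chars, words, pages)          # 1073741824 178956970 894784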

So around a million records can be uploaded in the form of a CSV file. This could represent data for over a million customers, a million articles, and so on. Also, the data used to train machine learning models is different from the data generated by analytics or logs: for machine learning the data is much more refined and does not contain duplicate records. Each dataset only includes data that makes sense or contributes features to the model.
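As a small illustration (the file and column names here are hypothetical), preparing such a refined training table is straightforward with pandas: load the CSV, drop duplicate records, and keep only the columns that contribute features to the model:

import pandas as pd

df = pd.read_csv("articles.csv")              # ~1 million rows, well under a few GB
df = df.drop_duplicates()                     # training data should not repeat records
feature_cols = ["title", "body", "category"]  # keep only columns that carry signal
df = df[feature_cols].dropna()                # discard rows with missing values

print(len(df), "clean records ready for training")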

At Hopdata, we solve real-world problems for the masses. By focusing on small datasets, we are able to provide services at affordable prices, with higher accuracy than any other machine learning service.