A Store for Big Data?
A Big Data App Store solves the following problem: large corporations with great data sets have few data scientists to work on them, while IT startups with great data scientists have little data to analyse (the "Data Divide").
Large corporations usually refuse to grant startup companies access to their data, either for legal compliance or for strategic reasons. As a result, thousands of great ideas and potential businesses based on Big Data never see the light of day.
Because of this Data Divide, the few companies with both huge data sets and great data scientists, such as Google or Alibaba, eventually end up capturing the entire market.
Worse, young scientists aware of this situation prefer to work for GAFA (Google Apple Facebook Amazon) or BAT (Baidu Alibaba Tencent) rather than for a company without data or without great mentors.
The Big Data App Store changes this situation by enabling startups to run their data analysis algorithms against the data lakes of large corporations. Results can be accessed through an API strictly controlled by the large corporation to eliminate any risk of data leaks, while revenues resulting from the use of the API can be shared between the startup company owning the algorithm source code and the corporation owning the data.
In this way a Big Data App Store can bridge the Data Divide between startups and industry. What would this look like in real life? Imagine the following scenarios.
Scenario 1: Health Big Data App Store
A group of hospitals with 1,000 TB of PET scan images was looking for a machine learning algorithm to predict lung cancer but had no in-house data scientists.
The group considered hiring a world-class team, but the salaries of known experts were well above the salaries of doctors, which was not acceptable under hospital rules. It also tried to hire a team of young graduates, but nothing could be produced in two years. One day, a startup company abroad created by a famous mathematician offered to research and develop a suitable algorithm. The startup company asked for a copy of all the data in order to create the algorithm (a data set that would soon fit on 8 SSD disks).
But the group refused, because it was not possible to provide a copy of such data under
national laws.
Instead the group proposed a different approach: rather than the group providing a copy of the data, the startup company would provide a copy of the source code under a revenue sharing agreement. The mathematician accepted, because he had not been able to find any suitable data sets in all those years. He also knew that Google was stealthily preparing something similar, so the best option for him was to share rather than eventually lose his competitive advantage.
The hospital group provided the mathematician with an open source Big Data development environment called "Wendelin", which also included a small set of sample data that some patients had agreed to share. The mathematician could adapt his algorithm by writing a python script based on the scikit-learn and scikit-image libraries. Once everything was tested and working, he submitted his script for review to the Big Data App Store set up by the hospital group.
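To give a concrete idea of what such a submission might contain, here is a minimal sketch of a lung cancer classifier built with scikit-learn and scikit-image. The helper load_sample_scans() and the feature choices are hypothetical illustrations for this article, not the mathematician's actual algorithm.

```python
# Minimal sketch of a submission script: extract HOG features from PET scan
# slices with scikit-image and train a classifier with scikit-learn.
# load_sample_scans() is a hypothetical helper standing in for the platform's
# sample data access; labels are assumed to be 1 = cancer diagnosed, 0 = healthy.
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def extract_features(scan):
    """Resize a 2D scan slice and compute HOG texture features."""
    slice_2d = resize(scan, (128, 128), anti_aliasing=True)
    return hog(slice_2d, pixels_per_cell=(16, 16), cells_per_block=(2, 2))

scans, labels = load_sample_scans()                    # hypothetical helper
X = np.array([extract_features(s) for s in scans])
y = np.array(labels)

model = RandomForestClassifier(n_estimators=200, random_state=0)
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
model.fit(X, y)                                        # model submitted for review
```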
The group's development team reviewed the script, ensuring that it did not include malware and did not try to steal data. Once the review passed, the script was approved and published in the Big Data App Store, allowing the algorithm to run on the complete data set. After a week of computation, a first machine learning model was created. Each time a new PET scan image was added to the data lake, the algorithm was able to detect lung cancer in it. After a few years in operation, it was also discovered that the algorithm could predict lung cancer with good accuracy three years in advance.
Today, most hospitals in the world use this algorithm by calling an API of the Big Data App Store. Both the source code and the data remain secret. Each prediction is sold for 1€, with revenue shared between the mathematician and the hospital group.
With over ten million API calls per year, the group is now starting to create
new Big Data applications in other fields of health.
Scenario 2: Automotive Big Data App Store
An automotive company is afraid that the combination of Google Maps, Open Source Vehicle (OSV) and A.I. could lead to an industry in which the added value moves entirely to the data economy, an industry in which cars are produced by small workshops and algorithms are provided by GAFA or BAT.
The company had already tried twice to build its own team of data scientists but, due to the lack of progress after three years of trying best-of-breed open source solutions (OpenStack, Hadoop, Cloudera, Spark, Docker, etc.), it outsourced all its Big Data to a large IT corporation. However, the best data scientists of that corporation had also already moved to GAFA or BAT. The company thus became a playground for the sales team of the IT corporation and an exhibition center for legacy proprietary software with high licensing costs. All telematics services had also been outsourced, with the result that a significant part of the data was no longer owned by the company and was only available in mutually incompatible formats.
An engineer in the automotive company discovered a way to create a new algorithm that could predict vehicle failure and thereby generate a measurable increase in sales. He tried to implement it with the Big Data system provided by the large IT corporation but failed: due to the poor architecture and expensive licenses, the operating costs were higher than the profits generated by this new algorithm.
He created a startup company and completed a first implementation in a couple of weeks using a python script based on the scikit-learn machine learning library. Initial data was collected using a low cost telematics device purchased on Alibaba. The proprietary vehicle data format was cracked by a team of engineers in India in less time than it had previously taken to reach the legal department of the automotive company.
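As an illustration of what such a first implementation could look like, here is a small sketch of a failure prediction script with scikit-learn; the file name, column names and model choice are assumptions made for this example, not details from the actual startup.

```python
# Sketch of a vehicle failure predictor trained on telematics records.
# "telematics.csv" and its column names are assumptions for illustration only.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("telematics.csv")                 # one row per vehicle per day
features = ["engine_temp_max", "vibration_rms", "error_code_count",
            "mileage_km", "battery_voltage_min"]
X = df[features]
y = df["failed_within_30_days"]                    # 1 if the vehicle failed soon after

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = GradientBoostingClassifier()
model.fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```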
But although a working algorithm now existed, there was no way to access data from a large fleet of vehicles. The engineer was approached by Google and Tesla, but tried one last time to convince the automotive company. Discussions started with the newly appointed Chief Digital Officer, but due to the strict privacy laws in Europe and Japan it was not possible to provide a copy of car data to the startup company, even though there was no problem of trust. This left both the engineer and the automotive company searching for a way around the legal issues.
The engineer eventually suggested that the automotive company build a "Big Data App Store" and copy all data into it using embulk. In order to run the algorithm efficiently, the data needed to be stored in a data structure called ndarray, which was not natively supported by the Big Data lake provided by the IT corporation. And in order to run scikit-learn efficiently, Python had to be used natively. The code of the algorithm could then be uploaded into the "Big Data App Store", removing the need for any data to be taken "out" of the company.
The Big Data App Store was created in three months using 8 servers with 16 TB of SSD disks each. The python code was run in the automotive company's Big Data App Store using the available data and eventually generated a 1% increase in sales as well as higher customer satisfaction. Revenues were then shared between the startup company and the automotive company.
The Anatomy of a Big Data App Store
A Big Data App Store can be launched in less than three months and with a budget of less than 50k€ using Wendelin technology.
It requires the following components:
- Reliable data collection and aggregation (to ingest data into the data lake)
- High performance scalable storage (to process data with data analysis libraries)
- Data analysis libraries (including machine learning)
- Parallel processing (to process huge data quickly)
- Out-of-core processing (to handle large machine learning models)
- Rule based data access restrictions (to isolate applications)
- Application submission workflow (to submit applications)
- API registration workflow (to let applications publish an API)
- Accounting (to count CPU usage, data usage and API calls of applications)
- Billing (to charge the different stakeholders and users)
With Wendelin, all components are based on the same technology and the same language: python. Wendelin leverages the largest community in data science: PyData. Data is handled in its native format, ndarray, without requiring format conversions. And with wendelin.core, there are no limits imposed on the size of the data set.
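As a rough sketch of how data can live as an ndarray without a size limit, the following assumes wendelin.core's ZBigArray stored in a ZODB FileStorage; the module paths and exact behaviour follow the project's published demo and may differ between versions.

```python
# Sketch of out-of-core ndarray storage with wendelin.core (ZBigArray in ZODB).
# The array lives on disk and is paged into RAM lazily; numpy-compatible views
# of it can be handed directly to scikit-learn.
import numpy as np
import transaction
from ZODB import DB
from ZODB.FileStorage import FileStorage
from wendelin.bigarray.array_zodb import ZBigArray

db = DB(FileStorage("datalake.fs"))
conn = db.open()
root = conn.root()

if "scans" not in root:
    # 100,000 images of 128x128 float32 (~6.5 GB) declared without loading into RAM
    root["scans"] = ZBigArray((100000, 128, 128), np.float32)
    transaction.commit()

scans = root["scans"]
view = scans[:]                       # numpy-compatible view backed by the database
view[0] = np.zeros((128, 128), dtype=np.float32)
transaction.commit()                  # changes made through the view are persisted
print(view.shape)
```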
If one tried to build such an App Store with heterogeneous technologies (e.g. Java, Python, Spark, HDFS, etc.), the high complexity of the system would result in a longer time to deliver, higher maintenance costs, frequent instability due to API changes and possibly lower performance caused by overhead. It is, for example, well known that combining Python and Spark leads to various types of processing overhead and to suboptimal memory management that can create serious issues in a mission critical system.
Moreover, embedding "rule based data access restrictions" directly into the python programming language used in Wendelin is still a unique feature, without which it would not be possible to operate an app store. "Restricted Python" technology ensures that every access to every piece of data in every line of source code published in the Big Data App Store has been explicitly granted beforehand and is traceable. Applications in the app store are thus blocked from stealing secrets from other applications, even though they share the same environment and the same database.
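A simplified illustration of what this restriction could look like in code is given below, using the RestrictedPython package; the whitelisted accessor get_series() and the policy it implies are assumptions made for the example, not the actual Wendelin mechanism.

```python
# Sketch: compile a submitted script with RestrictedPython and expose only a
# whitelisted, auditable data accessor to it. get_series() and mean() are
# hypothetical helpers injected by the platform.
from RestrictedPython import compile_restricted, safe_globals

submitted_source = """
values = get_series("vehicle_42", "engine_temp")
result = mean(values)
"""

def get_series(vehicle_id, sensor):
    # The platform checks its access rules here and logs the request, so every
    # data access by the application is explicitly granted and traceable.
    return [90.0, 91.5, 95.2]

def mean(values):
    return sum(values) / len(values)

byte_code = compile_restricted(submitted_source, filename="<app>", mode="exec")
namespace = dict(safe_globals, get_series=get_series, mean=mean)
exec(byte_code, namespace)    # runs without file access or imports unless granted
print(namespace["result"])
```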
Conclusion
In less than three months, any large corporation can create a "Big Data App Store" and invite thousands of startup companies to leverage its data and create new A.I. and machine learning applications. Revenue can easily be shared between data owners and the creators of data analysis algorithms, closing the Data Divide between startups and industry and allowing "Big Data App Stores" to compete efficiently with GAFA (Google Apple Facebook Amazon) and BAT (Baidu Alibaba Tencent).