
By storing vast amounts of information on economic parameters, market prices, customer behaviour, stress-test definitions and results, and compliance legislation and rules, the financial sector, among others, is becoming a major consumer of what is called a data lake. Especially in an era of highly integrated risk management, with a need to fuse multiple knowledge sources, and with data pools already in place or available, data lakes offer a cost- and time-effective solution, both for data one expects to analyse and for incoming real-time data.

According to its formal definition, a data lake is a storage repository that holds a vast amount of raw data in its native format. Data is not pre-categorized at the entry point, so, unlike in online analytical processing, no particular form is dictated by the need to support specific types of analysis. A data lake holds a vast number of events.
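The "native format at the entry point" idea can be made concrete with a minimal sketch. All names and the directory layout here are hypothetical, chosen only for illustration: the point is that the payload is stored byte-for-byte as received, with nothing more than a lightweight metadata sidecar.

```python
import json
import time
import uuid
from pathlib import Path

# Hypothetical lake layout: raw payloads land here untouched.
LAKE = Path("lake/raw")

def ingest(payload: bytes, source: str, fmt: str) -> Path:
    """Store a raw event exactly as received, plus a metadata sidecar.

    No schema is imposed: the lake records only where the data came
    from, what format it claims to be, and when it arrived.
    """
    LAKE.mkdir(parents=True, exist_ok=True)
    key = uuid.uuid4().hex
    blob = LAKE / f"{key}.{fmt}"
    blob.write_bytes(payload)  # native format, byte-for-byte
    meta = {"source": source, "format": fmt, "ingested_at": time.time()}
    (LAKE / f"{key}.meta.json").write_text(json.dumps(meta))
    return blob

path = ingest(b'{"ticker": "XYZ", "bid": 101.2}',
              source="market-feed", fmt="json")
print(path.suffix)  # the payload keeps its native extension: .json
```

Any categorization or reshaping is deferred to read time, which is what distinguishes this from a classical warehouse-style load.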

The data lake solution provides a platform for a historical kind of archive. It contains data from many different sources, and people in the organization are free to add data to, or update data in, the lake. One launches Google-style queries and then refines them with additional fields one may create and identify, searching interactively while expanding the description of the big data's structure at the same time.
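This "structure discovered while searching" behaviour is often called schema on read, and a small sketch shows the mechanism. The events and field names below are invented for illustration; the key point is that parsing happens at query time, so records missing a field are simply tolerated rather than rejected at load time.

```python
import json

# Illustrative raw events, stored as opaque lines in the lake.
raw_events = [
    '{"customer": "A1", "amount": 250.0, "country": "DE"}',
    '{"customer": "B7", "amount": 90.5}',               # no country field yet
    '{"customer": "A1", "amount": 40.0, "country": "FR"}',
]

def query(events, **filters):
    """Google-style selection: keep events matching every given field.

    The structure of each record is read on the fly; the lake itself
    imposed no schema when the events were stored.
    """
    for line in events:
        record = json.loads(line)
        if all(record.get(k) == v for k, v in filters.items()):
            yield record

hits = list(query(raw_events, customer="A1"))
print(len(hits))  # 2 -- the record lacking 'country' caused no error
```

As the analyst identifies useful fields, they can be promoted into a catalogue, which is the interactive expansion of structure the paragraph above describes.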

Its architecture seems to involve five components:

  1. A double historical layer that gives access to all historical data. The batch layer has services to search, locate, and access the historical data lake. The results are periodically re-computed and cached in a serving layer. One can use Hadoop, for example, to sift through the data and extract the chunks that answer the questions at hand, eventually replacing OLAP (online analytical processing).
  2. A speed, on-the-fly layer that runs searches on fresh, continually updating data (say, at most one hour old) in real time and at low latency, possibly starting from a cached result set. It queues and streams the data while updating the data lake, giving a view of the most recent data and supporting decision-making.
  3. Lake services, which prepare, integrate, and store the information from the data lake in loosely pre-definable historical and on-the-fly double catalogues. As search results become available, new relationships are created between different sources of big data. Here data may be kept safe and properly protected via tokenization, encryption, key management, and security audits.
  4. A data reservoir, where the reliability of the data is checked and the data is cleansed, and which people in the organization may access as necessary. There the data is prepared to answer specific questions. Through this type of container, big data actually becomes useful.
  5. A reactively managed engine to handle the real-time constraints (an engine programmed to respond to events, to scale to multiple cores and multiple server nodes, to be resilient to software, hardware, and connection failures, and to react in real time) that provides the libraries the analyst experts need to do their work. With loosely coupled event handlers, as in reactive programming, the actual location of the data loses importance from the tracing/functionality point of view, and the data analysis gains in scalability and in event-triggered response to the final user's request.
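The interplay between the historical and the on-the-fly layers (points 1 and 2 above) can be sketched in a few lines. This is a toy, lambda-style model with hypothetical class and field names: a batch layer periodically re-computes a cached view over the full history, a speed layer covers only events newer than the last batch run, and a query merges the two.

```python
class BatchLayer:
    """Periodically re-computed view over the historical lake (point 1)."""

    def __init__(self):
        self.view = {}        # serving-layer cache
        self.last_run = 0.0   # timestamp of the last re-computation

    def recompute(self, events, now):
        """Re-aggregate the full history -- the expensive, periodic job."""
        self.view = {}
        for ts, key, value in events:
            self.view[key] = self.view.get(key, 0) + value
        self.last_run = now

class SpeedLayer:
    """Low-latency view over only the freshest events (point 2)."""

    def __init__(self):
        self.recent = []

    def append(self, event):
        self.recent.append(event)

    def view_since(self, cutoff):
        out = {}
        for ts, key, value in self.recent:
            if ts > cutoff:   # only events not yet covered by the batch view
                out[key] = out.get(key, 0) + value
        return out

def merged_query(batch, speed, key):
    """Combine the cached historical view with the fresh one."""
    fresh = speed.view_since(batch.last_run)
    return batch.view.get(key, 0) + fresh.get(key, 0)

# Events are (timestamp, key, value) triples -- purely illustrative data.
history = [(1, "trades", 10), (2, "trades", 5)]
batch = BatchLayer()
batch.recompute(history, now=2)

speed = SpeedLayer()
speed.append((3, "trades", 7))   # arrived after the last batch run

print(merged_query(batch, speed, "trades"))  # 15 historical + 7 fresh = 22
```

The design choice this illustrates is that the speed layer never needs to hold much state: everything older than `last_run` is already folded into the batch view, so the real-time path stays small and fast.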

The last layer actually represents the software presently in use, in which roughly three times more processing time is currently consumed by unautomated data sorting just to make the data usable.

The data lake concept is about five years old, and there are not yet clearly differentiated vendors of complete data lake solutions, be they financial solutions or not. Everybody is at the very beginning, but everybody is equally in dire need of competitive advantage. One visualises the financial data in the way it is best understood and used, as in the V^3 approach (variety, velocity and vagueness of data) in operational risk.

Whatever the final goal is, clearly identifying what a data lake software concept might be about is, in my opinion, a very good starting point.