Cloudera, a company that started in the Big Data era, is now making its way into the era of Big AI with Large Language Models (LLMs). Cloudera has launched its strategy and tools to help businesses integrate LLMs and generative AI into the company’s Cloudera Data Platform (CDP). The platform provides an open data lakehouse model that enables organizations to run data analytics operations on top of data lake storage.
Easier Integration with LLMs
With LLM integration, Cloudera is making it easier for enterprises to directly integrate with open-source LLMs from Hugging Face and open-source vector databases to build AI applications. Cloudera also announced the general availability of its observability platform, which will help organizations monitor data workloads running on CDP. Ram Venkatesh, CTO of Cloudera, stated that with LLMs, businesses can take advantage of a new way of processing data and getting real-time insights at a scale that has never been possible before.
Cloudera is not building its own LLMs but rather making it easier for enterprises to use LLMs to gain insights from data that organizations already have in a data lakehouse. Cloudera’s catalog of reference architectures already provides AI models for customer churn and fraud analytics. The company is now expanding with architectures for conversational AI and LLMs. CDP users can select the new LLM reference architecture from the catalog and have it installed in their environment in a few minutes.
Zero-Shot Learning Model
The approach that Cloudera is embracing is known as zero-shot learning, in which an existing, pre-trained LLM is applied to an existing data source without first being retrained on it. The initial LLMs that Cloudera is integrating are open-source models that can run entirely inside the Cloudera platform. By running the LLM in the same platform as the data, organizations can ensure that no data ever leaves the enterprise’s purview and that no external API calls are made. Cloudera also lets its users choose which open-source vector database to use. Among the options are Milvus, Weaviate, and Qdrant.
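To make the zero-shot idea concrete, here is a minimal Python sketch of how passages retrieved from existing data might be placed into a prompt for an unmodified model. The article does not describe Cloudera's actual implementation, so the function names, the prompt wording, and the `generate` callable (a stand-in for a locally hosted open-source model) are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def build_zero_shot_prompt(question: str, passages: Sequence[str]) -> str:
    """Assemble a prompt from retrieved passages.

    No fine-tuning and no worked examples are involved: the model only
    sees instructions, context, and the question. (Hypothetical helper.)
    """
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


def answer(
    question: str,
    passages: Sequence[str],
    generate: Callable[[str], str],
) -> str:
    # `generate` is a placeholder for a model running inside the platform;
    # this sketch itself makes no external API calls.
    return generate(build_zero_shot_prompt(question, passages))
```

Because the model is passed in as a plain callable, the same prompt-building step would work with whichever open-source model an organization chooses to host.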
Vector Databases
Data lakehouse technology relies on object storage, which Venkatesh said is often an excellent way for organizations to store unstructured and semi-structured data. To work with AI, that data needs to be organized with a vector database. Cloudera is creating a vector database for an LLM deployment, but that does not mean enterprises end up duplicating data, with one copy in the lakehouse and another in the vector database. Instead, the vector database provides a functional index of the data as vectors.
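The "index, not a copy" idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how Milvus, Weaviate, or Qdrant work internally: the `toy_embed` function is a simple word-count stand-in for a real embedding model, and the `s3a://example-bucket/...` URIs are invented placeholders for objects in lakehouse storage.

```python
import math
from collections import Counter
from dataclasses import dataclass


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: sparse word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class IndexEntry:
    object_uri: str   # pointer back to the object in lakehouse storage
    vector: Counter   # vector representation of that object's text


class ToyVectorIndex:
    def __init__(self) -> None:
        self.entries: list[IndexEntry] = []

    def add(self, object_uri: str, text: str) -> None:
        # Only the URI and the vector are stored; the text stays where it is.
        self.entries.append(IndexEntry(object_uri, toy_embed(text)))

    def search(self, query: str, k: int = 1) -> list[str]:
        q = toy_embed(query)
        ranked = sorted(
            self.entries, key=lambda e: cosine(q, e.vector), reverse=True
        )
        return [e.object_uri for e in ranked[:k]]


index = ToyVectorIndex()
index.add(
    "s3a://example-bucket/churn_report.txt",
    "Quarterly customer churn report for retail accounts",
)
index.add(
    "s3a://example-bucket/fraud_alerts.txt",
    "Fraud alerts on card transactions",
)
print(index.search("customer churn"))
```

A search returns references to the original objects, which the application can then read from the lakehouse, so the documents themselves are never stored a second time.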
Flattening the Pyramid Structure
When Cloudera started in 2008, Big Data, in the form of the open-source Hadoop project, was the company’s foundation. The Big Data market has shifted over the years into the data lakehouse space, where organizations use query engines, typically SQL-based, for data analytics on data stored in cloud object storage repositories. Venkatesh now sees LLMs as the next logical step on the path forward from Big Data.
Venkatesh explained that Big Data created a pyramid-like approach to data analytics, in which the bulk of the data sits at the bottom and only a small amount can be analyzed at the top. With LLMs, that pyramid has flattened out: significantly more data is available for analysis, and the methods for analyzing it are easier. With LLMs and the new wave of AI, it is now possible to analyze all the data at the topmost layer, and queries can be written in English or another natural language rather than only in SQL or Spark. The data needs to be ingested only once, and the vectorized embeddings produced by that ingestion can be reused many times, so every query can take advantage of the semantic store.
Cloudera’s integration of LLMs and generative AI into its platform is intended to make it easier for businesses to gain insights from data they already hold in their data lakehouse. With a zero-shot approach, businesses can benefit quickly from that existing data. Open-source vector databases help organize the data, and businesses can choose which one to use. Cloudera presents its move into the era of Big AI with LLMs as the next logical step on the path forward from Big Data, one that makes significantly more data available for analysis through easier methods.