A Query Engine for Data Processing with LLMs
The powerful semantic capabilities of modern language models (LMs) create exciting opportunities for building AI-based analytics systems that reason over vast knowledge corpora. A wide variety of applications require a form of bulk semantic processing, in which the analytics system must process large amounts of data and apply semantic analysis across the whole dataset. Supporting the full generality of these applications with efficient and easy-to-use analytics systems would have a transformative impact, similar to the impact RDBMSes had on tabular data. This prospect, however, raises two challenging questions: (1) how should developers express semantic queries, and (2) how should we design the underlying analytics system to achieve high efficiency and accuracy?
Unfortunately, existing systems lack high-level abstractions for performing bulk semantic queries across large corpora. We introduce semantic operators, a declarative and general-purpose programming interface that extends the relational model with composable AI-based operations for bulk semantic queries (e.g., filtering, sorting, joining, or aggregating records using natural-language criteria). Each operator can be implemented and optimized in multiple ways, opening a rich space of execution plans analogous to that of relational operators.
LOTUS is an open-source query engine that implements semantic operators in a DataFrame API. Furthermore, we develop several novel optimizations that exploit the declarative nature of semantic operators to accelerate semantic filtering, clustering, and join operations by up to 400x while offering statistical accuracy guarantees. We demonstrate LOTUS’ effectiveness on real AI applications, including fact-checking, extreme multi-label classification, and search. We show that the semantic operator model is expressive: it captures state-of-the-art AI pipelines in a few operator calls and makes it easy to express new pipelines that achieve up to 49.4% higher quality. Overall, LOTUS queries match or exceed the accuracy of state-of-the-art AI pipelines on each task while running up to 28x faster.
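To make the operator model concrete, the sketch below shows a toy semantic filter over a small table of records. This is not the real LOTUS API; the `sem_filter` and `mock_llm_judge` names are hypothetical, and a real implementation would delegate the predicate to a language model rather than the keyword heuristic used here so the example stays self-contained.

```python
# Toy sketch of a semantic-operator-style filter (illustrative only,
# not the real LOTUS API). A semantic filter keeps the records for
# which a judge deems a natural-language predicate true; here the
# LLM call is mocked with a simple keyword heuristic.

def mock_llm_judge(record: dict, predicate: str) -> bool:
    # Stand-in for a real LM call: returns True if any word of the
    # predicate appears in the record's text field.
    text = record["text"].lower()
    return any(word in text for word in predicate.lower().split())

def sem_filter(records, predicate, judge=mock_llm_judge):
    # Bulk semantic filter: apply the judge to every record.
    return [r for r in records if judge(r, predicate)]

papers = [
    {"title": "A", "text": "Query optimization for relational databases"},
    {"title": "B", "text": "Protein folding with deep learning"},
]
kept = sem_filter(papers, "databases")
print([p["title"] for p in kept])  # -> ['A']
```

Because the predicate is declarative, the engine is free to choose how to evaluate it (e.g., batching records, caching judgments, or substituting a cheaper proxy model), which is the optimization space the abstract refers to.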
Contributors
Liana Patel, Siddharth Jha, Parth Asawa, Melissa Pan, Carlos Guestrin, Matei Zaharia