Matryoshka

Semantic-Aware Parsing for Security Logs

Security analysts struggle to quickly and efficiently query and correlate log data due to the heterogeneity and lack of structure in real-world logs. Existing AI-based parsers focus on learning syntactic log templates but lack the semantic interpretation needed for querying. Directly querying large language models on raw logs is impractical at scale and vulnerable to prompt injection attacks.

We introduce Matryoshka, the first end-to-end system leveraging LLMs to automatically generate semantically aware structured log parsers. Matryoshka combines a novel syntactic parser—employing precise regular expressions rather than wildcards—with a completely new semantic parsing layer that clusters variables and maps them into a queryable, contextually meaningful schema. This approach provides analysts with queryable and semantically rich data representations, facilitating rapid and precise log querying without the traditional burden of manual parser construction. Additionally, Matryoshka can map the newly created fields to recognized attributes within the Open Cybersecurity Schema Framework (OCSF), enabling interoperability.

We evaluate Matryoshka on a newly curated real-world log benchmark, introducing novel metrics to assess how consistently fields are named and mapped across logs. Matryoshka’s syntactic parser outperforms prior works, and the semantic layer achieves an 𝐹1 score of 0.95 on realistic security queries. Although mapping fields to the extensive OCSF taxonomy remains challenging, Matryoshka significantly reduces manual effort by automatically extracting and organizing valuable fields, moving us closer to fully automated, AI-driven log analytics.


Contributors

Julien Piet, Vivian Fang, Rishi Khare, Vern Paxson, Raluca Ada Popa, David Wagner

Publications