06 Feb

Ask about NetFlow data using natural language

This is one of those projects destined to end up in a drawer or live for only a few days. I had a lot of NetFlow data on hand and came up with the idea of writing a chatbot to query that data using natural language. Since I’d never built a chatbot before, why not? Until now, I had mostly worked with models from OpenAI or Azure. I wasn’t too keen on exposing my NetFlow data to public models, so I ran the model on my own server. Plus, it just so happened that the news about the Chinese model DeepSeek had broken that very week. Ultimately, the raw data never gets sent to the model anyway; it doesn’t need to, since everything is stored in a local Elastic Stack. Although the concept changed mid-development, my desire to use a local model remained. A few people asked for more details about the project and for the code to be shared, so I invite you to read on and check out the repository.

Architecture

Let’s start with the architecture of the solution.

NetFlow data from two firewalls is collected by the ElasticFlow collector. It’s a very convenient (and in many cases free) NetFlow/sFlow/IPFIX collector with built-in mechanisms for exporting data, for example, to the ELK stack. Elasticsearch is used both to receive data from ElasticFlow and to make that data available to other applications, such as Kibana for visualization or my own script.

The LLM models run on a dedicated server with a suitable GPU. I run the models using Ollama software, which exposes an API for queries on the local network.

The script accepts a user query formulated in natural language. It then sends this query to the model to obtain a DSL filter structure in response; this filter is needed to extract only the records we’re interested in from Elasticsearch. The filter generated by the model is then used in the Elasticsearch query. The JSON response from Elasticsearch is in turn passed back to the LLM, which translates it into natural language and displays it to the user.

Elasticsearch query

Let’s start by looking at the Elasticsearch queries. The method query_es() defined in the ex_client.py module handles them. The communication is very straightforward because it uses the Elasticsearch API. We need to query the data available under a specified index or group of indices and narrow the query criteria using a DSL filter. A DSL (Domain Specific Language) filter in Elasticsearch is a specialized query language that allows you to define precise search and filtering operations on data within the indices. Its syntax is based on the JSON format, which makes the queries readable and easily integrated with other systems that use this popular data format. Here’s an example:

{
   "query":{
      "bool":{
         "must":[
            {
               "term":{
                  "destination.port":9200
               }
            },
            {
               "range":{
                  "@timestamp":{
                     "gte":"now-15m",
                     "lt":"now"
                  }
               }
            }
         ]
      }
   },
   "aggs":{
      "unique_hosts":{
         "terms":{
            "field":"source.ip",
            "size":1000
         }
      }
   },
   "size":0
}

Elasticsearch provides a well-documented API. Initially, I tried using the standard requests library to send and receive data via this API. Unfortunately, issues arose with the formatting of keys in the DSL structure sent in the request body: the API requires key names to be enclosed in double quotes, but my requests-based code ended up sending the Python dict’s string representation, which uses single quotes. I quickly switched to the elasticsearch-py library, where API handling is implemented through dedicated methods. To establish a connection, I created the get_es_client() method. I then pass only the indices to be queried and the DSL filter to this library’s search() method, and in response I either receive the data or an error.
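
For illustration, here is a minimal sketch of what get_es_client() and query_es() could look like with elasticsearch-py; the host, credentials, and client settings are placeholders rather than values from the repository:

from elasticsearch import Elasticsearch

def get_es_client() -> Elasticsearch:
    # hypothetical connection details; adjust the host and credentials to your setup
    return Elasticsearch(
        "https://elasticsearch.local:9200",
        basic_auth=("netflow_reader", "change-me"),
        verify_certs=False,
    )

def query_es(indices: str, dsl_filter: dict) -> dict:
    # the DSL filter produced by the LLM is passed as the request body;
    # elasticsearch-py takes care of serializing it to proper JSON with double-quoted keys
    es = get_es_client()
    response = es.search(index=indices, body=dsl_filter)
    return response.body  # .body assumes the 8.x client; older clients return a plain dict already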

Requesting the LLM to generate a DSL filter

We retrieve the user input from the console. However, before sending it to the LLM model, we need to wrap it in some context. The user’s query is therefore embedded in a larger prompt named ADAPT_ES_QUERY, defined in the prompt_templates.py file, which supplies that context, supplements the query with additional data, and describes in detail the result expected from the model. In the program’s first iteration, I passed only the user’s query into the generated prompt; later I expanded it with two additional pieces of information, which I’ll describe shortly.
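
As a rough illustration, the first iteration of ADAPT_ES_QUERY might have looked something like the sketch below; the exact wording in the repository may differ:

# prompt_templates.py (sketch): the placeholder {user_query} is filled with the console input
ADAPT_ES_QUERY = """
You are an assistant that converts questions about NetFlow data stored in
Elasticsearch into Elasticsearch DSL filters.

User question:
{user_query}

Return ONLY a valid JSON object containing the DSL filter, with no explanation
and no extra text.
"""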

The method generate_text() from the llm_client.py module is responsible for communication with the Ollama API. It uses the requests library to send a POST request, including in the payload information about which model to refer to and the query content. We also include additional flags there, such as disabling response streaming.

I use the same method and approach to analyze the response from Elasticsearch and translate it, in the context of the user’s query, into an answer the user can understand. The only difference is the prompt, whose template is stored in GENERATE_FINAL_ANSWER.
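
A minimal sketch of what generate_text() might look like, assuming Ollama’s /api/generate endpoint; the host and default model name are illustrative:

import requests

OLLAMA_URL = "http://ollama.local:11434/api/generate"  # hypothetical host

def generate_text(prompt: str, model: str = "qwen2.5-coder:32b") -> str:
    payload = {
        "model": model,    # which local model Ollama should use
        "prompt": prompt,  # the full prompt built from one of the templates
        "stream": False,   # disable streaming: return one complete response
    }
    response = requests.post(OLLAMA_URL, json=payload, timeout=300)
    response.raise_for_status()
    # Ollama returns the generated text under the "response" key
    return response.json()["response"]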

First run

So, we now have what appears to be a ready-to-go product that works as follows (a minimal end-to-end sketch follows the list):

  • The user types their query into the console.
  • The script asks the LLM model to translate the user’s query into a DSL filter for Elasticsearch.
  • The script queries the specified Elasticsearch indices, using the obtained DSL filter to define the query parameters.
  • The response from Elasticsearch, in JSON format, is passed to the LLM model so that it can translate the Elasticsearch answer into natural language in the context of the original query.
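
Putting it together, the core flow boils down to something like the sketch below, which composes the helpers outlined earlier; the index pattern and the placeholder names in GENERATE_FINAL_ANSWER are assumptions, not taken from the repository:

import json

def answer_netflow_question(user_query: str) -> str:
    # 1. ask the LLM to turn the question into a DSL filter
    dsl_text = generate_text(ADAPT_ES_QUERY.format(user_query=user_query))
    dsl_filter = json.loads(dsl_text)  # assumes the model returned clean JSON (more on that below)
    # 2. run that filter against the NetFlow indices in Elasticsearch
    es_response = query_es("elastiflow-*", dsl_filter)
    # 3. ask the LLM to explain the JSON response in the context of the original question
    return generate_text(
        GENERATE_FINAL_ANSWER.format(
            user_query=user_query,
            es_response=json.dumps(es_response),
        )
    )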

Easy, what can go wrong? 🙂

Problems with responses from the LLM

I tested generating responses on models such as Llama3, DeepSeek-R1 (in the 14B, 32B, and 70B variants), and qwen2.5-coder (in the 3B and 32B variants). I’ll discuss the significance of these parameter counts in a moment; the first problem had nothing to do with them. Some models wouldn’t return just the JSON structure of the DSL filter in their response. I’m referring primarily to DeepSeek-R1, which consistently ignored the prompt’s request to return only the JSON structure. Instead, its response always included the model’s entire chain of “reasoning,” as if I had asked for an explanation.

The second problem was that some models did not always return a clean JSON object as requested. For example, they would wrap it in Markdown formatting such as code fences. To address this, I added the method extract_json_from_llm_response() to the llm_client.py module, which is responsible for removing any extraneous content from the response. This method is called within generate_text().
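
A cleanup method of this kind can be as simple as the following sketch (an approximation, not the repository code): strip Markdown code fences and keep only the outermost JSON object.

import json
import re

def extract_json_from_llm_response(text: str) -> dict:
    # drop Markdown code fences such as ```json ... ```
    text = re.sub(r"```(?:json)?", "", text)
    # keep only the substring between the first "{" and the last "}"
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in the LLM response")
    return json.loads(text[start:end + 1])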

A quick note here – remember, this is proof-of-concept code. The llm_client.py module should ideally include much more exception handling, data validation, etc.

Solving these formatting problems revealed a more serious issue: the models were not generating correct DSL structures. This incorrectness manifested in two ways. First, the model would “invent” the NetFlow field names used in the filters. For example, when querying for traffic on a destination port, it might sometimes use the field “destination.port,” other times “port,” or even “destination_port,” and other variations appeared as well. Second, the structure of the DSL filter itself was often wrong: known keywords such as “aggs” or “size” were placed in the wrong parts of the structure, rendering the entire construction invalid.

How does the LLM work?

This brings us to the theoretical part, explaining why this happens. I should note right away that the information in this paragraph is somewhat simplified to be understandable even for those who don’t know how such models operate—without delving too deeply into the details.

Incorrect generation of DSL filters may stem from several factors. Three basic ones are:

Lack of Precise Training Data and Specific Patterns:
Language models are trained on massive datasets that might not contain sufficiently detailed and uniform examples of DSL queries for Elasticsearch—especially for specialized use cases like NetFlow traffic analysis. As a result, the model “hallucinates” field names and may arrange the query structure incorrectly, placing key elements in the wrong locations. This is due to the lack of strict supervision over the precise syntax and structure of the DSL, causing the model to improvise based on inconsistent or incomplete patterns found in the training data.

Model Limitations in Understanding Domain Structure:
LLM models generate text based on the probability of subsequent words. However, without an explicit mechanism to validate the structure of the resulting code or query, they can make both syntactic and semantic errors. The absence of a validation mechanism during generation means that even if the model “knows” which keywords should be used, it does not always correctly position them within the hierarchy required by the DSL.

A Model Chosen with Too Few Parameters:
In theory, increasing the number of parameters can enhance the model’s ability to capture complex patterns and relationships in the data, which would theoretically translate into more precise responses. However, a larger number of parameters also brings challenges, such as higher computational requirements, increased susceptibility to “hallucinations,” and difficulties in precisely controlling the generated text in very specific domains. Conversely, a model with fewer parameters may have a limited ability to model very complex dependencies, which can also affect the precision in generating specialized DSL structures. However, it might be more “conservative” in its improvisation. The numbers like “70B” or “32B” in a model’s name refer to the number of parameters in the model (70 billion and 32 billion, respectively).

Tuning the LLM

Issues with the LLM can be addressed in several ways. The first idea was to switch to models from OpenAI or Azure, that is, ones not hosted locally on my server. Such models returned noticeably better results, particularly regarding the correctness of the DSL structure; they seem to be a bit better fine-tuned. However, the whole idea of the project was to use a local model.

The second option is to enrich the prompt sent to the LLM. That’s why, in the ADAPT_ES_QUERY template, I pass two additional pieces of information. As base_query_template, I provide an example DSL filter structure as a pattern. This is meant to help the model construct the response more accurately. The sample filter is stored in a file within the repository.

The second parameter, index_mapping, is the field mapping of the Elasticsearch index. This helps the model use the correct field names. The mapping is several megabytes in size and is fetched from ElasticFlow via an API.
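
A sketch of how the enriched prompt could be assembled is shown below. It assumes ADAPT_ES_QUERY has been extended with {base_query_template} and {index_mapping} placeholders, that the sample filter lives in a file such as base_query.json, and (as one illustrative alternative to fetching it from the collector) that the mapping is pulled through Elasticsearch’s own _mapping API; all of these names are hypothetical.

import json

def build_dsl_prompt(user_query: str, es_client, index_pattern: str) -> str:
    # the example DSL filter that serves as a structural pattern for the model
    with open("base_query.json") as f:
        base_query_template = f.read()
    # field mapping of the queried indices, so the model uses real field names
    mapping = es_client.indices.get_mapping(index=index_pattern)
    return ADAPT_ES_QUERY.format(
        user_query=user_query,
        base_query_template=base_query_template,
        index_mapping=json.dumps(mapping.body),  # .body assumes the 8.x client
    )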

Introducing these changes made the query sent to the model significantly larger, which in turn requires far more resources to process. The result was a slight improvement in the data returned by the model, but not enough to call the problem solved.

The model performs much better when translating the Elasticsearch response into natural language—provided that the response contains the data the user asked for. Again, we face the issue that this is a proof-of-concept project, and proper mechanisms for verifying the Elasticsearch response should be implemented. Otherwise, the model’s interpretation of empty or incorrect data may generate an invalid answer for the user.
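
Even a small guard like the hypothetical helper below, which checks that the Elasticsearch response actually contains hits or aggregations before handing it to the model, would go a long way here:

def has_usable_results(es_response: dict) -> bool:
    # Elasticsearch 7+ reports the hit count as an object under hits.total.value
    hits = es_response.get("hits", {}).get("total", {}).get("value", 0)
    aggregations = es_response.get("aggregations", {})
    return hits > 0 or bool(aggregations)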

Further tuning ideas

This brings us to the question of what else could be done. I haven’t tested these ideas yet:

  • Using a Properly Trained Model:
    In my tests, I was limited by the performance of the server and GPU running the LLM models and the availability of the models themselves. Models like OpenAI’s o1 seemed to generate better responses than Llama3. I also tested the recommended qwen2.5-coder model. This model has been fine-tuned on large datasets containing code and technical documentation, enabling it to generate coherent, correct, and contextually appropriate code in various programming languages such as Python, JavaScript, Java, C++, and others. As a result, the model can create new code snippets and assist in debugging, refactoring, or analyzing existing solutions. Unfortunately, I did not notice a significant difference between it and, for example, DeepSeek-R1.
  • Searching for a Model Trained for Working with Elasticsearch:
    Unfortunately, I haven’t found a publicly available model already trained on data specific to communicating with Elasticsearch. Whether such a model exists in free or commercial form, I don’t know. Will one be developed? That’s not for me to say, but the makers of Elasticsearch are paying close attention to AI, so something like that may well emerge.
  • Fine-Tuning the Chosen Model Yourself:
    Attempting your own model fine-tuning is also an option. However, this comes with the necessity of having training data and preparing it appropriately. Training a model does not simply involve dumping a text file into it and expecting it to magically learn how to work with the data. I also do not have the appropriate training data, as I don’t work with Elasticsearch daily.
  • Modifying Query Parameters:
    When using the API interfaces of language models, you can pass parameters that affect how responses are generated; these parameters are not exposed in web chat interfaces. For example, temperature determines the level of “randomness” in the generated text: a value close to 0 makes the model generate more predictable and deterministic responses, whereas higher values (e.g., 0.8 or 1.0) increase the variety of outcomes, which might lead to more creative but sometimes less consistent responses. top_p (nucleus sampling) sets the probability threshold within which the model selects the next tokens; for example, top_p=0.9 means the model chooses from tokens whose cumulative probability amounts to 90%. In this way, you might try to control the level of “creativity” in the generated responses (a minimal sketch of passing these options to Ollama follows this list).
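
With Ollama, such parameters can be passed through the "options" object of the /api/generate request; the host, model name, and values below are purely illustrative:

import requests

def generate_text_tuned(prompt: str, model: str = "qwen2.5-coder:32b") -> str:
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Ollama forwards sampling parameters via its "options" object
        "options": {
            "temperature": 0.2,  # closer to 0: more deterministic, conservative output
            "top_p": 0.9,        # nucleus sampling: choose from the top 90% probability mass
        },
    }
    response = requests.post("http://ollama.local:11434/api/generate", json=payload, timeout=300)
    response.raise_for_status()
    return response.json()["response"]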

In the project repository, I have included a correct DSL filter for querying which devices in the network have communicated on port 9200 during the last 10 minutes. By running the script with the --override-dsl parameter, you can check for yourself how much the LLM model’s generated response deviates from the expected outcome when asking an appropriate question. You can also see how the LLM model interprets the Elasticsearch response. Additionally, running with the --debug flag will display a lot more information on the screen—not just error messages.

Summary

This small project showed me both that I can write a simple chatbot in one evening and where the limitations of current LLM models lie. Remember that working with them through a chat interface on a website is somewhat different from using them programmatically. The project was successful because it demonstrated that such a chatbot could be built.

And what did I learn in the process? Well, that’s for me to know! 🙂
