Build Your Own AI Research Assistant with “Small” Large Language Models

Imagine an efficient, intelligent AI researcher running on your laptop… Recently released small Large Language Models can run locally with comparatively modest RAM (~5 GB) and CPU usage, and your data never leaves your machine. Sounds perfect, right?

Unfortunately, these advantages come with significant drawbacks: these models are far less capable than their larger counterparts and lack some of the features built into the APIs of larger hosted models. Out of the box, they are poorly suited to most tasks in specialised domains such as cybersecurity.

However, there are several techniques available to improve the effectiveness of these models for narrowly defined tasks relevant to researchers.

This presentation provides a live coding walkthrough of building an LLM-based tool that can:

  • Perform reverse searches (selecting rules similar to a target rule) and natural language query searches (see the sketch after this list) on:
      • Sigma rules
      • YARA metadata and string rules
  • Search a researcher’s personal plaintext notes or official documentation
  • Select and use CLI tools and user-provided functions automatically, based on the data being handled
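
As a taste of the reverse-search capability above, here is a minimal sketch, assuming the sentence-transformers package for embeddings; the model name and rule texts are illustrative placeholders, not the talk’s actual data:

```python
# Minimal "reverse search" sketch: rank stored Sigma rule descriptions
# by embedding similarity to a target rule. Assumes the
# sentence-transformers package; rule texts are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, CPU-friendly model

# In practice these would be loaded from a directory of Sigma .yml files.
rules = [
    "Detects suspicious PowerShell download cradle via Net.WebClient",
    "Detects creation of a scheduled task by an unusual parent process",
    "Detects clearing of Windows event logs with wevtutil",
]
target = "Detects PowerShell downloading and executing a remote script"

rule_vecs = model.encode(rules, convert_to_tensor=True)
target_vec = model.encode(target, convert_to_tensor=True)

# Cosine similarity between the target and every stored rule.
scores = util.cos_sim(target_vec, rule_vecs)[0]
for rule, score in sorted(zip(rules, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.2f}  {rule}")
```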

We start by downloading and running the LLM locally using only CLI tools, then move on to prompt engineering techniques and fine-tuning to specialise the model for our selected task. After that, we add tool-using capability through function-calling techniques, so that the model can select and call executables and APIs. Finally, we incorporate a vector database and embedding model to provide search capabilities and additional context data to the LLM. Minimal sketches of several of these steps follow below.

While the walkthrough focuses on Sigma and YARA rules, the techniques can be applied to any task that handles structured or unstructured (plaintext) inputs. Throughout the presentation, we will discuss effective use cases for LLMs, how to guard against the fundamentally different behaviour of LLMs compared to traditional programs and standard tools, and the libraries used for working with LLMs. No prior knowledge of AI or language models is expected. Some familiarity with Python would be useful but is not necessary.
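
For a flavour of the first step, here is a minimal local-inference sketch, assuming the llama-cpp-python bindings and a quantised GGUF model already downloaded; the model file name is a placeholder:

```python
# Minimal local-inference sketch. Assumes the llama-cpp-python package
# and a quantised GGUF model file downloaded beforehand (name is a
# placeholder).
from llama_cpp import Llama

llm = Llama(model_path="./models/small-model-q4.gguf", n_ctx=4096)

response = llm(
    "Explain in one sentence what a YARA rule does.",
    max_tokens=100,
    temperature=0.2,  # low temperature for more deterministic output
)
print(response["choices"][0]["text"])
```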
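Small local models have no built-in function-calling API, so tool use is typically emulated: the prompt instructs the model to reply with JSON naming one of the available tools, and the host program parses and dispatches it. A minimal sketch of that pattern follows; the tool names and prompt wording are illustrative, and the system prompt also shows the kind of prompt engineering covered in the walkthrough:

```python
# Minimal function-calling sketch: the model is asked to reply with
# JSON selecting a tool; the host parses it and calls the matching
# Python function. Tool names and prompt wording are illustrative.
import json
import subprocess

def run_yara(rule_path: str, target: str) -> str:
    """Run the yara CLI against a file (assumes yara is installed)."""
    return subprocess.run(["yara", rule_path, target],
                          capture_output=True, text=True).stdout

def search_notes(query: str) -> str:
    # Placeholder: in the real tool this would query the vector index.
    return f"(no notes matching {query!r})"

TOOLS = {"run_yara": run_yara, "search_notes": search_notes}

SYSTEM_PROMPT = (
    "You can use these tools: run_yara(rule_path, target), "
    "search_notes(query). Reply ONLY with JSON like "
    '{"tool": "...", "args": {...}}.'
)

def dispatch(model_reply: str) -> str:
    call = json.loads(model_reply)   # may raise; validate in real code
    fn = TOOLS[call["tool"]]         # reject unknown tools in real code
    return fn(**call["args"])
```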
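The final step, retrieving context for the model, can be sketched with the same assumed embedding library and an in-memory list; a dedicated vector database (e.g. Chroma or FAISS) would replace the list in a real build:

```python
# Minimal retrieval-augmented sketch: embed note snippets, retrieve the
# closest ones to a question, and prepend them to the prompt. Assumes
# sentence-transformers; the note texts are placeholders.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

notes = [
    "Sigma rules use a YAML layout with title, logsource and detection keys.",
    "YARA strings can be text, hex or regex, combined in the condition.",
    "wevtutil cl clears a Windows event log channel.",
]
note_vecs = embedder.encode(notes, convert_to_tensor=True)

def build_prompt(question: str, k: int = 2) -> str:
    """Return a prompt with the k most similar notes prepended as context."""
    q_vec = embedder.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_vec, note_vecs)[0]
    top = sorted(zip(notes, scores.tolist()), key=lambda x: -x[1])[:k]
    context = "\n".join(n for n, _ in top)
    return f"Use this context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("How is a Sigma rule structured?"))
```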

Georgelin Manuel – K7 Computing

Georgelin Manuel graduated from Cochin University of Science and Technology with a Master’s degree in Computer and Information Science. Since joining K7 Computing, he has focused on enhancing product features through data mining and machine learning techniques. He has previously spoken at AVAR and CARO.