# Setting up the config

To adapt this project to your own environment, you need to understand a few concepts used in our settings files.

We have three base settings files:

* `.env`: used to set environment variables for the project;
* `knowledge.csv`: used to store the knowledge base that you want to use in the project;
* `prompt.toml`: used to set the prompt configuration for the project.

To use this project, you need to have a `.csv` file with the knowledge base and a `.toml` file with your prompt configuration. If you just want to test the project, you can use the files we have in the repository's `sample-data` directory.

If you want to use custom files, as mentioned in the quick start section, we recommend creating a folder called `data` inside this project and putting your CSV and TOML files there.

## Environment Variables

Look at the [`.env.sample`](https://github.com/talkdai/dialog/blob/main/docs/.env.sample) file to see the environment variables needed to run the project.

Here is a brief explanation of the environment variables:

* `DATABASE_URL`: the URL to connect to the database. It is used by the `pgvector` extension to store the embeddings generated by the `load_csv.py` script.
* `OPENAI_API_KEY`: the OpenAI API key used to connect to the OpenAI API.
* `PROJECT_CONFIG`: the path to the `.toml` file with the prompt configuration.
* `DIALOG_DATA_PATH`: the path to the knowledge base CSV file.
* `PORT`: the port where the application will run. The default is `8000`.
* `VERBOSE_LLM`: if set to `true`, the LLM returns the full response from LangChain's debug output. The default is `false`.
* `DIALOG_LOADCSV_EMBED_COLUMNS`: the columns of the knowledge base (csv) that will be used to generate the embedding. By default, the `content` column is used.
* `DIALOG_LOADCSV_CLEARDB`: if set to `true`, the `load_csv.py` script deletes all previously imported vectors and reimports everything.
* `COSINE_SIMILARITY_THRESHOLD`: the cosine similarity threshold used to filter the results from the database's similarity query. The default is `0.5`.
* `PLUGINS`: a comma-separated list of paths to the plugins that will be loaded into the application, for example `dialog-whatsapp,plugins.my-custom-plugin`.
* `OPENWEB_UI_SESSION`: the session ID used to connect to the OpenWeb UI API. By default, it is `dialog-openweb-ui` and it can be changed.
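
The sketch below shows one way the variables above could be read in Python, applying the defaults listed in this section. It is an illustration only and not the project's actual settings module: the `Settings` class and the `load_settings`, `_as_bool`, and `_as_list` helpers are hypothetical names, and treating `DATABASE_URL`, `OPENAI_API_KEY`, `PROJECT_CONFIG`, and `DIALOG_DATA_PATH` as required (as well as defaulting `DIALOG_LOADCSV_CLEARDB` to `false`) is an assumption.

```python
import os
from dataclasses import dataclass


def _as_bool(raw: str) -> bool:
    """Treat "true" (any casing) as True and everything else as False."""
    return raw.strip().lower() == "true"


def _as_list(raw: str) -> list:
    """Split a comma-separated value, dropping blanks and extra spaces."""
    return [item.strip() for item in raw.split(",") if item.strip()]


@dataclass(frozen=True)
class Settings:  # hypothetical container, not part of the project
    database_url: str
    openai_api_key: str
    project_config: str
    data_path: str
    port: int
    verbose_llm: bool
    embed_columns: list
    clear_db: bool
    cosine_similarity_threshold: float
    plugins: list
    openweb_ui_session: str


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping; falls back to os.environ."""
    env = os.environ if env is None else env
    return Settings(
        # Assumed to be required: no default is documented for these four.
        database_url=env["DATABASE_URL"],
        openai_api_key=env["OPENAI_API_KEY"],
        project_config=env["PROJECT_CONFIG"],
        data_path=env["DIALOG_DATA_PATH"],
        # Documented defaults.
        port=int(env.get("PORT", "8000")),
        verbose_llm=_as_bool(env.get("VERBOSE_LLM", "false")),
        embed_columns=_as_list(env.get("DIALOG_LOADCSV_EMBED_COLUMNS", "content")),
        cosine_similarity_threshold=float(env.get("COSINE_SIMILARITY_THRESHOLD", "0.5")),
        openweb_ui_session=env.get("OPENWEB_UI_SESSION", "dialog-openweb-ui"),
        # Assumed defaults (not stated in this document).
        clear_db=_as_bool(env.get("DIALOG_LOADCSV_CLEARDB", "false")),
        plugins=_as_list(env.get("PLUGINS", "")),
    )


# Usage with an explicit mapping instead of the real environment.
example = load_settings({
    "DATABASE_URL": "postgresql://user:password@localhost:5432/dialog",
    "OPENAI_API_KEY": "sk-placeholder",
    "PROJECT_CONFIG": "data/prompt.toml",
    "DIALOG_DATA_PATH": "data/knowledge.csv",
    "PLUGINS": "dialog-whatsapp,plugins.my-custom-plugin",
})
```

Passing a plain dictionary keeps the example independent of whatever is set in your shell; calling `load_settings()` with no argument would read the real environment instead.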

## CSV format

Here is a simple example of a CSV file with the knowledge base:

```csv
category,subcategory,question,content
faq,promotions,loyalty-program,"The company XYZ has a loyalty program when you refer new customers you get a discount on your next purchase, ..."
```

When the `dialog` service starts, it loads the knowledge base into the database through the `src/load_csv.py` script, so make sure the database is up and the paths are correctly configured (see the [environment variables](#environment-variables) section). Alternatively, inside the `src` folder, run `make load-data path="<path-to-your-knowledge-base>.csv"`.
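
Before loading a custom file, it can help to check that it parses and has the columns you expect. The following is a minimal sketch using Python's standard `csv` module. The `read_knowledge_base` function is a hypothetical helper, and requiring all four columns from the example is an assumption of this sketch, not a documented rule of `load_csv.py`.

```python
import csv
import io

# Assumption: the columns shown in the example CSV are the ones to expect.
EXPECTED_COLUMNS = ["category", "subcategory", "question", "content"]


def read_knowledge_base(stream):
    """Return the CSV rows as dicts, raising if an expected column is missing."""
    reader = csv.DictReader(stream)
    header = reader.fieldnames or []
    missing = [col for col in EXPECTED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")
    return list(reader)


# An in-memory file stands in for open("data/knowledge.csv", newline="").
sample = io.StringIO(
    "category,subcategory,question,content\n"
    'faq,promotions,loyalty-program,"The company XYZ has a loyalty program, ..."\n'
)
rows = read_knowledge_base(sample)
```

Note that the quoted `content` field may contain commas without breaking the row, which is why the example CSV wraps it in double quotes.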

### Embedding columns

By default, the `load_csv.py` script uses the `content` column to generate the embeddings.

To embed more columns together, add the `DIALOG_LOADCSV_EMBED_COLUMNS` environment variable to `.env` with the desired columns. This is typically useful for Q&A data, where the question and the answer are in different columns, for example:

```
DIALOG_LOADCSV_EMBED_COLUMNS=question,answer
```
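
To make the idea of embedding several columns together concrete, here is a small sketch of how the selected columns of one row might be merged into a single text before it is sent to the embedding model. The `build_embedding_text` function is hypothetical, and joining the values with a newline is an assumption; the actual script may combine them differently.

```python
def build_embedding_text(row: dict, columns: list) -> str:
    """Join the selected column values in order, skipping empty ones."""
    parts = [str(row[col]).strip() for col in columns if row.get(col)]
    return "\n".join(parts)  # assumption: newline as the separator


row = {
    "question": "How does the loyalty program work?",
    "answer": "Refer a new customer and get a discount on your next purchase.",
}
text = build_embedding_text(row, ["question", "answer"])
```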

### Generating embeddings with `load_csv.py`

Embeddings create a vector representation of a question and answer pair from the knowledge base, enabling semantic search where we look for text passages that are most similar in the vector space.
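
As a rough illustration of similarity in vector space, the sketch below computes cosine similarity in plain Python and keeps only passages above a threshold, similar in spirit to `COSINE_SIMILARITY_THRESHOLD`. In the project this comparison happens in the database, so the `cosine_similarity` and `most_similar` functions, the two-dimensional toy vectors, and the use of `>=` for the cutoff are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, passages, threshold=0.5):
    """Return (text, score) pairs scoring at least `threshold`, best first."""
    scored = [(text, cosine_similarity(query, vector)) for text, vector in passages]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy 2-D vectors; real embeddings have many more dimensions.
passages = [
    ("loyalty program", [1.0, 0.0]),
    ("shipping policy", [0.0, 1.0]),
    ("referral discount", [0.8, 0.6]),
]
results = most_similar([1.0, 0.0], passages)
```

Here the "shipping policy" passage is orthogonal to the query, scores 0, and is filtered out, while the other two are returned in order of similarity.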

We have a CLI that generates embeddings by reading the knowledge base `csv`. By default, `load_csv.py` performs a **diff** between the existing vector database and the new questions and answers in the `csv`.
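
The diff step can be pictured as "embed only what is not stored yet". The sketch below is one hedged interpretation: the `diff_knowledge_base` function is hypothetical, and identifying an entry by its `question` and `content` values is an assumption, since this document does not say how `load_csv.py` compares rows.

```python
def diff_knowledge_base(existing_keys, csv_rows, key_columns=("question", "content")):
    """Return only the CSV rows whose key is not already stored."""
    seen = set(existing_keys)
    new_rows = []
    for row in csv_rows:
        key = tuple(row[col] for col in key_columns)
        if key not in seen:
            seen.add(key)  # also skips duplicates inside the CSV itself
            new_rows.append(row)
    return new_rows


# Keys assumed to be already present in the vector database.
existing = {("loyalty-program", "The company XYZ has a loyalty program, ...")}
csv_rows = [
    {"question": "loyalty-program", "content": "The company XYZ has a loyalty program, ..."},
    {"question": "shipping", "content": "Orders ship within two business days."},
]
to_embed = diff_knowledge_base(existing, csv_rows)
```

Skipping unchanged rows avoids paying for embeddings that already exist, which is the practical benefit of diffing instead of reimporting everything.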

The **CLI** has some parameters:

* `--path`: path to the CSV (knowledge base);
* `--cleandb`: deletes all previously imported vectors and reimports everything. **In Docker**, set the environment variable `DIALOG_LOADCSV_CLEARDB` to `true` to enable this option;
* `--columns`: defines the columns of the knowledge base (csv) that will be used to generate the embedding. By default, the `content` column is used.
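
For readers who want to see how such parameters and their environment-variable counterparts can fit together, here is an `argparse` sketch that mirrors the three options listed above. It is not the source of `load_csv.py`: the `parse_args` function is hypothetical, and both the comma-separated format for `--columns` and the way environment variables supply the defaults are assumptions.

```python
import argparse
import os


def parse_args(argv=None, env=None):
    """Parse the documented options, using environment variables as defaults."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Load a knowledge base CSV")
    parser.add_argument("--path", required=True, help="path to the CSV (knowledge base)")
    parser.add_argument(
        "--cleandb",
        action="store_true",
        # Assumption: the environment variable turns the option on by default.
        default=env.get("DIALOG_LOADCSV_CLEARDB", "false").lower() == "true",
        help="delete previously imported vectors and reimport everything",
    )
    parser.add_argument(
        "--columns",
        # Assumption: columns are given as a comma-separated string.
        default=env.get("DIALOG_LOADCSV_EMBED_COLUMNS", "content"),
        help="columns used to generate the embedding",
    )
    args = parser.parse_args(argv)
    args.columns = [col.strip() for col in args.columns.split(",") if col.strip()]
    return args


# Usage with explicit arguments and an empty environment.
args = parse_args(["--path", "data/knowledge.csv"], env={})
```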

## `.toml` prompt configuration

The `[prompt.header]`, `[prompt.suggested]`, and `[prompt.fallback]` fields are mandatory. They are used for processing the conversation and connecting to the LLM.

The `[prompt.fallback]` field is used when no compatible embedding is found in the database; in that case, `[prompt.header]` **is ignored** and `[prompt.fallback]` is used instead. Without it, the LLM could hallucinate answers to questions outside the scope of the embeddings.

> In `[prompt.fallback]` the response will be processed by the LLM. If you need to return a default message when there is no recommended question in the knowledge base, use the `[prompt.fallback_not_found_relevant_contents]` configuration in the `.toml` *(project configuration)*.
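
The relationship between these fields can be summarized as a small decision function. The sketch below is an interpretation of the description above, not the project's code: `choose_prompt` is a hypothetical name, the way the header, suggested text, and contents are concatenated is an assumption, and so is giving `fallback_not_found_relevant_contents` precedence over `fallback` whenever it is set.

```python
def choose_prompt(prompt: dict, relevant_contents: list):
    """Return (system_prompt, canned_reply); exactly one of the two is None.

    system_prompt is meant to be sent to the LLM, while canned_reply is
    returned to the user as-is, without calling the LLM.
    """
    if relevant_contents:
        context = "\n".join(relevant_contents)
        return f"{prompt['header']}\n{prompt['suggested']}\n{context}", None
    canned = prompt.get("fallback_not_found_relevant_contents")
    if canned:
        return None, canned  # default message, not processed by the LLM
    return prompt["fallback"], None  # header is ignored, fallback goes to the LLM


prompt = {
    "header": "You are Avelino from XYZ.",
    "suggested": "Here is some possible content that could help the user.",
    "fallback": "Politely say that you could not find a relevant answer.",
}
with_context = choose_prompt(prompt, ["The company XYZ has a loyalty program, ..."])
without_context = choose_prompt(prompt, [])
```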

It is also possible to add prompt information for specific subcategories and to set optional LLM parameters such as `temperature` (defaults to 0.2) or `model_name`. See below for an example of a complete configuration:

```toml
[model]
temperature = 0.2
model_name = "gpt-3.5-turbo"

[prompt]
header = """You are a service operator called Avelino from XYZ, you are an expert in providing
qualified service to high-end customers. Be brief in your answers, without being long-winded
and objective in your responses. Never say that you are a model (AI), always answer as Avelino.
Be polite and friendly!"""

suggested = "Here is some possible content that could help the user in a better way."

fallback = "I'm sorry, I couldn't find a relevant answer for your question."

fallback_not_found_relevant_contents = "I'm sorry, I couldn't find a relevant answer for your question."

[prompt.subcategory.loyalty-program]

header = """The client is interested in the loyalty program, and needs to be responded to in a
salesy way; the loyalty program is our growth strategy."""
```

> This feature is experimental and may not work as expected, so use it carefully in your environment.

## Monitoring your LLM with LangSmith

If you wish to add observability to your LLM application, you may want to use [LangSmith](https://docs.smith.langchain.com/) (so far, for personal use only) to help debug, test, evaluate, and monitor the chains used in dialog.

Follow the [setup instructions](https://docs.smith.langchain.com/setup) and add the env vars into the `.env` file:

```
LANGCHAIN_TRACING_V2=true
LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
LANGCHAIN_API_KEY=<YOUR_LANGCHAIN_API_KEY>
LANGCHAIN_PROJECT=<YOUR_LANGCHAIN_PROJECT>
```
