# Data Preprocessing

By default, the project removes **phone numbers, ID numbers, emails, and URLs** from the data.\
In the `settings.jsonc` file, there is also a **blocked words list** (`blocked_words`) where you can add specific words or phrases to filter out. (Entire sentences containing blocked words will be removed by default.)

> **Important**\
> 🚨 *Please make sure to protect personal privacy and do not leak any personal information!*

To preprocess your data, run the following command:

```
weclone-cli make-dataset
```

You can modify `make_dataset_args` in `settings.jsonc` to match your chat style.

Currently, only the **time-window strategy** is supported:

* Messages from the same person within a certain time (`single_combine_time_window`) will be merged into a single sentence using commas
* Q\&A pairs will be matched based on `qa_match_time_window`

You can enable `enable_clean` under `clean_dataset` to clean your data for better results.

The system currently supports **LLM-based scoring** of chat records using either:

* **Offline inference** via `vllm`, or
* **Online API-based inference**

To enable API-based inference, set `"online_llm_clear": true` in `settings.jsonc` and configure:

* `base_url`
* `llm_api_key`
* `model_name`

All models compatible with the **OpenAI API interface** can be used.

Once you have the **distribution of LLM scores**, you can use the `accept_score` parameter to filter acceptable score ranges. You may also **lower the `lora_dropout`** parameter under `train_sft_args` to improve the model’s ability to fit the data.
