Data Preprocessing
By default, the project removes phone numbers, ID numbers, emails, and URLs from the data.
In the settings.jsonc file, there is also a blocked words list (blocked_words) where you can add specific words or phrases to filter out. (Entire sentences containing blocked words will be removed by default.)
Important 🚨 Please make sure to protect personal privacy and do not leak any personal information!
To preprocess your data, run the following command:
weclone-cli make-datasetYou can modify make_dataset_args in settings.jsonc to match your chat style.
Currently, only the time-window strategy is supported:
Messages from the same person within a certain time (
single_combine_time_window) will be merged into a single sentence using commasQ&A pairs will be matched based on
qa_match_time_window
You can enable enable_clean under clean_dataset to clean your data for better results.
The system currently supports LLM-based scoring of chat records using either:
Offline inference via
vllm, orOnline API-based inference
To enable API-based inference, set "online_llm_clear": true in settings.jsonc and configure:
base_urlllm_api_keymodel_name
All models compatible with the OpenAI API interface can be used.
Once you have the distribution of LLM scores, you can use the accept_score parameter to filter acceptable score ranges. You may also lower the lora_dropout parameter under train_sft_args to improve the model’s ability to fit the data.
Last updated