Data Preprocessing

By default, the project removes phone numbers, ID numbers, emails, and URLs from the data. In the settings.jsonc file, there is also a blocked words list (blocked_words) where you can add specific words or phrases to filter out. (Entire sentences containing blocked words will be removed by default.)

Important 🚨 Please make sure to protect personal privacy and do not leak any personal information!

To preprocess your data, run the following command:

weclone-cli make-dataset

You can modify make_dataset_args in settings.jsonc to match your chat style.

Currently, only the time-window strategy is supported:

  • Messages from the same person within a certain time (single_combine_time_window) will be merged into a single sentence using commas

  • Q&A pairs will be matched based on qa_match_time_window

You can enable enable_clean under clean_dataset to clean your data for better results.

The system currently supports LLM-based scoring of chat records using either:

  • Offline inference via vllm, or

  • Online API-based inference

To enable API-based inference, set "online_llm_clear": true in settings.jsonc and configure:

  • base_url

  • llm_api_key

  • model_name

All models compatible with the OpenAI API interface can be used.

Once you have the distribution of LLM scores, you can use the accept_score parameter to filter acceptable score ranges. You may also lower the lora_dropout parameter under train_sft_args to improve the model’s ability to fit the data.

Last updated