Resources available on the Hugging Face platform:

  • dataset for training information retrieval and embedding models: radlab/polish-sts-dataset
  • data for pre-training/fine-tuning models on predominantly legal language, available as JSONL: radlab/legal-mc4-pl
  • Polish Wikipedia data for training models, similar to legal-mc4-pl: radlab/wikipedia-pl
  • the kgr10 corpus from Wrocław University of Technology, available as JSONL text, for model pre-training/fine-tuning: radlab/kgr10
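
One way to pull these resources is the `datasets` library. The sketch below is only an illustration: the short aliases and the `load_radlab` helper are hypothetical, and it assumes each repository can be read by `load_dataset` and exposes a `train` split, which should be checked on the dataset pages.

```python
# Repository ids copied from the list above; the short aliases are hypothetical.
RADLAB_DATASETS = {
    "sts": "radlab/polish-sts-dataset",
    "legal": "radlab/legal-mc4-pl",
    "wikipedia": "radlab/wikipedia-pl",
    "kgr10": "radlab/kgr10",
}


def load_radlab(name, split="train", streaming=True):
    """Load one of the listed datasets by its short alias."""
    # Imported lazily so the mapping stays usable without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the repository exposes a split named `split`.
    # streaming=True iterates over records without downloading the full corpus.
    return load_dataset(RADLAB_DATASETS[name], split=split, streaming=streaming)
```

For example, `next(iter(load_radlab("kgr10")))` should return the first record, which is a quick way to inspect the field names before starting pre-training or fine-tuning.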