Resources available on the Hugging Face platform:
- Dataset for training models used in information retrieval and for building embeddings: radlab/polish-sts-dataset
- Data for pre-training/fine-tuning models on predominantly legal language, available as JSONL: radlab/legal-mc4-pl
- Polish Wikipedia data for training models, similar in purpose to legal-mc4-pl: radlab/wikipedia-pl
- The kgr10 corpus from Wrocław University of Technology, available as JSONL text, for model pre-training/fine-tuning: radlab/kgr10
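
Since several of the resources above are distributed as JSONL (one JSON object per line), a minimal sketch of reading such a file with the Python standard library may be useful. The `text` field name and the sample records below are assumptions made for illustration only; check the actual schema of each dataset before relying on them.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"invalid JSON on line {line_number}: {err}") from err


# Hypothetical in-memory sample; the real files may use different field names.
sample = io.StringIO(
    '{"text": "Przykładowy tekst prawny."}\n'
    "\n"
    '{"text": "Przykładowy artykuł z Wikipedii."}\n'
)

records = list(iter_jsonl(sample))
texts = [record["text"] for record in records]
print(len(records), texts[0])
```

For a file on disk, the same helper can be fed an open handle, for example `with open(path, encoding="utf-8") as fh: ...`, where `path` points to a locally downloaded JSONL file. Reading line by line keeps memory use flat, which matters for pre-training corpora of this kind.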