Comparison of Embedding Methods for Retrieval Under Noisy Institutional Labels

D 2025

NOVOTNÁ, Tereza and Jakub HARAŠTA

Basic information

Original title

Comparison of Embedding Methods for Retrieval Under Noisy Institutional Labels

Authors

NOVOTNÁ, Tereza and Jakub HARAŠTA

Edition

Amsterdam, JURIX 2025 Proceedings (Frontiers in Artificial Intelligence and Applications, volume 416: Legal Knowledge and Information Systems), pp. 324-329, 6 pp., 2025

Publisher

IOS Press

Other information

Language

English

Type of outcome

Proceedings paper

Field of study

50501 Law

Country of publisher

Netherlands

Confidentiality

is not subject to a state or trade secret

Publication form

electronic version available online

Links

Open access article

Marked for transfer to RIV

Yes

RIV code

RIV/00216224:14220/25:00142844

Organization unit

Faculty of Law

ISBN

978-1-64368-638-7

Keywords in English

legal information retrieval; case law; embeddings; evaluation; noisy labels; Czech Constitutional Court

Tags

rivok

Flags

International impact, Reviewed

Changed: 16. 3. 2026 14:45, JUDr. Mgr. Jakub Harašta, Ph.D.

Abstract

In the original

Retrieving relevant case law remains a time-consuming task. We compare two embedding models for Czech Constitutional Court decisions: (i) a large general-purpose OpenAI embedder and (ii) a domain-specific BERT trained from scratch on ∼34,000 decisions. We introduce a noise-aware evaluation using IDF-weighted keyword overlap as graded relevance, dual thresholds (0.20, 0.28), paired-bootstrap significance, and nDCG diagnostics. Despite conservative absolute nDCG due to noisy institutional labels, the OpenAI embedder consistently and significantly outperforms the domain BERT across all ranks and thresholds. Our framework enables robust evaluation under imperfect gold standards typical of legacy judicial databases.
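The abstract names the building blocks of the evaluation but does not give formulas, so the following is only a minimal sketch of how such a pipeline could look. Everything specific here is an assumption made for illustration: the smoothed IDF, the use of a weighted Jaccard ratio as the "keyword overlap", linear gains in nDCG, the one-sided bootstrap test, and all function names. Only the two thresholds (0.20 and 0.28) come from the abstract.

```python
import math
import random
from collections import Counter


def idf_weights(keyword_sets):
    """Smoothed IDF per keyword over all label sets (assumed weighting)."""
    n = len(keyword_sets)
    df = Counter(k for ks in keyword_sets for k in set(ks))
    return {k: math.log((n + 1) / (c + 1)) + 1.0 for k, c in df.items()}


def graded_relevance(query_kw, doc_kw, idf, threshold=0.20):
    """IDF-weighted Jaccard overlap; scores below the threshold count as 0."""
    q, d = set(query_kw), set(doc_kw)
    union = sum(idf.get(k, 0.0) for k in q | d)
    if union == 0.0:
        return 0.0
    score = sum(idf.get(k, 0.0) for k in q & d) / union
    return score if score >= threshold else 0.0


def ndcg_at_k(ranked_gains, all_gains, k):
    """nDCG@k with linear gains; the ideal ranking sorts all candidate gains."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = dcg(sorted(all_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0


def paired_bootstrap_p(scores_a, scores_b, n_resamples=2000, seed=0):
    """Share of query resamples in which system A does not beat system B."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample per-query differences with replacement, keeping pairs intact.
        if sum(diffs[rng.randrange(n)] for _ in range(n)) <= 0:
            not_better += 1
    return not_better / n_resamples


# Toy label sets standing in for institutional keywords (hypothetical data).
labels = [["a", "b"], ["a"], ["c"]]
idf = idf_weights(labels)
lenient = graded_relevance(["a", "b"], ["a", "c"], idf, threshold=0.20)
strict = graded_relevance(["a", "b"], ["a", "c"], idf, threshold=0.28)
```

In this toy example the overlap score falls between the two thresholds, so the pair counts as relevant under the lenient setting and as non-relevant under the strict one. Under these assumptions, that is the kind of sensitivity a dual-threshold comparison would expose.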

Related projects

MPO 60273/24/21300/21000, internal MU code

Name: CEDMO 2.0 NPO

Investor: Ministry of Industry and Trade of the Czech Republic, CEDMO 2.0 NPO

MUNI/G/1142/2022, internal MU code

Name: Forensic Support for Building Trust in Smart Software Ecosystems

Investor: Masaryk University, Forensic Support for Building Trust in Smart Software Ecosystems, INTERDISCIPLINARY - Interdisciplinary research projects

Cite

NOVOTNÁ, Tereza and Jakub HARAŠTA. Comparison of Embedding Methods for Retrieval Under Noisy Institutional Labels. Online. In Réka Markovich, Luigi Di Caro, Amon Rapp, Claudio Schifanella. JURIX 2025 Proceedings (Frontiers in Artificial Intelligence and Applications, volume 416: Legal Knowledge and Information Systems). Amsterdam: IOS Press, 2025, pp. 324-329. ISBN 978-1-64368-638-7. Available from: https://doi.org/10.3233/FAIA251605.

@inproceedings{2538137,
   author = {Novotná, Tereza and Harašta, Jakub},
   address = {Amsterdam},
   booktitle = {JURIX 2025 Proceedings (Frontiers in Artificial Intelligence and Applications, volume 416: Legal Knowledge and Information Systems)},
   doi = {10.3233/FAIA251605},
   editor = {Réka Markovich and Luigi Di Caro and Amon Rapp and Claudio Schifanella},
   keywords = {legal information retrieval; case law; embeddings; evaluation; noisy labels; Czech Constitutional Court},
   howpublished = {electronic version available online},
   language = {eng},
   location = {Amsterdam},
   isbn = {978-1-64368-638-7},
   pages = {324--329},
   publisher = {IOS Press},
   title = {Comparison of Embedding Methods for Retrieval Under Noisy Institutional Labels},
   url = {https://ebooks.iospress.nl/volumearticle/76912},
   year = {2025}
}

TY  - CONF
ID  - 2538137
AU  - Novotná, Tereza
AU  - Harašta, Jakub
PY  - 2025
TI  - Comparison of Embedding Methods for Retrieval Under Noisy Institutional Labels
PB  - IOS Press
CY  - Amsterdam
SN  - 9781643686387
KW  - legal information retrieval
KW  - case law
KW  - embeddings
KW  - evaluation
KW  - noisy labels
KW  - Czech Constitutional Court
UR  - https://ebooks.iospress.nl/volumearticle/76912
N2  - Retrieving relevant case law remains a time-consuming task. We compare two embedding models for Czech Constitutional Court decisions: (i) a large general-purpose OpenAI embedder and (ii) a domain-specific BERT trained from scratch on ∼34,000 decisions. We introduce a noise-aware evaluation using IDF-weighted keyword overlap as graded relevance, dual thresholds (0.20, 0.28), paired-bootstrap significance, and nDCG diagnostics. Despite conservative absolute nDCG due to noisy institutional labels, the OpenAI embedder consistently and significantly outperforms the domain BERT across all ranks and thresholds. Our framework enables robust evaluation under imperfect gold standards typical of legacy judicial databases.
ER  -

NOVOTNÁ, Tereza and Jakub HARAŠTA. Comparison of Embedding Methods for Retrieval Under Noisy Institutional Labels. Online. In Réka Markovich, Luigi Di Caro, Amon Rapp, Claudio Schifanella. \textit{JURIX 2025 Proceedings (Frontiers in Artificial Intelligence and Applications, volume 416: Legal Knowledge and Information Systems)}. Amsterdam: IOS Press, 2025, pp.~324-329. ISBN~978-1-64368-638-7. Available from: https://doi.org/10.3233/FAIA251605.
