Tested chunking + embeddings data from 3 production websites. [P]
r/MachineLearning
Tiered + page-role-aware RAG retrieval results across 3 corpora with very different content density:

| Workspace | Sources | Chunks | HIGH | MEDIUM | LOW | REJECTED |
|-----------|---------|--------|------|--------|------|----------|
| Intercom  | 188     | 941    | 96   | 200    | 541  | 104      |
| HubSpot   | 251     | 1705   | 40   | 508    | 1153 | 4        |
| KPMG      | 53      | 209    | 3    | 14     | 127  | 65       |

(HIGH = avg operational score 0.84, MEDIUM = 0.55-0.65, LOW = 0, REJECTED = na/legal/careers)

87 of Intercom's 96 HIGH chunks are help-center articles. HubSpot's HIGH chunks are concrete case studies ("23% increase in ACV"). KPMG's HIGH tier is basically empty (3 of 209 chunks) because the entire corpus is positioning prose.
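
To compare the three corpora on equal footing, it helps to turn the raw counts reported above into per-tier shares. The snippet below is only a sketch of that arithmetic: the counts are copied from the post, while the names `CORPORA`, `TIERS`, and `tier_shares` are made up for illustration and are not part of the actual pipeline.

```python
# Counts copied from the results above; the structure and names are illustrative.
CORPORA = {
    "Intercom": {"sources": 188, "chunks": 941, "HIGH": 96, "MEDIUM": 200, "LOW": 541, "REJECTED": 104},
    "HubSpot": {"sources": 251, "chunks": 1705, "HIGH": 40, "MEDIUM": 508, "LOW": 1153, "REJECTED": 4},
    "KPMG": {"sources": 53, "chunks": 209, "HIGH": 3, "MEDIUM": 14, "LOW": 127, "REJECTED": 65},
}

TIERS = ("HIGH", "MEDIUM", "LOW", "REJECTED")


def tier_shares(row: dict) -> dict:
    """Return each tier's fraction of the corpus's chunks.

    Raises ValueError if the tier counts do not add up to the chunk total,
    which is a cheap sanity check on the transcribed numbers.
    """
    total = sum(row[tier] for tier in TIERS)
    if total != row["chunks"]:
        raise ValueError(f"tier counts sum to {total}, expected {row['chunks']}")
    return {tier: row[tier] / total for tier in TIERS}


if __name__ == "__main__":
    for name, row in CORPORA.items():
        shares = tier_shares(row)
        formatted = ", ".join(f"{tier} {shares[tier]:.1%}" for tier in TIERS)
        print(f"{name}: {formatted}")
```

The tier counts sum to the chunk total in every row, and the HIGH share comes out at roughly 10% for Intercom, 2% for HubSpot, and 1% for KPMG, which matches the density differences described above.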