Building a high-performance RAG knowledge base
The quality of your chatbot's answers depends directly on the quality of your knowledge base. This guide explains how to optimise it to get precise, relevant responses.
How indexing works (in plain English)
Chatbot Flow uses RAG (Retrieval-Augmented Generation). Here is what happens behind the scenes, from crawling your site to answering a visitor's question:
Automatic crawl of your site
Every 24 hours, our servers crawl your site via the WordPress REST API. Only pages that are new or have been modified since the last crawl are reprocessed, which keeps the knowledge base current without unnecessary load.
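To make the incremental crawl concrete, here is a minimal Python sketch of the "only new or modified pages" rule. It is an illustration, not Chatbot Flow's actual crawler: the function names are hypothetical, and it assumes a WordPress version whose REST API supports the `modified_after` filter and returns a `modified_gmt` field for each page.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def build_pages_url(site: str, last_crawl: datetime, per_page: int = 100) -> str:
    """Build a WordPress REST API URL listing pages modified after last_crawl.

    Hypothetical helper: assumes the site's REST API accepts `modified_after`.
    """
    query = urlencode({
        "modified_after": last_crawl.strftime("%Y-%m-%dT%H:%M:%S"),
        "per_page": per_page,
        "_fields": "id,link,modified_gmt",
    })
    return f"{site.rstrip('/')}/wp-json/wp/v2/pages?{query}"


def pages_to_reprocess(pages: list[dict], last_crawl: datetime) -> list[int]:
    """Return the ids of pages whose modified_gmt is later than last_crawl."""
    changed = []
    for page in pages:
        # modified_gmt carries no timezone suffix, so we attach UTC explicitly.
        modified = datetime.fromisoformat(page["modified_gmt"]).replace(tzinfo=timezone.utc)
        if modified > last_crawl:
            changed.append(page["id"])
    return changed
```

For example, with a last crawl at midnight UTC on 1 May, a page modified on 30 April is skipped and a page modified on 2 May is queued for reprocessing.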
Semantic chunking of content
Each page is split into semantic "chunks": coherent text blocks that are neither too large nor too small. This step is critical: good chunking allows the chatbot to retrieve exactly the right passage to answer a question.
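The idea of "neither too large nor too small" can be sketched in a few lines of Python. This is a deliberately simplified stand-in (greedy packing of paragraphs up to a size limit), not the semantic chunker Chatbot Flow actually runs. The function name and the 500-character default are assumptions chosen for illustration.

```python
def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack paragraphs into chunks of roughly max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut
    mid-sentence, so a chunk can exceed the limit in that one case.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Still fits (or the chunk is empty): keep accumulating.
            current = candidate
        else:
            # Adding this paragraph would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Because chunks only break at paragraph boundaries, short, focused paragraphs give this kind of splitter more places to cut cleanly.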
Vectorisation and pgvector storage
Each chunk is converted into a numerical vector (embedding) and stored in your dedicated pgvector database. This mathematical representation makes it possible to find passages that are semantically close to a question, even if the exact words don't match.
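What "semantically close" means can be illustrated with cosine similarity, one common way to compare embeddings (pgvector exposes cosine distance through its `<=>` operator). The sketch below uses tiny hand-made 2-dimensional vectors. Real embeddings have hundreds of dimensions, and which distance metric Chatbot Flow uses is not stated here, so treat the metric and the function names as assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def nearest_chunks(query_vec: list[float],
                   chunk_vecs: dict[str, list[float]],
                   k: int = 3) -> list[str]:
    """Return the names of the k chunks most similar to the query vector."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]
```

With toy vectors such as `{"delivery": [1.0, 0.0], "shipping": [0.9, 0.1], "refund": [0.0, 1.0]}`, a query vector of `[1.0, 0.0]` ranks "delivery" and "shipping" ahead of "refund", even though the words themselves differ.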
Hybrid search at query time
When a visitor asks a question, the system combines vector search (semantic) and text search (keywords) to find the most relevant passages, then passes them to the AI model to formulate a response.
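How the two result lists are merged is not detailed above. One widely used approach is reciprocal rank fusion (RRF), sketched below in Python as an assumption about how such a combination could work, not as a description of Chatbot Flow's internals. The function name and the constant `k = 60` (a conventional default for RRF) are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids into one fused ranking.

    Each chunk scores 1 / (k + rank) in every list where it appears,
    so chunks ranked well by both searches rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for stability.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))
```

A chunk that appears in both the semantic ranking and the keyword ranking outranks one that appears in only one of them, which is the behaviour you want from a hybrid search.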
Which pages to index first
Not all pages are equally valuable for the chatbot. Focus on high-information pages first:
- FAQ page: The most valuable resource. Each question/answer is a perfect chunk for RAG. If you don't have a FAQ, create one.
- Product and service pages: Detailed descriptions, features, use cases, pricing, lead times... The more complete your description, the better the answers.
- "About" page: Who you are, where you are based, since when, what your mission is. Visitors frequently ask for this information.
- Pricing pages: Plans, prices, what's included, what's not, refund policy.
- Documentation and tutorials: For SaaS products, technical documentation is a perfect corpus for RAG.
- Blog articles: Particularly relevant if your blog covers topics directly related to your products or services.
Adding supplementary content: free-form text blocks
Some information is not on your public site but is nonetheless essential for your chatbot: internal returns policy, answers to common objections, detailed shipping information, contacts by department...
Free-form text blocks let you add this content directly in your WordPress back-office, without publishing it on your site. This content is indexed exactly like your web pages.
Examples of effective text blocks:
- "Our standard delivery time is 3-5 business days within France. Orders placed before 2pm are dispatched the same day."
- "For refund requests, contact service@mysite.com with your order reference. Refunds are processed within 5 days."
- "We offer free demos every Tuesday and Thursday at 2pm. Register via the contact form."
Uploading PDF files
Do you have commercial brochures, product sheets, user guides or catalogues in PDF format? They can be indexed directly into your knowledge base.
From the "RAG Content" section of your WordPress back-office, upload your PDFs directly. They are sent to our servers (never stored in your WordPress media library), converted to text, semantically chunked and indexed like any other page.
Particularly useful PDF types:
- Detailed product sheets (technical specifications, dimensions, certifications)
- Installation or user guides
- General terms of sale and warranty conditions
- Price catalogues
- Company presentation documents
Excluding unnecessary or sensitive pages
Not every page on your site deserves to be in the knowledge base. Excluding unnecessary pages improves answer quality by reducing noise, and keeps you under your plan's indexed page limit.
Pages to always exclude
- WordPress administration pages (/wp-admin/)
- E-commerce cart and checkout
- User account pages (/my-account/)
- Login and registration pages
- Uninformative category and tag archives
- Search results pages
- Purely formulaic legal pages
- Draft or test pages
If you are approaching the 1,000-page limit (base plan), excluding WordPress archives and taxonomies is often enough to free up several hundred slots. The Volume option (+€5/month) raises the cap to 10,000 pages if your site is very large.
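If you want to estimate how many slots an exclusion list would free up before configuring it, a small Python filter over your sitemap URLs can help. This is a rough sketch: `/wp-admin/` and `/my-account/` come from the list above, while `/cart/`, `/checkout/`, `/tag/` and `/category/` are assumed default paths that may differ on your site, and `should_index` is a hypothetical name.

```python
from urllib.parse import parse_qs, urlparse

# Path prefixes to keep out of the knowledge base (adjust to your own site).
EXCLUDED_PREFIXES = (
    "/wp-admin/",
    "/my-account/",
    "/cart/",
    "/checkout/",
    "/tag/",
    "/category/",
)


def should_index(url: str) -> bool:
    """Return False for URLs that match an excluded prefix or a search query."""
    parsed = urlparse(url)
    # Normalise the path so "/my-account" and "/my-account/" behave the same.
    path = parsed.path if parsed.path.endswith("/") else parsed.path + "/"
    if any(path.startswith(prefix) for prefix in EXCLUDED_PREFIXES):
        return False
    # WordPress search result pages use the "s" query parameter.
    if "s" in parse_qs(parsed.query):
        return False
    return True
```

Running `sum(should_index(u) for u in urls)` over your full URL list gives a quick estimate of the page count you would have after exclusions.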
Optimising your page content for the chatbot
The quality of your content directly affects the quality of responses. Here are writing best practices for RAG-friendly content:
- Answer questions explicitly. Instead of "Our lead times are fast", write "Our delivery times are 2 to 4 business days within mainland France."
- Use clear headings (H2/H3). Headings help with semantic chunking. "Returns policy" as an H2 helps the chatbot locate that section.
- One idea per paragraph. Short, focused paragraphs are chunked better than a long dense block of text.
- Create a proper FAQ. It's the most effective format for RAG. Each Q&A pair is a perfect chunk.
- Avoid vague phrasing. "We offer various options" cannot be turned into a useful answer. "We offer 3 plans: Starter at €29, Pro at €79 and Enterprise on request" can.
- Explicitly mention your industry, location and speciality. These details contextualise all responses.
Manual vs automatic sync
Automatic sync: The crawl triggers once every 24 hours at a fixed time randomly assigned to your account at registration (to spread the load). All pages modified since the last crawl are reprocessed. New pages are detected and added automatically.
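The scheduling rule above can be pictured with a short Python sketch: each account receives a random minute of the day once, at registration, and the crawl then fires at that same minute every day. The function names and the minute-level granularity are assumptions made for illustration only.

```python
import random
from datetime import datetime, timedelta

MINUTES_PER_DAY = 24 * 60


def assign_slot(rng: random.Random) -> int:
    """Pick a crawl slot (minute of the day) once, at registration."""
    return rng.randrange(MINUTES_PER_DAY)


def next_crawl(now: datetime, slot_minute: int) -> datetime:
    """Return the next time the account's daily slot occurs strictly after now."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    today_slot = midnight + timedelta(minutes=slot_minute)
    # If today's slot has already passed, the next crawl is tomorrow.
    return today_slot if today_slot > now else today_slot + timedelta(days=1)
```

Spreading accounts across the 1,440 minutes of the day is what keeps the crawls from all starting at once.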
Manual sync: From your WordPress dashboard, the "Sync now" button triggers an immediate crawl. Useful when you've just published an important page, corrected information or added supplementary content. Manual sync does not replace the automatic cycle; it supplements it.
Good to know: Text blocks and PDFs you add directly in the back-office are indexed immediately, without waiting for the next crawl. This is the fastest way to enrich your knowledge base.
Ready to launch your chatbot?
Install Chatbot Flow in 5 minutes. The first crawl of your site starts automatically. 30-day trial, no credit card required.