Connect Your Data with Firecrawl
Connect websites to your agents using Firecrawl
Overview
Firecrawl is a web scraping and web crawling platform that extracts clean content from web pages and converts it to markdown or structured JSON, ready for embedding and use in RAG pipelines.
With Firecrawl you can connect your agents to:
- Websites: Crawl and index an entire site, extracting clean content from each page
- Web pages: Individual page scraping with automatic content extraction
RAG pipeline workflow
A complete RAG pipeline that connects your websites to your agents looks like this:
1. Scrape web content with Firecrawl: extract clean markdown from websites
2. Save the clean markdown: store the scraped content locally
3. Index the documents in Pinecone: use the Pinecone Assistant SDK
4. Agent queries via MCP server: retrieve relevant content using semantic search
Getting started
Prerequisites
Before we get started, make sure you have the following:
- A Firecrawl account
- A Pinecone account
- uv installed
- An active Python virtual environment
Step 1: Set up Firecrawl and collect data
Install Firecrawl and retrieve your API key from firecrawl.dev:
Save your API key to a .env file:
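Since your scripts need to read the key back at runtime, here is a minimal sketch of a standard-library .env loader. The helper name load_env_file and the variable name FIRECRAWL_API_KEY are assumptions for illustration, not names this guide prescribes; a package such as python-dotenv covers more of the .env syntax if you prefer a dependency.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines and export them to os.environ.

    Blank lines, comment lines starting with '#', and lines without '='
    are skipped. Variables already set in the environment are kept as-is.
    """
    loaded: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Drop optional surrounding quotes, e.g. KEY='value' or KEY="value".
        value = value.strip().strip("'\"")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded
```

After calling load_env_file(), a script can read the key with os.environ["FIRECRAWL_API_KEY"] (assuming that is the name you used in the file).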
The following script uses Firecrawl to explore a website's structure, identify all available pages, and convert each page's content into markdown files.
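As a rough sketch of the save step, the helpers below turn each page URL into a filename and write the returned markdown to disk. Firecrawl SDK method names differ between versions, so the sketch takes the scraper as a plain callable instead of calling the SDK directly. The names url_to_filename, save_site_as_markdown, and the scraped_docs directory are hypothetical.

```python
import re
from pathlib import Path
from typing import Callable, Iterable
from urllib.parse import urlparse


def url_to_filename(url: str) -> str:
    """Turn a page URL into a safe, readable markdown filename."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower()
    return f"{slug or 'index'}.md"


def save_site_as_markdown(
    urls: Iterable[str],
    scrape_markdown: Callable[[str], str],
    output_dir: str = "scraped_docs",
) -> list[Path]:
    """Scrape each URL and write its markdown to output_dir.

    scrape_markdown is a placeholder for a function that wraps the
    Firecrawl client and returns one page's markdown as a string.
    Pages that come back empty are skipped.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    for url in urls:
        markdown = scrape_markdown(url)
        if not markdown or not markdown.strip():
            continue
        path = out / url_to_filename(url)
        path.write_text(markdown, encoding="utf-8")
        written.append(path)
    return written
```

In a real run, urls would come from Firecrawl's site-mapping call and scrape_markdown would request the markdown format from its scrape call; check the Firecrawl SDK documentation for the exact method names in your installed version.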
For a more in-depth tutorial on programmatically scraping with Firecrawl, follow Firecrawl's guide under the "Building a RAG Pipeline" section.
Step 2: Set up Pinecone Assistant and index your documents
We'll load those markdown files, chunk them, and store them in Pinecone. First, install the required packages:
Set up your Pinecone API key environment variable:
In Pinecone Assistant, create a new assistant named "drug-info-rag". The code below uploads your documents to the assistant, which indexes them along with their embeddings.
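A minimal sketch of the upload step might look like the following. It assumes the pinecone package with Assistant support is installed, that the markdown files live in a scraped_docs directory (a hypothetical name), and that the assistant client exposes upload_file(file_path=...); verify these calls against the Pinecone SDK version you have installed.

```python
import os
from pathlib import Path
from typing import Any


def collect_markdown_files(directory: str) -> list[Path]:
    """Return all .md files under directory, sorted for a stable order."""
    return sorted(Path(directory).rglob("*.md"))


def upload_all(assistant: Any, paths: list[Path]) -> list[str]:
    """Upload each file and return the names of the files that were sent.

    assistant is any object exposing upload_file(file_path=...), which is
    how this sketch assumes the Pinecone Assistant client behaves.
    """
    uploaded: list[str] = []
    for path in paths:
        assistant.upload_file(file_path=str(path))
        uploaded.append(path.name)
    return uploaded


def main() -> None:
    # Imported here so the helpers above work without the SDK installed.
    from pinecone import Pinecone

    pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
    # Assumes the "drug-info-rag" assistant already exists.
    assistant = pc.assistant.Assistant(assistant_name="drug-info-rag")
    names = upload_all(assistant, collect_markdown_files("scraped_docs"))
    print(f"Uploaded {len(names)} files")
```

Call main() once PINECONE_API_KEY is set. Keeping the file collection and the upload loop separate from the SDK setup makes it easy to dry-run the loop with a stand-in object before sending anything to Pinecone.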
Step 3: Get your Pinecone Assistant MCP server URL
- Navigate to the Settings tab in Pinecone Assistant
- Copy the MCP URL provided
Step 4: Register the MCP server
Register the Pinecone MCP server as a tool in your agent configuration. Replace <your-mcp-url> with the MCP URL you copied in Step 3.
Using TypeScript SDK:
You can store your credential using a keychain, Nango, or environment variables; this example uses environment variables.
Using Visual Builder:
- Add a Pinecone credential:
  - Go to the Credentials tab in the Visual Builder
  - Click "New credential"
  - Select "Bearer authentication"
  - Enter:
    - Name: Pinecone API Key (or your preferred name)
    - API key: Your Pinecone API key (found in your Pinecone dashboard)
  - Click "Create Credential" to save
- Register the MCP server:
  - Go to the MCP Servers tab in the Visual Builder
  - Click "New MCP server"
  - Select "Custom Server"
  - Enter:
    - Name: Pinecone Documents
    - URL: Your MCP URL from the Pinecone Settings tab
    - Transport Type: Streamable HTTP
    - Credential: Select the Pinecone credential you created
  - Click "Create" to save the server
- Add the MCP tool to your sub agent:
  - Drag the Pinecone Documents MCP tool onto your agent canvas and connect it to the sub agent
Step 5: Use the Pinecone Assistant MCP server in your agent
Once you have registered your MCP server as a tool and connected it to your agent, your agent can use the Pinecone Assistant tool to search and retrieve relevant content from your uploaded documents.
Ask an interesting question like, "What are the primary uses of amlodipine and atorvastatin, and how do they work in the body?"
The Pinecone tool provides a get_context function that retrieves relevant document snippets from your knowledge base. When your agent calls this tool, it will:
1. Search semantically: Use vector similarity search to find the most relevant content based on the query
2. Return formatted snippets: Each result includes:
   - file_name: The name of the file containing the snippet
   - pages: The page numbers where the snippet appears (for PDFs and DOCX files)
   - content: The actual text content of the snippet
Parameters:
- query (required): The search query to retrieve context for
- top_k (optional): The number of context snippets to retrieve. Defaults to 15.
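To make the tool call concrete, here is a small sketch of the JSON-RPC request an MCP client sends to invoke get_context, plus a formatter for the returned snippets. Your agent framework normally builds this request for you. The helper names build_get_context_request and format_snippet are hypothetical, and the snippet fields are taken from the list above.

```python
import json
from typing import Any


def build_get_context_request(
    query: str, top_k: int = 15, request_id: int = 1
) -> dict[str, Any]:
    """Build an MCP tools/call request for the get_context tool."""
    if not query.strip():
        raise ValueError("query is required")
    if top_k < 1:
        raise ValueError("top_k must be a positive integer")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "get_context",
            "arguments": {"query": query, "top_k": top_k},
        },
    }


def format_snippet(snippet: dict[str, Any]) -> str:
    """Render one snippet as a short citation followed by its text."""
    pages = snippet.get("pages") or []
    page_note = f" (pages {', '.join(str(p) for p in pages)})" if pages else ""
    return f"[{snippet['file_name']}{page_note}]\n{snippet['content']}"


# Serialize a request that asks for the five most relevant snippets.
payload = json.dumps(build_get_context_request("What is amlodipine used for?", top_k=5))
```

Lowering top_k keeps the agent's context small; raising it trades more tokens for broader coverage of the knowledge base.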