Created in partnership with Sidekick Lab
Build Your Own AI Discovery Tool
A Step-by-Step Implementation Guide
For technically savvy teams ready to implement Phase 1 data discovery using open-source tools. Estimated time: 15-30 hours.
Step 1
Assess Your Data Environment and Choose Tools (1-2 hours)
- Inventory your data sources: Start by listing your databases (e.g., SQL Server, PostgreSQL, MongoDB) and systems. Use simple scripts to connect and list schemas—Python with libraries like SQLAlchemy can help query metadata without exporting data.
- Select an open-source data catalog tool: These automate discovery and mapping.
- Recommended: OpenMetadata – Free, easy to install via Docker, supports automated metadata ingestion from 50+ sources, AI-powered search, and on-prem deployment. It’s user-friendly for businesses and includes features for data quality checks.
- Alternative: DataHub – From LinkedIn, great for enterprise-scale, with AI integrations for recommendations and lineage tracking. Deploy via Kubernetes for security.
- Other options: Apache Atlas for strong governance, Amundsen for search-focused discovery, or OpenDataDiscovery (ODD) for AI-enhanced metadata.
- For AI components: Use LangChain (open-source) to build agents that interact with your data. Pair it with a local LLM like Llama 3 via Ollama for on-prem AI processing.
Security note: Ensure all tools run in your infrastructure (e.g., on a virtual machine or private cloud) with read-only access to databases. Use tools like Docker for isolation.
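The inventory script mentioned above can be sketched with SQLAlchemy’s inspector, which reads schema metadata without touching row data. The connection URI and function name here are illustrative placeholders—substitute your own read-only account:

```python
# Sketch: inventory tables and columns via SQLAlchemy's inspector.
# Only metadata is read; no row data leaves the database.
from sqlalchemy import create_engine, inspect

def inventory_schema(db_uri):
    """Return {table_name: [(column, type), ...]} for every table in the default schema."""
    engine = create_engine(db_uri)
    inspector = inspect(engine)
    catalog = {}
    for table in inspector.get_table_names():
        catalog[table] = [
            (col["name"], str(col["type"])) for col in inspector.get_columns(table)
        ]
    return catalog
```

Point it at a URI such as postgresql://reader:pass@localhost/db and you get a dictionary you can review or feed into later steps.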
Step 2
Set Up Secure On-Premises Deployment (4-6 hours)
- Install the data catalog: Download and deploy your chosen tool.
- For OpenMetadata: Run docker-compose up on a local server. Configure connectors for your databases (e.g., via YAML files) to ensure data never leaves your environment.
- Integrate authentication: Use Active Directory or LDAP for access controls.
- Deploy AI locally: Install Ollama (free, open-source) to run an LLM on your hardware. Pull a model like Mistral: ollama run mistral. This keeps AI processing in-house.
- Test connectivity: Grant read-only permissions to the catalog tool. Run a sample ingestion to confirm it scans without data export.
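The connectivity test can be sketched as a minimal read-only probe (assuming SQLAlchemy; the URI is a placeholder for the same credentials you grant the catalog tool):

```python
# Minimal connectivity probe: open a connection with the catalog's
# read-only account and run a trivial query. No table data is read.
from sqlalchemy import create_engine, text

def check_readonly_connection(db_uri):
    """Return True if a trivial query succeeds over the given URI."""
    engine = create_engine(db_uri)
    with engine.connect() as conn:
        return conn.execute(text("SELECT 1")).scalar() == 1
```

A failure here usually means a permissions or network issue rather than a catalog misconfiguration, so it is worth running before any ingestion.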
Step 3
Automate Discovery and Complete Mapping (4-8 hours)
- Ingest metadata automatically:
Use the catalog’s built-in agents or crawlers.
- In OpenMetadata or DataHub, set up “ingestion pipelines” that connect to your databases and pull schema info (tables, fields, data types, relationships, dependencies) without manual effort. These tools use AI to infer lineages and detect changes.
- Example: For a SQL database, configure a connector to run periodically—it will catalog everything in one go.
- Build a simple AI agent for enhanced discovery:
Using LangGraph (a LangChain extension), create an agent that queries the database metadata.
- Install LangChain: pip install langchain langgraph.
- Script example: Define an agent that uses SQL queries to list tables and fields, then uses the local LLM to summarize structures.
- Code snippet (adapt as needed):
# Import paths assume langchain-community; they vary by LangChain version.
from langchain_community.utilities import SQLDatabase
from langchain_community.llms import Ollama
from langchain_community.agent_toolkits import create_sql_agent
db = SQLDatabase.from_uri("your_database_uri") # e.g., postgresql://user:pass@localhost/db
llm = Ollama(model="mistral")
agent = create_sql_agent(llm, db=db, verbose=True)
result = agent.run("Describe all tables and their relationships.")
- This outputs a mapped view of your data.
- Output:
Generate a data dictionary (e.g., export to CSV or JSON) with fields, types, and descriptions.
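One way to produce that export is a small CSV writer over a schema mapping. The {table: [(column, type), ...]} input shape is an assumption, matching what a catalog export or SQLAlchemy’s inspector can provide; the description column is left blank for the business context added later:

```python
# Sketch: dump a harvested schema to a CSV data dictionary.
# Input is assumed to be {table: [(column, type), ...]}.
import csv

def write_data_dictionary(catalog, out_path):
    """Write one row per column: table, column, type, description (blank to fill in)."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["table", "column", "type", "description"])
        for table, columns in catalog.items():
            for name, col_type in columns:
                writer.writerow([table, name, col_type, ""])
```

The resulting CSV is easy to circulate for review before the tags and descriptions go back into the catalog.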
Step 4
Add Business Context and Assessments (2-4 hours)
- Link to business functions: Manually tag assets in the catalog (e.g., label a “sales” table as linked to “revenue operations”). For automation, use your AI agent:
- Prompt the LLM: “Based on table names like ‘customer_orders’, suggest business use cases like order tracking.”
- Tools like DataHub allow custom metadata tags; integrate the agent to auto-suggest via APIs.
- Data quality and relationship assessments: Integrate open-source tools like Great Expectations (for validation) or use built-in features in OpenMetadata.
- Run checks: e.g., for completeness, uniqueness, or relationship integrity.
- AI enhancement: Use the agent to analyze samples: “Assess data quality in table X and flag issues.”
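As a lighter-weight alternative to Great Expectations, completeness and uniqueness can be probed with plain read-only SQL. This hand-rolled sketch assumes SQLAlchemy and trusted table/column names (they are interpolated into the SQL, so never pass user input):

```python
# Sketch: completeness (share of non-NULL values) and uniqueness
# (share of distinct values) for one column, via read-only queries.
from sqlalchemy import create_engine, text

def column_quality(db_uri, table, column):
    """Return {'completeness': ..., 'uniqueness': ...} ratios for one column."""
    engine = create_engine(db_uri)
    with engine.connect() as conn:
        total = conn.execute(text(f"SELECT COUNT(*) FROM {table}")).scalar()
        non_null = conn.execute(text(f"SELECT COUNT({column}) FROM {table}")).scalar()
        distinct = conn.execute(
            text(f"SELECT COUNT(DISTINCT {column}) FROM {table}")
        ).scalar()
    if total == 0:
        return {"completeness": 1.0, "uniqueness": 1.0}
    return {"completeness": non_null / total, "uniqueness": distinct / total}
```

A completeness below 1.0 flags missing values; a uniqueness near 1.0 on an ID column is what you would expect for a primary key.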
- Create a use-case register: Compile links in a document or the catalog’s UI, showing how data connects to ops (e.g., “Inventory table → Supply chain management”).
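The auto-tagging idea above can be scripted against Ollama’s local REST API (it listens on port 11434 by default). The model name and prompt wording here are assumptions to adapt:

```python
# Sketch: ask a local Ollama model to suggest business tags for table names.
# Assumes Ollama is running locally; stdlib only, no extra dependencies.
import json
import urllib.request

def build_tag_prompt(table_names):
    """Compose the prompt sent to the local LLM."""
    return (
        "For each table name below, suggest a likely business function "
        "(e.g., 'customer_orders' -> order tracking). Tables:\n"
        + "\n".join(table_names)
    )

def suggest_tags(table_names, model="mistral", host="http://localhost:11434"):
    """POST to Ollama's /api/generate endpoint and return the model's reply."""
    payload = json.dumps(
        {"model": model, "prompt": build_tag_prompt(table_names), "stream": False}
    ).encode()
    req = urllib.request.Request(
        host + "/api/generate", data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

The suggestions are a starting point for review, not ground truth—have a domain owner confirm tags before they go into the catalog.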
Step 5
Build an Interactive Prototype for Natural Language Queries (4-8 hours)
- Set up NLQ: Use LangChain to connect your catalog to the LLM for conversational queries.
- Example: Build a simple web app with Streamlit (open-source): Users type questions like “What fields are in the sales table?” and the agent translates to SQL.
- Code base: Extend the agent from Step 3 to handle natural language.
- Foundation for future AI: Your catalog now serves as a metadata layer for advanced initiatives, like RAG (Retrieval-Augmented Generation) systems.
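The Streamlit app can be sketched in a single file. This assumes Streamlit, LangChain, and a running Ollama instance; the connection URI is a placeholder and import paths vary by LangChain version:

```python
# app.py -- minimal natural-language-query front end (a sketch, not production code).
import streamlit as st
from langchain_community.utilities import SQLDatabase
from langchain_community.llms import Ollama
from langchain_community.agent_toolkits import create_sql_agent

st.title("Ask your data")

question = st.text_input(
    "Question", placeholder="What fields are in the sales table?"
)
if question:
    # Read-only account, same as the catalog's; data never leaves your environment.
    db = SQLDatabase.from_uri("your_database_uri")
    agent = create_sql_agent(Ollama(model="mistral"), db=db, verbose=False)
    st.write(agent.run(question))
```

Launch it with streamlit run app.py and it serves on localhost, keeping queries inside your infrastructure.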
Step 6
Test, Iterate, and Scale (Ongoing, 2-4 hours initially)
- Validate: Run end-to-end tests—ensure mappings are accurate, security holds (e.g., audit logs), and queries work.
- Common pitfalls: Start small (one database). If stuck, community forums for these tools (e.g., GitHub issues) are helpful.
- Time estimate: For a small business, 1-2 days; scale up for complexity.
- Costs: Mostly free (open-source), but factor in server resources.
This DIY approach helps you turn data chaos into a structured foundation. If your team lacks deep expertise, consider starting small with a proof-of-concept on a subset of your data.
Created for Sidekick AI Masterclass – October 29, 2025


