
An AI Chatbot in Google Chat that queries our documents and learns from the Team

Guillem

10 minute read

Introduction and objectives

Since the AI boom began—first with ChatGPT and then with the models that have emerged—our team has also asked itself the same question as many others: how could we leverage this capability, not only with the model's existing knowledge, but also with our own internal information? We wanted a way to easily access the knowledge we generate every day—documentation, lessons learned, past mistakes—without having to open a thousand tools or search engines.


This project was born precisely from that. We wanted something practical, integrated into our work environment (in this case, Google Chat), that would allow us to ask questions about our own documents and also feed that knowledge base directly from conversations. It's a simple solution, but we believe it can be easily applied to many other teams facing the same problem.


One final goal, but one that seemed very important to us, was to do all of this in a sustainable way. There's a lot of documentation and plenty of examples on how to create AI chatbots that interact with internal information, but they rely on very complex and expensive infrastructures. They work, but we didn't think they would be sustainable, or that they would actually get used, given the costs involved. So we went for the minimum infrastructure that meets these goals.

Finally, my apologies if anyone reading this is named Bartolo 😅. I'm not sure why that name, but that's what I decided to call our internal bot. I guess SoftBot, SoftspringBot, or names like that seemed really boring to me... I could have gone the modern route and called it Leaves, or something greenish or spring-like that isn't an existing name, but anyway, it's called Bartolo for now, for lack of a better name.

How we set it up

An alternative to building this yourself is to use a provider that offers Retrieval Augmented Generation (RAG) as a service. Some providers integrate the chatbot directly: they give you an endpoint for your chatbot, the code to embed it in your website, and a control panel to upload documents. However, we wanted something integrated with our Google Chat, and we also wanted to do it ourselves. The cost difference is huge.

The system is composed of several connected blocks:

[Diagram: Bartolo's architecture, with the blocks described below]

Although it may seem like a lot of different pieces, it's actually quite simple. Let's go step by step, starting from the top left of the diagram.

Cloud Scheduler is a Google Cloud product (we set everything up on Google Cloud, but there are equivalent products in AWS or Azure, of course) that allows you to call a service periodically. In this case, for now, we've configured it to call the function we named Documents in the diagram once a week.

The Documents function is a Python script that checks a Google Drive folder to see if there are any new files or if existing ones have been updated. If there are new or updated files, it calls a Gemini model to generate the embeddings for them and stores those embeddings in a PostgreSQL table in Cloud SQL.

Yes, the previous paragraph is dense, and it could be a separate article later. For now, the idea is to break down the text files and extract the information that will allow us to search for similar or related content from everything we have. We store all this information in a Cloud SQL database.

On the other hand, we have a very simple script in Google Apps Script that acts as the bot, Bartolo. This script contains the methods that react when it is added to a new space or when a message is sent to it.

When it receives a message, it calls the Cloud Function we've named Chat here. This function distinguishes between when we're asking it for information and when we're asking it to save something. In the first case, it looks for semantic matches in the Cloud SQL database; if there are any, it passes them as context to a Gemini model. In the second case, it generates the embeddings, as we did with the Google Drive files, and stores them in Cloud SQL for subsequent queries.
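The code of the Chat function itself isn't listed in this article, but its query path can be sketched roughly like this. It's a minimal sketch, not our exact implementation: it assumes the genai_client and the SQLAlchemy connection pool from the code section further down, the generate_embedding and to_pgvector_string helpers shown later, and illustrative choices such as the gemini-2.0-flash model name and the limit of 5 chunks.

import sqlalchemy

def answer_question(genai_client, pool, question: str) -> str:
    # Embed the user's question with the same embedding model used for the documents
    question_embedding = to_pgvector_string(generate_embedding(genai_client, question))

    # Retrieve the closest chunks using pgvector's cosine distance operator (<=>)
    with pool.connect() as db_conn:
        rows = db_conn.execute(
            sqlalchemy.text("""
                SELECT content
                FROM documents
                ORDER BY embedding <=> CAST(:q AS vector)
                LIMIT 5
            """),
            {"q": question_embedding},
        ).fetchall()
    context = "\n\n".join(row[0] for row in rows)

    # Pass the retrieved chunks as context to a Gemini model
    response = genai_client.models.generate_content(
        model="gemini-2.0-flash",  # illustrative model name
        contents=f"Answer using only this context:\n{context}\n\nQuestion: {question}",
    )
    return response.text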

How each part works at a functional level

One idea I want to convey with this article is that what we built this way could have been built in many other ways. That's why, in addition to explaining how we did it, in this section I want to explain which components we needed at a functional level, why we chose the ones we did, and which alternatives exist, in case some of them fit your situation better or you had already thought of doing it differently.

Document Processing: We needed something to prepare the documents. To perform RAG (Retrieval Augmented Generation), which is what we're doing when we pass our own information to the generative AI model, one approach is to pre-process the documents, generate embeddings from them, and save those embeddings so we can later run semantic queries: looking for similarities with what the user is typing and, once found, passing those similar fragments to a generative model to produce the response.

The other idea is to use modern models with very large contexts and pass all our information as context in every call. This approach, explained for example in this article (in English), has two fundamental problems. In terms of cost, we'd be passing EVERYTHING we have in each call to the generative model, which is wasteful. And no matter how large the context is, at some point we may hit its physical limits and the approach stops being useful (there are now models like magic.dev's with contexts of up to 100M tokens, roughly 100 million word fragments; it's hard to reach that limit, but passing all of it in every call to the bot would be tremendously expensive).

In terms of which machine to use and how to run it, there are many proposals based on a Kubernetes cluster where the different parts run as containers. Economically, that means a set of machines running 24/7, and at least in our case that was a bit excessive, so we opted for Cloud Functions, which have somewhat higher latency (they take a little longer to respond because they're not constantly running; it's as if they have to be started every time) but are only paid for when used.

Along these same lines, we've opted to store our documents in Google Drive for convenience; technically, this would be very similar with any other storage option. We did choose a Cloud Scheduler job that runs once a week and checks for changes because it's very easy to configure and maintain separately (if we want to check more frequently, we just change the Cloud Scheduler settings).

It could also be done the other way around, with Google Drive notifying us whenever there is a change, but that is a little more complex and makes it harder to switch providers. What we have done is very easy to adapt to another storage backend, a bucket in S3 or a Space in DigitalOcean, for example; the push-notification approach would not be.

Embedding database: We needed somewhere to save these vectors (embeddings) generated from the content we extracted from the PDFs, and later also for the things we ask the bot to save. There are many AI-specific vector databases, such as Pinecone, Qdrant, Weaviate, Chroma, etc. Some are open source, while others are offered as a service.

Within Google Cloud there is AlloyDB, and another alternative is Cloud SQL, the standard managed database product, with a PostgreSQL database (open source and very common in the development world), which has an extension called pgvector that lets you store vectors and run semantic searches.

We're not planning to compare them all here, but if we wanted to self-host Qdrant, for example, which is open source, on Google Cloud we'd have to set up a dedicated machine in GCE or similar, or run it as a Kubernetes container (the architecture we mentioned earlier, with everything in a cluster). The most cost-effective option is the one we've chosen: the smallest possible Cloud SQL machine, with PostgreSQL and pgvector, for about $7 per month.

In terms of performance, for large volumes and high concurrency, databases created specifically for this purpose perform better, but these differences require a very specific use case to be worthwhile.

Integration with Google Chat: Something similar to the document-processing function applies here. We could have deployed it as a Kubernetes container; in that scenario we would have set up a Qdrant database, for example, and ended up with a cluster holding all the containers: the one that processes documents, the vector database, and the one that handles the chat. But since they're not in constant use, that seemed like too much for this use case, so here too we opted for a Cloud Function in Python, very similar to the other one.

For the integration itself, Google Chat allows direct integration with endpoints (it calls a URL when there's a message), and that endpoint could be our function or our container. We decided to do it with Apps Script for simplicity, so the same script handles all the possible chat calls (Google Chat calls different functions when there's a message, when the bot is added to a new space, when it's removed from a space, etc.).

Okay, so how much does this cost?

Let's get down to business: in the end, all this AI stuff is great, but we have to pay for it.

Cloud Functions have a free tier of 240,000 vCPU-seconds per month. Our functions use 1 vCPU and take an average of 4-5 seconds (the Documents function takes much longer, but it only runs once a week). That works out to around 48,000 free calls per month. Keep in mind that this is aggregated per billing account on Google. If you have multiple projects with the same billing account and several of them use Cloud Functions, the free tier will be shared among them.

The Cloud Scheduler job is covered by the free tier, as it only runs once a week.

The Cloud SQL instance we set up is the cheapest possible: a shared-core instance, without redundancy, on the Enterprise edition, db-f1-micro, at about $6-7 per month.

As for Gemini, the free tier is currently limited to 100 calls per day, specifically for the experimental model we use for embeddings. However, once you have a paid account associated with the project, tier 1 allows 1,000 calls per day. You can see all the prices here. Clearly, this is the area where the price can grow the most with usage, but the choice of models can also be adjusted considerably. We will update this article later when we have details on the monthly cost.

A little bit of code detail

It wouldn't really be sharing if we didn't explain in more detail how to do it.

A note on security: we've set everything up so that a project service account is what has access to the various Cloud Functions, the database, and so on. In other words, nothing is accessible from outside; everything is accessed through a service account.

On the code side, all the keys, both for the database and the Gemini API key, etc., are stored as secrets in Google Secret Manager, and the function accesses them through the service account, so they're never in the code. There's a method:

from google.cloud import secretmanager

def access_secret(secret_id):
    # Read the latest version of a secret from Google Secret Manager
    project_id = OUR_PROJECT
    name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
    secret_client = secretmanager.SecretManagerServiceClient()
    response = secret_client.access_secret_version(name=name)
    return response.payload.data.decode("UTF-8")

which returns a secret given its ID. This is then used in the function's setup method:

from google import genai
from google.auth import default
from googleapiclient.discovery import build

def setup():
    logging.info("Setting up environment...")

    # Load secrets
    drive_folder_id = access_secret("DRIVE_FOLDER_ID")
    pg_user = access_secret("PG_USER")
    pg_password = access_secret("PG_PASSWORD")
    pg_database = access_secret("PG_DATABASE")
    pg_host = access_secret("PG_HOST")
    gemini_api_key = access_secret("GEMINI_API_KEY")

    # Initialize GenAI client
    genai_client = genai.Client(api_key=gemini_api_key)

    # Initialize Drive client
    credentials, _ = default(scopes=['https://www.googleapis.com/auth/drive.readonly'])
    drive_service = build('drive', 'v3', credentials=credentials)

    # Return everything the calling function needs
    return {
        "drive_folder_id": drive_folder_id,
        "genai_client": genai_client,
        "drive_service": drive_service,
        "pg_user": pg_user,
        "pg_password": pg_password,
        "pg_database": pg_database,
        "pg_host": pg_host,
    }

Then the function that looks for the files and generates their embeddings would be:

def poll_drive_and_update_embeddings(request=None):
    try:
        env = setup()
        # assign the variables from setup
        ...

        # look for the PDF files in the Google Drive folder
        query = f"'{drive_folder_id}' in parents and mimeType='application/pdf'"
        results = drive_service.files().list(
            q=query,
            fields="files(id, name, modifiedTime)",
            supportsAllDrives=True,
            includeItemsFromAllDrives=True,
        ).execute()
        files = results.get('files', [])
        
        # For each file, if its modification date is later than the last
        # one we have stored, delete the embeddings we had and generate
        # them again
        for file in files:
            file_id = file['id']
            name = file['name']
            modified_time = date_parser.parse(file['modifiedTime'])
        
            with pool.connect() as db_conn:
                row = db_conn.execute(
                    sqlalchemy.text("SELECT MAX(created_at) FROM documents WHERE file_id = :file_id"),
                    {"file_id": file_id}
                ).fetchone()
                last_processed_time = row[0]

                if last_processed_time and last_processed_time >= modified_time:
                    logging.info(f"Skipping unchanged file: {name}")
                    continue

                db_conn.execute(
                    sqlalchemy.text("DELETE FROM documents WHERE file_id = :file_id"),
                    {"file_id": file_id}
                )
                db_conn.commit()

                # This is the part that downloads a file, splits it into chunks
                # and, for each chunk, generates the embeddings with a call to
                # the Gemini model

                pdf_data = download_file(drive_service, file_id)
                text = extract_text_from_pdf(pdf_data)
                chunks = split_text_into_chunks(text)

                for i, chunk in enumerate(chunks):
                    embedding = generate_embedding(genai_client, chunk)
                    embedding_str = to_pgvector_string(embedding)
                    db_conn.execute(
                        sqlalchemy.text("""
                            INSERT INTO documents (file_id, file_name, chunk_index, content, embedding)
                            VALUES (:file_id, :file_name, :chunk_index, :content, :embedding)
                            """),
                        {
                            "file_id": file_id,
                            "file_name": name,
                            "chunk_index": i,
                            "content": chunk,
                            "embedding": embedding_str
                        }
                    )

                db_conn.commit()

        connector.close()
        return "Success", 200

    except Exception as e:
        logging.exception("An error occurred during execution.")
        return f"Error: {str(e)}", 500

Inside that function there is a call to generate_embedding, which looks like this:

from google.genai import types

def generate_embedding(client, text: str) -> list[float]:
    # Generate the embedding for a chunk of text with the Gemini embeddings model
    response = client.models.embed_content(
        model="gemini-embedding-exp-03-07",
        contents=text,
        config=types.EmbedContentConfig(
            task_type="RETRIEVAL_DOCUMENT",
            output_dimensionality=768,
        ),
    )
    return response.embeddings[0].values

As new models appear, the model name would have to be updated (this could have been another environment variable or a secret). We've also fixed the output at 768 dimensions. By default this model generates 3072-dimensional vectors; in our tests, for this use case, 768 is enough. This model only allows three values: 3072 (the default), 1536 or 768.
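Two helpers appear in the polling function above but aren't listed: split_text_into_chunks and to_pgvector_string. A minimal sketch of what they could look like (the 2,000-character chunk size and 200-character overlap are illustrative values, not necessarily the ones we use):

def split_text_into_chunks(text: str, max_chars: int = 2000, overlap: int = 200) -> list[str]:
    # Split the extracted text into overlapping chunks so each one fits
    # comfortably in an embedding call and keeps some surrounding context
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks

def to_pgvector_string(embedding: list[float]) -> str:
    # Format a list of floats as the '[0.1,0.2,...]' literal that pgvector expects
    return "[" + ",".join(str(x) for x in embedding) + "]"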

The database definition we use, in our case:

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    file_id TEXT,
    file_name TEXT,
    chunk_index INT,
    content TEXT,
    embedding VECTOR(768),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(file_id, chunk_index)
);

It must, of course, be consistent with that definition of the vector size.
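One more note: the VECTOR type in that CREATE TABLE comes from the pgvector extension, which has to be enabled once on the database before the table can be created. A minimal sketch, using the same SQLAlchemy style as above (pool is the same connection pool assumed in the earlier function):

import sqlalchemy

# One-time setup against the same Cloud SQL database: enable pgvector
# so the VECTOR(768) column type is available
with pool.connect() as db_conn:
    db_conn.execute(sqlalchemy.text("CREATE EXTENSION IF NOT EXISTS vector"))
    db_conn.commit()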

Conclusions

Bartolo is a practical example of how to apply AI in a useful and concrete way to a team's daily work. It's not a generic solution, but rather a specific integration that solves real needs: quickly accessing internal knowledge and keeping it alive through the team's regular conversations. We adapted new AI components to an already familiar environment to create a lightweight, integrated, and useful tool. We believe these types of solutions can inspire other teams to build on what they already have, with a layer of intelligence that truly adds value.

📫
Here’s today’s article. Feel free to reach out to us on social media as always, or at hola@softspring.es with any questions or suggestions!
