YouTube Video Transcript Search and Idea Generation with Qdrant and OpenAI#71
YouTube Video Transcript Search and Idea Generation with Qdrant and OpenAI#71mwitiderrick wants to merge 2 commits intoqdrant:masterfrom
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR introduces an end‑to‑end example demonstrating semantic search over YouTube video transcripts with Qdrant and idea generation using OpenAI. Key changes include:
- Adding helper functions (e.g., embed_and_store, search_similar_transcripts, generate_video_idea) for handling embeddings, storage, and idea generation.
- Implementing new API endpoints and management commands for YouTube processing and periodic task scheduling.
- Expanding documentation (README.MD) to guide setup and usage.
Reviewed Changes
Copilot reviewed 71 out of 73 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| video-generation/backend/api/youtube_utils.py | Adds YouTube authentication, transcript embedding, and storage functions using Qdrant and OpenAI. |
| video-generation/backend/api/views.py | Provides API endpoints for user API keys, task handling, and video generation requests. |
| video-generation/backend/api/urls.py | Registers new API endpoints for task management and video processing. |
| video-generation/backend/api/transcription.py | Implements audio transcription using OpenAI Whisper API. |
| video-generation/backend/api/tests.py | Placeholder for tests. |
| video-generation/backend/api/redis_client.py | Sets up Redis client for task status. |
| video-generation/backend/api/qdrant_utils.py | Adds functions for semantic search and video idea generation via Qdrant and OpenAI. |
| video-generation/backend/api/models.py | Placeholder for Django models. |
| video-generation/backend/api/management/commands/schedule_tasks.py | Schedules periodic tasks for video creation and vector DB updates. |
| video-generation/backend/api/management/commands/run_youtube_process.py | Provides CLI command to trigger YouTube processing tasks. |
| video-generation/backend/api/management/commands/create_qdrant_collection.py | Command to ensure Qdrant collection exists before processing. |
| video-generation/backend/api/apps.py | Standard Django app configuration. |
| video-generation/README.MD | Updates documentation for project setup, usage, and API integration. |
Files not reviewed (2)
- video-generation/backend/.gitignore: Language not supported
- video-generation/backend/Dockerfile: Language not supported
Comments suppressed due to low confidence (1)
video-generation/backend/api/views.py:25
- The task 'test_celery_task' is referenced without being imported or defined; ensure it is correctly imported or updated to the appropriate task.
task = test_celery_task.delay(2, 3)
|
|
||
|
|
||
|
|
||
| response = openai.chat.completions.create( |
There was a problem hiding this comment.
The call 'openai.chat.completions.create' appears to be incorrect; update it to 'openai.ChatCompletion.create' as per the OpenAI API specification.
| response = openai.chat.completions.create( | |
| response = openai.ChatCompletion.create( |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
kacperlukawski
left a comment
There was a problem hiding this comment.
@mwitiderrick Thanks a lot for contributing this application! I'm not merging it, as I don't think it belongs in the "examples" category. Please let me clarify that a bit. An example for us is a digestible piece presenting how to use a certain Qdrant functionality. However, this seems to be a fully-fledged application that requires multiple systems to run, so not that many people will be able to test it on their own.
Keeping this application in a separate repository would make sense, as we do with all the other demos. A running version should also be hosted somewhere to attract interest.
I left some minor comments, but in general, I think the idea for the app is really neat! The app looks like any standard Django application.
Thanks again for doing this effort!
| qdrant = QdrantClient(url=os.getenv("QDRANT_HOST"),prefer_grpc=False ) | ||
|
|
||
| def ensure_qdrant_collection(): | ||
| if not qdrant.collection_exists("video_transcripts"): | ||
| qdrant.create_collection( | ||
| collection_name="video_transcripts", | ||
| vectors_config=VectorParams( | ||
| size=1536, | ||
| distance=Distance.COSINE | ||
| ) | ||
| ) | ||
|
|
||
|
|
||
| def embed_and_store(user, text, metadata): | ||
| logger.info(f"[🔑] Starting embed_and_store for user {user.id} with metadata: {metadata}") | ||
|
|
||
| try: | ||
| client = OpenAI(api_key=user.openai_api_key_decrypted) | ||
| logger.info("[🧠] Initialized OpenAI client.") | ||
| except Exception as e: | ||
| logger.exception("[❌] Failed to initialize OpenAI client.") | ||
| raise e | ||
|
|
||
| try: | ||
| response = client.embeddings.create( | ||
| input=[text], | ||
| model="text-embedding-ada-002" | ||
| ) | ||
| embedding = response.data[0].embedding | ||
| logger.info("[✅] Embedding successfully created.") | ||
| except Exception as e: | ||
| logger.exception("[❌] Failed to generate embedding.") | ||
| raise e | ||
|
|
||
| try: | ||
| point_id = str(uuid.uuid4()) | ||
| logger.info(f"[🆔] Generated UUID: {point_id}") | ||
|
|
||
| point = PointStruct(id=point_id, vector=embedding, payload=metadata) | ||
| logger.info("[📦] PointStruct created.") | ||
|
|
||
| qdrant.upsert("video_transcripts", [point]) | ||
| logger.info(f"[📤] Upserted into Qdrant with point ID {point_id}") | ||
|
|
||
| return point_id | ||
|
|
||
| except Exception as e: | ||
| logger.exception("[❌] Failed to upsert into Qdrant.") | ||
| raise e |
There was a problem hiding this comment.
I think these functions do not belong here and should be put in qdrant_utils.py instead. I struggled a bit with finding them.
There was a problem hiding this comment.
Is that file required at all?
| def encrypt_value(value): | ||
| if not value: | ||
| return None | ||
| f = get_fernet() | ||
| return f.encrypt(value.encode()).decode() | ||
|
|
||
| def decrypt_value(value): | ||
| if not value: | ||
| return None | ||
| try: | ||
| f = get_fernet() | ||
| return f.decrypt(value.encode()).decode() | ||
| except InvalidToken: | ||
| return "[DECRYPTION_FAILED]" |
There was a problem hiding this comment.
I love that you implemented this! Many people will store everything in plaintext.
| class Video(models.Model): | ||
| id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False) | ||
| user = models.ForeignKey(User, on_delete=models.CASCADE, related_name="videos") | ||
| title = models.CharField(max_length=255) | ||
| description = models.TextField() | ||
| video_url = models.URLField() | ||
| created_at = models.DateTimeField(auto_now_add=True) | ||
|
|
||
| def __str__(self): | ||
| return self.title No newline at end of file |
There was a problem hiding this comment.
It is a bit confusing that the Django app is called users and there are some other models here. First of all, I checked the api app, but there were no models at all, which I found quite intriguing.
This PR adds an end-to-end example demonstrating using Qdrant for semantic search over YouTube video transcripts combined with OpenAI to generate new video ideas based on past content.
Specifically, it includes:
✅ Setup of a
video_transcriptscollection in Qdrant with vector embeddings (text-embedding-ada-002) and metadata payloads (e.g.,transcript,user_id).✅
embed_and_store()function to generate embeddings for transcripts and store them with metadata in Qdrant.✅
search_similar_transcripts()function to semantically search for transcripts similar to a query text, filtering by user ID.✅
generate_video_idea()function that uses OpenAI Chat Completions API to propose a new video idea based on retrieved similar transcripts.✨ Why This is Useful:
Demonstrates a practical use case of combining vector search (Qdrant) with language generation (OpenAI).
Shows how to store text and metadata together with embeddings for richer retrieval.
Provides a real-world example relevant to creators, marketers, and content platforms.
🔧 Technologies Used:
Qdrant Client
OpenAI Embedding API
OpenAI Chat Completions API