# About | How It's Made
### User Story
- As a user, I want to be presented with the most relevant news for a growing set of healthcare and technology topics. Additionally:
- I want the presentation to automatically update with more relevant news, as it becomes available.
- I want to be presented with summary content such as the key players, issues and challenges for each topic, derived from the most relevant content.
- I want the presentation to include internal links to adjacent content, so that I can continue and expand my research.
### Getting the News Firehose
Daily news content starts as a dataset of URLs, populated from the following:
- RSS feeds of several hundred national and healthcare-related publications.
- Scripted Bing News and Google Alerts feeds, configured to narrow general topics with tech and healthcare-related qualifiers.
### Turning URLs into Content
Scraping web content for the entire firehose of daily news URLs would be expensive and of limited value, so stories are initially prioritized by:
- Authoritative sources: approx. 250 of the 200,000 news sources tracked by the system are considered authoritative today. This list is continually updated.
- Healthcare and technology terms in RSS titles: titles are checked against the same growing list of terms found in the Taxonomy.

The prioritized URL list is then fed into content APIs, which return each article as JSON-structured text.
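As a rough illustration of that prioritization, here is a minimal Python sketch. The domain and term lists are hypothetical stand-ins, the function name `is_priority` is mine, and I am assuming that meeting either criterion is enough to qualify a URL.

```python
from urllib.parse import urlparse

# Hypothetical stand-ins; the real lists are far larger and change over time.
AUTHORITATIVE_DOMAINS = {"authoritative.example"}
TRACKED_TERMS = {"telehealth", "interoperability", "machine learning"}


def is_priority(url: str, title: str) -> bool:
    """Return True if the item looks worth sending on for full-content retrieval."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in AUTHORITATIVE_DOMAINS:
        return True
    # Simple case-insensitive substring check of the RSS title.
    lowered = title.lower()
    return any(term in lowered for term in TRACKED_TERMS)


feed_items = [
    ("https://www.authoritative.example/a", "Hospital earnings roundup"),
    ("https://other.example/b", "New telehealth rules announced"),
    ("https://other.example/c", "Local sports results"),
]
priority_urls = [url for url, title in feed_items if is_priority(url, title)]
```

Only the first two items survive: one because of its source, the other because of its title.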
### Adding Classification Detail
- Once received, each news item is coded via OpenAI for **Tags**, **Companies** mentioned, **Persons** mentioned, **Concepts** and **Summaries**. These appends all have a part to play later on.
- I have found that gpt-4o-mini seems to have the best balance of accuracy and cost for this classification task.
### Initial Vector Embedding
- To support true semantic scoring, each Topic (Entity) in the system has a [vector embedding](https://medium.com/kx-systems/vector-embedding-101-the-new-building-blocks-for-generative-ai-a5f598a806ba) (OpenAI text-embedding-3-small) appended.
- Further, each append made earlier (Tags, Concepts, etc.) is given a vector embedding using the same model.
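A batch embedding call can be sketched as below. The helper name `embed_texts` and its `client` parameter are my own; the helper assumes it is handed a client from the OpenAI Python SDK, whose `embeddings.create` method it calls. It is not necessarily how the production job is structured.

```python
from typing import Any, Sequence

EMBEDDING_MODEL = "text-embedding-3-small"


def embed_texts(client: Any, texts: Sequence[str]) -> list[list[float]]:
    """Embed a batch of strings and return one vector per input, in input order."""
    response = client.embeddings.create(model=EMBEDDING_MODEL, input=list(texts))
    # Each returned item carries the index of the input it belongs to.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

With the SDK installed and an API key configured, usage would look like `embed_texts(OpenAI(), ["Telehealth", "Virtual Care"])` after `from openai import OpenAI`.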
### Creating Topics|News Intersections
Each Topic and News pairing is stored in an intersection table, which enables scoring for each combination. This table holds more than 15M records today. Potential combinations are generated via two paths:
1. **Traditional Keyword Search** - Generally, instances of the Topic phrase found within the news article title and content.
2. **Semantic Scores** - Vector similarities above 0.6 are turned into Topics|News intersections for scoring. Example: *Telehealth, Virtual Care, Telemedicine, Virtual Health, Teleconsult and Remote Healthcare* will all have a high vector similarity, even though a strict keyword search will not match across those phrases.
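The semantic path can be sketched in plain Python. This assumes cosine similarity as the similarity measure and takes the best score across an article's append vectors; the function names and the toy 3-dimensional vectors are illustrative only (real embeddings have far more dimensions).

```python
import math

SIMILARITY_THRESHOLD = 0.6  # threshold quoted in the text


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_semantic_score(topic_vec, append_vecs):
    """Highest similarity between a topic and any of an article's append vectors."""
    return max((cosine_similarity(topic_vec, v) for v in append_vecs), default=0.0)


def semantic_intersections(topic_vectors, news_append_vectors, threshold=SIMILARITY_THRESHOLD):
    """Return (topic, news_id, score) for every pairing above the threshold."""
    pairs = []
    for topic, topic_vec in topic_vectors.items():
        for news_id, append_vecs in news_append_vectors.items():
            score = best_semantic_score(topic_vec, append_vecs)
            if score > threshold:
                pairs.append((topic, news_id, score))
    return pairs


# Made-up vectors purely for demonstration.
topics = {"Telehealth": [0.9, 0.1, 0.0], "Cybersecurity": [0.0, 0.2, 0.9]}
news = {"n1": [[0.8, 0.3, 0.1], [0.0, 1.0, 0.0]], "n2": [[0.1, 0.1, 0.95]]}
intersections = semantic_intersections(topics, news)
```

At 15M+ intersections a nested loop like this would be too slow; a vectorized or indexed search would be the practical choice, but the thresholding logic is the same.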
### Determining News Relevance
Once the Topics|News table has intersections to score, it's time to append signals and scores to each item. Topics|News intersections for healthcare and technology topics have the following signals:
| No. | Signal | Description |
| --- | -------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1. | **Term in Post Title** | Is the Term found within the Post Title? |
| 2. | **Instances in First Third Content** | How many instances of the Term are found within the first third of the article content? |
| 3. | **Found in RSS Summary** | Is the Term found within the RSS summary provided in Google News, etc? |
| 4. | **Found in GPT Appends** | Is the Term found within the article appends/classification provided by OpenAI? |
| 5. | **HC Terms in First Third Content** | How many of the 500+ healthcare terms being tracked are found within the first third of the article content?<br><br>Assumption: News articles containing many healthcare terms are more likely to be relevant to this audience. |
| 6. | **Tech Terms in First Third Content** | How many of the 400+ tech terms being tracked are found within the first third of the article content?<br><br>Assumption: News articles containing many technology terms are more likely to be relevant to this audience. |
| 7. | **Best Semantic Score** | What is the highest vector similarity score available between the Term and all classification terms/concepts for the article text? |
| 8. | **Is a Press Release Source** | Is the article source a known PR source? |
| 9. | **Is a Direct Healthcare Publication** | Is the article source a known healthcare publication? |
| 10. | **Is Authoritative Source** | Is the article source one of the 200+ authoritative sources? |
Rube Goldberg may want a word at this point, but that's the fun of a personal project, isn't it? Playing with the tools to create something novel and, hopefully, useful.
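To make a few of these signals concrete, here is a small sketch covering signals 1, 2 and 5. The names are hypothetical, and I am assuming the "first third" is measured in characters and that terms are matched as case-insensitive whole phrases; the live system may do either differently.

```python
import re
from dataclasses import dataclass


@dataclass
class Signals:
    term_in_title: bool            # signal 1
    term_count_first_third: int    # signal 2
    hc_terms_first_third: int      # signal 5


def first_third(text: str) -> str:
    """Return the leading third of the text, measured in characters."""
    return text[: len(text) // 3]


def count_occurrences(term: str, text: str) -> int:
    """Count case-insensitive whole-phrase matches of term in text."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def compute_signals(term, title, content, hc_terms):
    lead = first_third(content)
    return Signals(
        term_in_title=count_occurrences(term, title) > 0,
        term_count_first_third=count_occurrences(term, lead),
        # Number of distinct tracked healthcare terms appearing in the lead.
        hc_terms_first_third=sum(1 for t in hc_terms if count_occurrences(t, lead) > 0),
    )


content = (
    "Telehealth visits rose as hospital systems expanded telehealth access. "
    + "Filler sentence about other matters. " * 6
)
signals = compute_signals(
    term="Telehealth",
    title="Telehealth adoption grows",
    content=content,
    hc_terms={"hospital", "payer", "telehealth"},
)
```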
### Appending Scores
- The above signals are now scored using an algorithm that is constantly updated.
- Note that Source entities such as [[Beckers ASC Review]] will only have a subset of the established signals (#5, #6, #8, #9 and #10) as we are filtering for the most generally relevant content within that source, rather than for a particular phrase.
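The actual scoring algorithm changes often and is not reproduced here. Purely to show the shape of the idea, the sketch below uses a weighted sum keyed by the signal numbers in the table. Every weight (including its sign) is a made-up placeholder, and the linear form itself is an assumption; the only detail taken from the text is the subset of signals used for Source entities.

```python
# Placeholder weights keyed by signal number; values and signs are guesses.
WEIGHTS = {1: 3.0, 2: 1.0, 3: 1.5, 4: 1.5, 5: 0.2, 6: 0.2, 7: 4.0, 8: -2.0, 9: 1.0, 10: 2.0}

# Subset of signals that applies to Source entities, as listed in the text.
SOURCE_ENTITY_SIGNALS = {5, 6, 8, 9, 10}


def score(signals: dict, is_source_entity: bool = False) -> float:
    """Weighted sum of signal values; Source entities skip the phrase-based signals."""
    total = 0.0
    for number, value in signals.items():
        if is_source_entity and number not in SOURCE_ENTITY_SIGNALS:
            continue
        total += WEIGHTS[number] * value
    return total


example = {1: 1, 2: 2, 5: 4, 7: 0.8, 10: 1}
topic_score = score(example)
source_score = score(example, is_source_entity=True)
```

For the same article, the Source entity score only reflects signals 5 and 10 here, because the title, content-instance and semantic signals are tied to a specific phrase.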
### Determining the News Set
News sets are calculated weekly, as follows:
- The top 1/3 of news articles (by score) for each topic are flagged, and will be used in the next step to generate executive summaries.
- The top 25 news articles for each topic are flagged to be included on the list of links within each page. In this manner the most relevant content is constantly 'rising to the top'.
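The weekly flagging step for a single topic could look like the sketch below. The function name is hypothetical, and rounding the top third up (so small topics still get at least one article) is my assumption.

```python
import math


def flag_news_set(scored, link_limit=25):
    """scored: list of (news_id, score) for one topic.

    Returns (summary_ids, link_ids): the top third by score for executive
    summaries, and the top `link_limit` for the page's list of links.
    """
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    third = math.ceil(len(ranked) / 3)
    summary_ids = [news_id for news_id, _ in ranked[:third]]
    link_ids = [news_id for news_id, _ in ranked[:link_limit]]
    return summary_ids, link_ids


scored = [("a", 5), ("b", 9), ("c", 1), ("d", 7), ("e", 3), ("f", 8), ("g", 2)]
summary_ids, link_ids = flag_news_set(scored, link_limit=5)
```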
### Executive Summaries
- Depending on the taxonomy for each topic, the top 1/3 news summaries by score are now coded (gpt-4o-mini) for items such as **Key Players** (companies) mentioned, **Partnerships and Collaborations** found, **Innovations** and **Challenges** articulated in the texts.
- These summary lines are then:
    - Parsed and aggregated to surface new healthcare/tech topics that should be added to the system.
    - Vector embedded to enable duplication checking.
    - Scored and ranked for inclusion on each page.
- Note: I have found that a 2-step prompt process results in better accuracy - first narrowing the text to a concise summary, then submitting that same summary to query for the relevant fact.
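The 2-step prompt process can be sketched independently of any particular SDK. Here `complete` is a hypothetical callable that sends one prompt to a chat model (gpt-4o-mini in my case) and returns the text reply; the prompt wording is illustrative, not the production prompts.

```python
from typing import Callable


def extract_fact(article_text: str, question: str, complete: Callable[[str], str]) -> str:
    """Two-step extraction: summarize first, then query the summary only."""
    # Step 1: narrow the full text down to a concise summary.
    summary = complete(
        "Summarize the following article in three sentences or fewer:\n\n" + article_text
    )
    # Step 2: ask for the specific fact using the summary, not the full text.
    return complete(
        f"Using only this summary, answer: {question}\n\nSummary:\n{summary}"
    )
```

Passing the model call in as a parameter keeps the flow easy to test with a stub and easy to point at a different model later.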
### Related Topics
- Related Topics are determined by vector similarity between all entities in the system, and are updated weekly as new topics are added.
### Publishing
- Pages are created in Markdown format and pushed to a file, which is synced to an Obsidian Publish site.
- Obsidian handles the integrated search functionality as well as internal links between pages, so `[[340B Program]]` in the .md page will always work as an internal link. This avoids hard-coding the absolute URL to that page, which is brittle.
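A page generator along these lines might look like the following sketch. The page layout, section names and function names are hypothetical; the only detail taken from the text is that related topics are written as `[[wikilinks]]` and each page is saved as a .md file for Obsidian to sync.

```python
from pathlib import Path


def wikilink(title: str) -> str:
    """Obsidian-style internal link, resolved by note name rather than URL."""
    return f"[[{title}]]"


def render_page(topic, news, related):
    """Build the Markdown for one topic page.

    news: list of (title, url) tuples; related: list of topic names.
    """
    lines = [f"# {topic}", "", "### Top News"]
    lines += [f"- [{title}]({url})" for title, url in news]
    lines += ["", "### Related Topics"]
    lines += [f"- {wikilink(name)}" for name in related]
    return "\n".join(lines) + "\n"


def write_page(vault_dir, topic, markdown):
    """Write the page into the vault folder that Obsidian Publish syncs."""
    path = Path(vault_dir) / f"{topic}.md"
    path.write_text(markdown, encoding="utf-8")
    return path
```

Topic names containing characters that are not valid in file names (such as `/`) would need sanitizing before being used as the file name.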