eordano.com

Full Text Search

2022-09-22 - written as a personal note while working on Regeneratio

Full text search is extremely relevant to our day-to-day lives in the information age. We rely on search engines, both web and in-app, to navigate through haystacks of information, and we don't necessarily have first-hand access to that data.

The Search Prompt

The web search engine we use every day is critical to how we consume information, and giants like Google and Amazon understand this well. The search box is one of the most important elements of any user interface, and in some cases it is a product in itself. Operating systems usually provide a way to search through your local files too, but it tends to be poorly crafted: slow, or working from an incorrect index.

In productivity apps, search is also a key element. Dropbox, Google Drive, and other cloud storage systems excel when you can find what you were looking for without first having to organize or page through those files by hand.

In social applications like WhatsApp or Signal, search is also a heavily used feature: finding the right contact or conversation, and the place where a particular keyword was mentioned, is an important part of what their search can provide.

Indexing Data

Most Zettelkasten-style personal information systems emphasize the importance of creating a good index for your information. The advantage is that the system learns together with you, combining what your biological brain is best at (keeping a few hot subjects in mind) with what a mechanical brain is best at: storing information with a high degree of durability.
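As a rough illustration of what "an index" means for a machine, here is a minimal sketch of an inverted index: a map from each word to the notes that contain it. The note ids and texts are made up for the example, and real systems add stemming, ranking, and persistence on top of this.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(notes: dict[str, str]) -> dict[str, set[str]]:
    """Map every word to the set of note ids that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for note_id, text in notes.items():
        for token in tokenize(text):
            index[token].add(note_id)
    return index


def lookup(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the ids of notes that contain every word of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical notes, invented for this example.
notes = {
    "note-1": "Full text search and inverted indexes",
    "note-2": "Meeting notes: search engine alternatives",
    "note-3": "Grocery list",
}
index = build_index(notes)
print(sorted(lookup(index, "search")))         # ['note-1', 'note-2']
print(sorted(lookup(index, "search engine")))  # ['note-2']
```

The index is built once and every lookup afterwards is a handful of set intersections, which is why the storing and remembering can be left to the mechanical brain.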

With cloud systems, especially those built around cloud storage, this ability is provided by a server that runs remotely and needs complete access to all your data in order to work efficiently. With the advent of personal data protection, creating better products that search through your personal information might become a challenge. Security needs to be built into these systems of remembering.

Making more data legible

Much of the information we have is not easily accessible. The blog entries you read, the stories you saw on social networks, and the messages from your friends, family, and colleagues are all stored in different application silos, and making it all searchable through a common interface is not an easy task.

Context also gets lost across channels: for example, when a conversation starts on an instant messaging service, continues on a video call, and culminates in a word processing document. The context of the activity, the situation you were in when delving into a subject, gets lost because the systems don't easily communicate with each other. Cloud systems can't link back to what you were doing in the previous application unless they are tightly integrated (for example, linking Jira with a version control system).

A browser's history could be stored with much more information: for example, a Firefox "readability"-style archive of all the websites you have explored recently. With the browser acting as most people's main OS, web applications are prevented from communicating effectively with each other, which is probably more secure but also unproductive for users.

Another problem with gathering data is that some activities are not legible at all, especially in voice-first applications. With OpenAI Whisper, a personal computing system could create a transcript of a conversation and link it back to what came before and after that conversation.
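A minimal sketch of that linking step follows. It does not call Whisper itself: it assumes the transcript already exists as a list of segments shaped like the ones the open-source Whisper package returns (dicts with `start`, `end`, and `text`). The `Activity` type, the 15-minute window, and all sample data are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    label: str    # hypothetical description of what the user was doing
    start: float  # seconds since an arbitrary reference point
    end: float


def transcript_text(segments: list[dict]) -> str:
    """Join transcript segments into a single searchable string."""
    return " ".join(segment["text"].strip() for segment in segments)


def neighbours(call: Activity, timeline: list[Activity], window: float = 900.0):
    """Find activities that ended shortly before, or started shortly after, the call."""
    before = [a for a in timeline if 0 <= call.start - a.end <= window]
    after = [a for a in timeline if 0 <= a.start - call.end <= window]
    return before, after


# Invented sample data: a call preceded by a chat and followed by a document.
segments = [
    {"start": 0.0, "end": 4.2, "text": " We're planning to go to Madrid next week."},
    {"start": 4.2, "end": 7.9, "text": " I'll write up the plan afterwards."},
]
timeline = [
    Activity("instant messaging thread", 0, 600),
    Activity("word processing document", 2500, 4000),
    Activity("unrelated browsing", 9000, 9500),
]
call = Activity("video call", 700, 2400)

before, after = neighbours(call, timeline)
record = {
    "transcript": transcript_text(segments),
    "before": [a.label for a in before],
    "after": [a.label for a in after],
}
print(record)
```

Linking by time proximity is only one possible heuristic; shared participants or shared keywords would be other plausible signals.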

From Insight to Product

Correctly weighing the importance of each occurrence of a word is no easy task. It requires context and an attempt to understand why the user is asking, and different approaches will yield different results. Google seems to be losing its might here, and competitors such as DuckDuckGo and Kagi Search are picking up the spoils. A good place to start is to weight occurrences in document titles more heavily than occurrences in the body of the text (for example, in PDFs, blog entries, or Word-like documents). Another strategy is to measure click-through rate and the time other users spend on the page (for example, in the case of web search engines).
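The title-versus-body idea can be sketched in a few lines. The weights below (3.0 for titles, 1.0 for the body) are arbitrary assumptions chosen for illustration, not tuned values, and the documents are invented.

```python
import re
from dataclasses import dataclass

# Assumed weights: a title hit counts three times as much as a body hit.
TITLE_WEIGHT = 3.0
BODY_WEIGHT = 1.0


@dataclass
class Document:
    title: str
    body: str


def count_term(term: str, text: str) -> int:
    """Count whole-word, case-insensitive occurrences of term in text."""
    return re.findall(r"[a-z0-9]+", text.lower()).count(term.lower())


def score(doc: Document, term: str) -> float:
    """Weighted sum of term occurrences in the title and in the body."""
    return (TITLE_WEIGHT * count_term(term, doc.title)
            + BODY_WEIGHT * count_term(term, doc.body))


def rank(docs: list[Document], term: str) -> list[Document]:
    """Return the documents that mention the term, best score first."""
    scored = [(score(doc, term), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for points, doc in scored if points > 0]


# Hypothetical documents, invented for this example.
docs = [
    Document("Weekly notes", "We talked about search and then search again."),
    Document("Search engines compared", "A look at several tools."),
    Document("Recipes", "Pasta and rice."),
]
for doc in rank(docs, "search"):
    print(score(doc, "search"), doc.title)
```

Here a single title mention (score 3.0) outranks two body mentions (score 2.0). Signals such as click-through rate could be folded into the same sum as additional weighted terms.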

What if all the data in your applications were accessible through the same prompt? This is the insight that drives Spotlight on macOS: search through all your iMessages, Safari history, applications, and iCloud Drive at the same time. But there might be more information that Spotlight cannot read: for example, WhatsApp messages, tweets you have written, or words you spoke or heard in previous meetings.
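One way to picture the single prompt is as a search over records that have been normalized into a common shape, whatever application they came from. The sketch below is an assumption-heavy toy: the `Record` and `Hit` types, the source labels, and the sample texts are all invented, and matching is a plain case-insensitive phrase count.

```python
from dataclasses import dataclass


@dataclass
class Record:
    source: str  # e.g. "website" or "private conversation" (made-up labels)
    title: str
    text: str


@dataclass
class Hit:
    record: Record
    mentions: int  # how many times the phrase appears in the record


def search(records: list[Record], phrase: str) -> list[Hit]:
    """Return records that mention the phrase, most mentions first."""
    needle = phrase.lower()
    hits = []
    for record in records:
        haystack = (record.title + "\n" + record.text).lower()
        mentions = haystack.count(needle)
        if mentions:
            hits.append(Hit(record, mentions))
    return sorted(hits, key=lambda hit: hit.mentions, reverse=True)


# Invented sample data standing in for several application silos.
records = [
    Record("website", "A blog post about search",
           "Which is the best search engine today?"),
    Record("private conversation", "Conversation with a friend",
           "Is it still the best search engine? I doubt it is the best "
           "search engine. Best search engine, maybe not."),
    Record("document", "Shopping list", "milk, eggs"),
]
for hit in search(records, "best search engine"):
    print(f"[{hit.record.source}] {hit.record.title} "
          f"(reason: {hit.mentions} mentions)")
```

The hard part in practice is not this loop but getting each application to export its data into something like `Record` in the first place.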

There are obvious security and privacy concerns with a product like this (let's call it Shelly, a mix of Sherlock Holmes and a personal assistant that knows exactly what you need). Access to Shelly grants access to all your conversations, documents, emails, and browsing history. As with breaking into your Google Takeout archive, the amount of information an attacker might acquire makes it a dangerously valuable target.

But the outcome might be very interesting: a far more personal exploration of the information you have created or consumed, with an understanding of the context in which you interacted with it. One could imagine an interaction with Shelly like this:

> You: Shelly, what was that article that I read a few weeks ago about Google
> losing its dominant position as the best search engine?
>
> Shelly: You read
> [Why the world needs a non-profit search
> engine](https://daoudclarke.net/search%20engines/2022/07/10/non-profit-search-engine)
> on July 10th, 2022. The term "best search engine" also appears three times in
> your August 2nd, 2022 conversation with Nicolas R.
>
> You: Great, take me to the blog post

 

But that interface doesn't need to be so "Siri"-like.

<strong>Search Prompt:</strong> "best search engine"

<strong>Results:</strong>

1. <a href="https://daoudclarke.net/search%20engines/2022/07/10/non-profit-search-engine">[Why the world needs a non-profit search engine]</a> [website] [last visit: July 10th, 2022]
2. <a href="#">[Conversation with Nicolas R.]</a> [private conversation] [August 2nd, 2022] reason: three mentions of "best search engine"
3. <a href="#">[Google donates "duck.com" to DuckDuckGo]</a> [website] [last visit: January 23rd, 2020]

 

This could be equally useful, especially when navigating with a keyboard.

 

<strong>Search Prompt:</strong> "places in Spain mentioned in conversation with Alice"

<strong>Results:</strong>

1. <a href="#">[Conversation with Alice P.]</a> [private conversation] [October 1st, 2022] fragment transcript: "we're planning to go to Madrid next week"

 

Shelly looks like something with a lot of potential, but the idea needs further exploration. It would surely ease some of the burden of looking up information that was not correctly indexed at the time, but that computers and information systems can extract from our usage patterns.