About our use of Large Language Models

The Distributed Design Learning Hub relies on a combination of editorial content and curation, and the use of Large Language Models (LLMs) to assist with discovering information and resources in the Distributed Design Platform archive.

Wherever you see text set in a light blue colour and preceded by the "sparkle" emoji, that text has been generated by an LLM. Such text will always be preceded by a notice linking to this page for further details. All other content on this site has been written, by hand, by a subject-matter expert from the Distributed Design Platform community. At present, LLM-generated text is used only in the responses to custom-theme queries made via the search box on the homepage.

LLM-generated text will always be an automatically generated summary of content from within the Distributed Design Platform archive, and the sources used to generate it will always be linked and prominently referenced.

If you encounter any LLM-generated content which you believe to be inaccurate, discriminatory, or harmful, please report it to us using the feedback form, and we will review it and take action.

Some tips for working with LLM-generated summaries

Our commitment to responsible use of machine learning technologies

We are committed to using machine learning technologies such as LLMs in a way which is consistent with the principles of Distributed Design, and to taking concrete steps to mitigate the ecological and social harms to which they can give rise.

In particular, we aim to use LLMs in a way which minimises these harms.

We do so through our choice of technologies: we selected the Mixtral 8x22B model not only for its performance, but also because it offers robust guardrails against discriminatory and harmful content, and requires less computational power, and therefore fewer material resources, than other similar models.

This commitment is also reflected as a design concern: we use LLMs only as a fallback, to generate specific content when we do not already have content available, and we cache responses to keep the amount of text generated to a minimum. Our strategy of producing only successive summaries of text from within the Distributed Design Platform archive also reduces the likelihood that discriminatory or harmful content is generated.
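As a rough illustration of the caching strategy described above, here is a minimal sketch in Python. The `cached_summary` helper, its key scheme, and the in-memory store are illustrative assumptions for this page, not our actual implementation:

```python
import hashlib

# Hypothetical in-memory cache; a real deployment would persist this.
_cache = {}

def cached_summary(query, generate):
    """Return a cached LLM response for a query, calling the model only on a miss."""
    # Normalise the query so trivially different phrasings share a cache entry.
    key = hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = generate(query)  # the LLM is invoked only here
    return _cache[key]
```

With this shape, repeated or near-identical queries reuse the stored response rather than triggering fresh generation.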

Finally, we tested and audited the system for bias and harm before launch, and we review all generated content on an ongoing basis for accuracy and fairness. We hope that, as well as providing a useful service, the Distributed Design Learning Hub might also serve as an example of best practice for incorporating machine learning technologies into projects in a way that reflects and respects the values of the Distributed Design community.

How we use LLMs to generate summaries

The Learning Hub search interface operates, broadly, as a Retrieval-Augmented Generation (RAG) system.

Our editorial team have written, by hand, a summary for each document in the Learning Hub archive, and have categorised each document into thematic areas and more general tags. This editorial content, plus the text of the document itself, is used to create embeddings - mathematical representations of portions of the document which can be compared to each other (and to text queries) in a way that allows us to query the resulting database by semantic similarity.
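The indexing step above can be sketched as follows. This is a simplified illustration: the `fragment` chunking scheme, the chunk size, and the pluggable `embed` callable are assumptions for this sketch (in production the embedding call goes to the mistral-embed model):

```python
def fragment(text, words_per_chunk=100):
    """Split a document's text into fixed-size word chunks for embedding."""
    tokens = text.split()
    return [" ".join(tokens[i:i + words_per_chunk])
            for i in range(0, len(tokens), words_per_chunk)]

def index_document(doc_id, editorial_summary, body, embed):
    """Embed the editorial summary plus each body fragment for one document,
    returning one record per embedded piece of text."""
    records = []
    for piece in [editorial_summary] + fragment(body):
        records.append({"doc": doc_id, "text": piece, "vector": embed(piece)})
    return records
```

Each record pairs a fragment of text with its vector, so the store can later be searched by semantic similarity rather than keywords.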

When you search for a custom theme using the search form on the homepage, your query is also converted into an embedding, from which we retrieve the document fragments most similar to your query (each scored by how similar it is). We then group these by the document they were taken from, and sort them so that the most similar document appears first - these are the search results you see alongside the summary on the results page.
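The retrieval and grouping steps can be sketched like this. It is a toy in-memory version of what the database does for us; the `retrieve` function, its `top_k` cutoff, and the record shape are assumptions of the sketch:

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two vectors (1.0 means identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, records, top_k=10):
    """Score every fragment against the query, keep the best top_k,
    group them by source document, and order documents so the one
    containing the most similar fragment comes first."""
    scored = sorted(records,
                    key=lambda r: cosine(query_vec, r["vector"]),
                    reverse=True)[:top_k]
    by_doc = defaultdict(list)
    for record in scored:
        by_doc[record["doc"]].append(record)
    return sorted(by_doc.items(),
                  key=lambda kv: cosine(query_vec, kv[1][0]["vector"]),
                  reverse=True)
```

Because fragments are appended in score order, each document's first fragment is its best match, which is what the final document-level sort uses.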

For these documents, we then provide the text of the retrieved fragments to a language model, along with a prompt asking it to summarise how the topic of your query is addressed in the document. Any documents with no relevant information are skipped, and we continue down the list until we have a maximum of three summaries. These are then presented back to you, along with links to the sources.
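That loop might look like the following sketch. The `ask_llm` callable, the prompt wording, and the empty-reply convention for signalling an irrelevant document are assumptions made for illustration:

```python
def summarise_documents(grouped_hits, ask_llm, max_summaries=3):
    """Walk the ranked documents, asking the LLM how the query topic is
    addressed in each; skip documents with no relevant information and
    stop once max_summaries summaries have been collected."""
    summaries = []
    for doc_id, fragments in grouped_hits:
        text = "\n".join(f["text"] for f in fragments)
        answer = ask_llm("Summarise how the topic is addressed in:\n" + text)
        if answer:  # assumed convention: an empty reply marks an irrelevant document
            summaries.append({"doc": doc_id, "summary": answer})
        if len(summaries) == max_summaries:
            break
    return summaries
```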

Finally, we provide the text of all the fragments of the summarised documents to the language model, along with a prompt to summarise the topic in general, using the information provided. This is added as the first sentence of the summary.
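This final step can be sketched as below: the fragments of only the summarised documents are pooled and sent to the model once more, and the result is placed ahead of the per-document summaries. The `build_response` helper, its prompt text, and the returned structure are illustrative assumptions:

```python
def build_response(grouped_hits, summaries, ask_llm):
    """Generate an overall one-sentence summary from all fragments of the
    summarised documents, and prepend it to the per-document summaries."""
    chosen = {s["doc"] for s in summaries}
    all_text = "\n".join(fragment["text"]
                         for doc_id, fragments in grouped_hits
                         if doc_id in chosen
                         for fragment in fragments)
    overview = ask_llm("Summarise the topic in one sentence, using:\n" + all_text)
    return {"overview": overview, "summaries": summaries}
```

Note that fragments from skipped documents never reach the model at this stage, since only the summarised documents are pooled.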

For those who want more technical details: our embeddings are created with the mistral-embed model, stored in an Elasticsearch database, and similarity is measured by the cosine distance between the query and the fragment. We use the Mixtral 8x22B large language model to generate summaries, and llama_index as plumbing for the whole system. The entire application is open source, and available on GitHub.