---
title: MOSAICapp
colorFrom: indigo
colorTo: blue
sdk: docker
pinned: false
---


# MOSAICapp

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18394317.svg)](https://doi.org/10.5281/zenodo.18394317)

A web application for topic modelling of phenomenological reports using BERTopic and transformer embeddings.

**Web app:** [huggingface.co/spaces/romybeaute/MOSAICapp](https://huggingface.co/spaces/romybeaute/MOSAICapp)

## Statement of Need

Consciousness research increasingly relies on open-ended subjective reports to capture the richness of lived experience. Structured questionnaires like the Altered States of Consciousness scales or the MEQ impose predefined categories that can miss unexpected experiential dimensions.

MOSAICapp provides an alternative: instead of forcing reports into predefined categories, it uses neural topic modelling to discover thematic structure directly from the text. This "wide-angle" approach lets researchers see what participants actually describe before committing to a categorical framework.

The tool is designed for consciousness researchers, phenomenologists, and qualitative researchers working with text data who want computational analysis without writing code.

## Features

- **No-code interface:** upload CSV, configure parameters, download results
- **Sentence-level analysis:** optional segmentation for finer-grained themes
- **Interactive visualisations:** 2D topic maps, hierarchical clustering, topic distributions
- **LLM topic labelling:** automatic generation of interpretable labels (full version)
- **Python API:** `mosaic_core` library for programmatic use and batch processing



---

## 1. Quick Start (No Installation)

The easiest way to use MOSAICapp is via the hosted web interface. No coding or installation is required.

**[Launch MOSAICapp on Hugging Face](https://huggingface.co/spaces/romybeaute/MOSAICapp)**

*Note: The hosted version runs on shared resources. For large datasets or privacy-sensitive data, we recommend the local installation below.*


---

## 2. Local Installation

Run the app on your own machine to use custom GPUs, process sensitive data locally, or modify the code.

### Prerequisites
- Python 3.9+
- Git


### Setup steps

```bash
git clone https://github.com/romybeaute/MOSAICapp.git
cd MOSAICapp

# Create virtual environment (recommended)
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies and the package
pip install -r requirements.txt
pip install .

# Download NLTK data (required for segmentation)
python -c "import nltk; nltk.download('punkt')"
```

---

## 3. Configuration & Running


### Run the app
```bash
streamlit run app.py
```

### LLM Setup (Optional)
To use the automated topic labelling feature (Llama-3), you must provide a Hugging Face access token. The app uses this token to access the inference API.

1. **Get a token:** log in to Hugging Face and create a token with "Read" permissions.
2. **Configure the local app:**
   - Create a folder named `.streamlit` in the repository root.
   - Inside it, create a file named `secrets.toml`.
   - Add your token to the file:
     ```toml
     HF_TOKEN = "hf_..."
     ```
   - Note: this file is ignored by Git to protect your credentials.


---

## 4. Running Tests
We include a test suite to verify the installation and core logic. This is useful to check if your environment is set up correctly.

**Run everything:**
```bash
pytest tests/ -v
```

**Run only fast tests:**
```bash
pytest tests/test_core_functions.py -v
```

The tests automatically load a dummy dataset included in the repo and verify:

- Data loading (CSV parsing)
- Embedding generation
- Topic modelling pipeline
- Visualisation outputs

---

## 5. Python API (Advanced Usage)
MOSAICapp is also a Python library. You can import `mosaic_core` in your own scripts or Jupyter Notebooks for batch processing or custom analysis pipelines.

### Library usage
```python
from mosaic_core.core_functions import preprocess_and_embed, run_topic_model

# 1. Load and Preprocess
docs, embeddings = preprocess_and_embed("data.csv", text_col="report")

# 2. Configure Parameters
config = {
    "umap_params": {"n_neighbors": 15, "n_components": 5},
    "hdbscan_params": {"min_cluster_size": 10},
    "bt_params": {"nr_topics": "auto"}
}

# 3. Run Model
model, reduced_embeddings, topics = run_topic_model(docs, embeddings, config)
```
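
For batch processing, it can be convenient to derive several variants of the configuration from one baseline. The following is a minimal sketch under that assumption: `with_overrides` is a hypothetical helper, not part of `mosaic_core`, and the swept values are illustrative only.

```python
import copy

# Baseline copied from the library usage example above.
BASE_CONFIG = {
    "umap_params": {"n_neighbors": 15, "n_components": 5},
    "hdbscan_params": {"min_cluster_size": 10},
    "bt_params": {"nr_topics": "auto"},
}


def with_overrides(base, overrides):
    """Return a deep copy of `base` with per-section overrides applied.

    Hypothetical helper: the baseline dict is never mutated, so it can be
    reused for every run in a sweep.
    """
    merged = copy.deepcopy(base)
    for section, params in overrides.items():
        merged.setdefault(section, {}).update(params)
    return merged


# Illustrative sweep over HDBSCAN's min_cluster_size.
configs = [
    with_overrides(BASE_CONFIG, {"hdbscan_params": {"min_cluster_size": size}})
    for size in (5, 10, 20)
]

for cfg in configs:
    print(cfg["hdbscan_params"], cfg["umap_params"])
```

Each resulting dict has the same shape as `config` in the example above, so it could be passed as the third argument to `run_topic_model` in a loop.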





## Input format

CSV file with a text column. The app auto-detects columns named `text`, `report`, `reflection_answer`, or `reflection_answer_english`. Any column can also be selected manually.
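
To check in advance whether a file's text column would be picked up, the detection rule can be approximated as below. This is a sketch, not the app's actual code: `detect_text_column` is a hypothetical function, and both the priority order (the order the names are listed above) and the case-insensitive matching are assumptions.

```python
import csv
import io

# Column names listed in this README as auto-detected.
CANDIDATE_TEXT_COLUMNS = [
    "text",
    "report",
    "reflection_answer",
    "reflection_answer_english",
]


def detect_text_column(header):
    """Return the first header entry matching a candidate name, or None.

    Assumption: matching ignores case and surrounding whitespace; the app's
    own rule may be stricter.
    """
    normalised = {name.strip().lower(): name for name in header}
    for candidate in CANDIDATE_TEXT_COLUMNS:
        if candidate in normalised:
            return normalised[candidate]
    return None


# Hypothetical two-row CSV standing in for an uploaded file.
sample_csv = (
    "participant_id,Report\n"
    "1,I saw shifting geometric patterns\n"
    "2,Colours pulsed with the light\n"
)
header = next(csv.reader(io.StringIO(sample_csv)))
text_col = detect_text_column(header)
print(text_col)
```

If the function returns `None`, the column would need to be selected manually.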


---


## How it works

MOSAICapp implements a BERTopic pipeline: texts are embedded using sentence transformers, reduced with UMAP, clustered with HDBSCAN, and labelled using c-TF-IDF (with optional LLM refinement). This approach captures semantic context better than older bag-of-words methods like LDA.

For methodological details, see the [MOSAIC paper](https://arxiv.org/abs/2502.18318).
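
The c-TF-IDF step can be illustrated without any dependencies. The sketch below assumes the formulation given in the BERTopic documentation: all documents in a cluster are joined into one class document, and each term is scored as its normalised frequency in that class times `log(1 + A / f_t)`, where `A` is the average number of words per class and `f_t` is the term's count across all classes. The whitespace tokenisation and the toy clusters are simplifications for illustration; this is not MOSAICapp's implementation.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score each term per topic with class-based TF-IDF.

    `docs_per_topic` maps a topic id to the list of documents assigned to it.
    Returns {topic: {term: score}}.
    """
    # Join each topic's documents into one class document and count terms.
    # Whitespace splitting keeps the sketch short; BERTopic uses a vectorizer.
    class_counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }

    # f_t: frequency of each term across all classes.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # A: average number of words per class.
    avg_words = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for topic, counts in class_counts.items():
        class_total = sum(counts.values())
        scores[topic] = {
            term: (tf / class_total) * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
    return scores


# Toy clusters, invented for illustration only.
toy_clusters = {
    0: ["flashing colours and colours everywhere", "bright colours"],
    1: ["a deep calm", "calm and stillness"],
}
scores = c_tf_idf(toy_clusters)
top_terms = {topic: max(terms, key=terms.get) for topic, terms in scores.items()}
print(top_terms)
```

Terms that are frequent within one cluster but rare elsewhere score highest, which is why a word shared by both toy clusters ("and") ranks below cluster-specific words.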



---

## Research applications

MOSAICapp has been used to analyse:

- Stroboscopic light experiences from the Dreamachine project
- Descriptions of "pure awareness" from the Minimal Phenomenal Experience study  
- Psychedelic experience reports (DMT, 5-MeO-DMT micro-phenomenological interviews)

## Citation

```bibtex
@article{beaute2025mosaic,
  title={Mapping of Subjective Accounts into Interpreted Clusters (MOSAIC): 
         Topic Modelling and LLM Applied to Stroboscopic Phenomenology},
  author={Beauté, Romy and Schwartzman, David J and Dumas, Guillaume and 
          Crook, Jennifer and Macpherson, Fiona and Barrett, Adam B and Seth, Anil K},
  journal={arXiv preprint arXiv:2502.18318},
  year={2025}
}
```

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on reporting bugs, suggesting features, and contributing code.



## License

MIT

## Acknowledgements

Built with [BERTopic](https://github.com/MaartenGr/BERTopic) by Maarten Grootendorst. Funded by the Be.AI Leverhulme doctoral scholarship at the Sussex Centre for Consciousness Science.