Introducing The AnVIL Data Explorer (Beta Release)
We are excited to introduce users of the AnVIL Portal to the AnVIL Data Explorer, a faceted search feature that allows you to create cohorts across datasets based on your sample-level needs.
What is the AnVIL Data Explorer?
Until now, the way to browse datasets through the AnVIL Portal has been to use the AnVIL Dataset Catalog, which allows you to organize your search results based on the workspaces that contain a subset of a dataset by its consent code. The AnVIL Dataset Catalog provides summary-level information about each dataset and study.
With the addition of the AnVIL Data Explorer, you’ll be able to sort the datasets you’re browsing based on the following managed access categories:
- Datasets
- Donors
- BioSamples
- Activities
- Files
Additionally, while the AnVIL Dataset Catalog has been limited to open-access data exploration, the AnVIL Data Explorer enables detailed searching across managed-access data.
You’ll now be able to streamline your data gathering process in a way that allows you to create custom cohorts to keep your workspaces efficient and organized. You’ll be able to choose exactly the data you want to work with!
What are Best Practices for Working with NIH data?
When working with NIH data, particularly managed access data, users must follow their data use agreements. Terra users working with NIH data agree to our Terms of Service and we request users leverage extra security features when working with NIH data.
How do I use the AnVIL Data Explorer?
Below you’ll find step-by-step instructions for navigating the AnVIL Data Explorer and exporting data to Terra. To bring data from the AnVIL Data Explorer into Terra, you’ll need to make your selection in the AnVIL Data Explorer, export this selection to a Terra workspace, and (as an interim measure) run a final step in the form of a Jupyter Notebook to retrieve all of the data relevant to your selection.
Step 1: Finding the AnVIL Data Explorer
The AnVIL Data Explorer can be found by navigating to the AnVIL Portal and clicking on the Datasets button.
Previously, this was the way to find the AnVIL Dataset Catalog. Now that we are adding the AnVIL Data Explorer feature, you’ll find both of these options in a dropdown menu under this button.
Step 2: Authentication
Make sure you have completed the necessary Terra registration steps.
Note for AnVIL Data Custodians: In many cases, AnVIL users that were part of data generating consortia will be granted access to workspaces that are already configured to receive data. However, if you intend to add data to workspaces of your own, you may need to configure your own Terra Billing Project. For detailed instructions on setting up Billing Projects, please refer to our article on “How to set up billing in Terra.”
Step 3: Selecting Data
Search and Filtering Data
You can sort and filter studies based on a wide range of facets, from sample- and donor-level information to any of the fields in the editable column view.
The facets are visible in the screen above in the column on the left. If you click on any of these facets, you can see a more detailed view of studies that fall into the filters you’ve chosen.
When you select multiple facets, only data matching all selected facets is displayed (e.g. filter by Anatomical Site AND BioSample Type). When you select multiple values within a facet, data matching any of the facet values is displayed (e.g. selecting both Blood and Tissue above will list studies that include Blood OR Tissue samples).
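The AND/OR behavior described above can be sketched in a few lines of Python. This is only an illustration of the filtering logic, not the AnVIL Data Explorer's actual implementation. The field names (`anatomical_site`, `biosample_type`), the helper functions, and the sample records are all hypothetical.

```python
from typing import Dict, List, Set

def matches(record: Dict[str, str], selected: Dict[str, Set[str]]) -> bool:
    """Return True if the record satisfies every selected facet (AND),
    where a facet is satisfied by any one of its selected values (OR)."""
    return all(
        record.get(facet) in values
        for facet, values in selected.items()
        if values  # a facet with no selected values imposes no constraint
    )

def filter_records(records: List[Dict[str, str]],
                   selected: Dict[str, Set[str]]) -> List[Dict[str, str]]:
    """Keep only the records that match the current facet selection."""
    return [r for r in records if matches(r, selected)]

# Hypothetical sample-level records, for illustration only.
records = [
    {"study": "S1", "anatomical_site": "Arm", "biosample_type": "Blood"},
    {"study": "S2", "anatomical_site": "Arm", "biosample_type": "Tissue"},
    {"study": "S3", "anatomical_site": "Leg", "biosample_type": "Blood"},
    {"study": "S4", "anatomical_site": "Arm", "biosample_type": "Saliva"},
]

# Anatomical Site = Arm AND BioSample Type = (Blood OR Tissue)
selected = {
    "anatomical_site": {"Arm"},
    "biosample_type": {"Blood", "Tissue"},
}

matching_studies = [r["study"] for r in filter_records(records, selected)]
print(matching_studies)  # ['S1', 'S2']
```

Adding a value within a facet can only widen the result set, while adding a new facet can only narrow it.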
Exploring Studies
Note: The summary pages of studies in the AnVIL Data Explorer are not yet populated. The information described below will be added for all of the studies as development progresses.
When you click on a study, you’ll be taken to a summary page where you can find a variety of information and helpful links, including but not limited to:
- What consortium the data is associated with
- The quantity and types of data
- Links to APIs for accessing the data programmatically
- Links to request access
- A button for exporting the data to a Terra workspace
Step 4: Exporting Data
To use data you’ve found through the AnVIL Data Explorer within your Terra workspace, you’ll need to follow two steps:
First, you export the data from the AnVIL Data Explorer to Terra.
Second, for now, you will need to use a publicly available notebook to fill in some of the metadata that doesn’t populate automatically.
Exporting from The AnVIL Data Explorer
Once you’re ready to export the data, you can click Export to Terra from within a particular study, or you can also click the Export button at the top right of your screen when you are on the AnVIL Data Explorer’s main page.
Clicking this button will take you to a window where you can export to a Terra workspace through the user interface.
Once you select Analyze in Terra, you’ll see a button labeled “Request Link”.
After you click this button, you will be prompted to wait while the system generates a link to Terra. Once this link is ready, you’ll see a page with a button labeled “Open Terra”. Clicking this button will take you to a workspace selection screen in Terra, where you’ll be able to select the workspace to which you’d like to add this data.
Working with the Data in Terra
Until recently, AnVIL data has been hosted and shared from multiple Terra workspaces, making it hard to generate cohorts across differing studies. To resolve this, we created the AnVIL Data Explorer, which lets you create custom cohorts and then hand them off to your own Terra workspaces.
Depending on the AnVIL dataset/study, the data in question have varying schemas (different columns and structure to the data). In an effort to ingest all of the AnVIL datasets, the Broad's Data Sciences Platform created a common subset schema across all AnVIL datasets. When you use the AnVIL Data Explorer, it actually searches through a specialized subset - called the Findability Subset (FSS) - that only contains the attributes which are most commonly used by researchers across a broad range of study data types and for diverse analyses.
At this point in development, the data that you hand off from the AnVIL Data Explorer to your Terra workspace is incomplete - it only contains the columns that were deemed relevant for the FSS used in the AnVIL Data Explorer. Currently, if you want the complete data, you will need to perform one additional step after exporting to your workspace. This step is performed by running a publicly available Jupyter Notebook, as instructed below.
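To make the idea of a column subset concrete, here is a minimal Python sketch. It only illustrates why an exported row has fewer columns than the full study record. It does not describe how the export or the notebook actually works, and every column name and value in it is hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical column names, for illustration only; the real FSS schema may differ.
FINDABILITY_COLUMNS = {"biosample_id", "donor_id", "anatomical_site", "biosample_type"}

def to_findability_subset(row: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the columns that belong to the (hypothetical) findability subset."""
    return {col: val for col, val in row.items() if col in FINDABILITY_COLUMNS}

def columns_not_exported(full_row: Dict[str, Any],
                         exported_row: Dict[str, Any]) -> List[str]:
    """List the columns present in the full record but absent from the exported row."""
    return sorted(set(full_row) - set(exported_row))

# A made-up full record with two study-specific columns.
full_row = {
    "biosample_id": "bs-001",
    "donor_id": "d-042",
    "anatomical_site": "Arm",
    "biosample_type": "Blood",
    "collection_protocol": "v2",
    "sequencing_center": "Center X",
}

exported_row = to_findability_subset(full_row)
missing = columns_not_exported(full_row, exported_row)
print(missing)  # ['collection_protocol', 'sequencing_center']
```

In this sketch, the columns listed in `missing` stand in for the study-specific attributes that the additional notebook step is meant to bring into your workspace.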
Working with NIH Data in Terra
When working with NIH data in Terra, we require users to import data to workspaces with the checkbox for protected data marked. Optionally, an Authorization Domain may be applied and is highly recommended if working with controlled access data.
In the data handoff from the AnVIL Data Explorer, you will transition to the Terra import screen. When importing from an NIH repository, you will see on the left-hand side the message “The data you chose to import to Terra are identified as protected and require additional security settings. Please select a workspace that has an Authorization Domain and/or protected data setting.”
Selecting Workspace
Next, you’ll see a workspace selection screen where you can either choose an existing workspace or create a new workspace to receive the data.
If you choose to “start with an existing workspace”, you can only select workspaces for which you have write access and that have the required security settings. If you cannot select a workspace of interest (it is grayed out), it means this workspace is non-compliant and you should create a new one with the required security settings.
If you choose to “start with a new workspace”, you will see that the import was recognized as coming from an NIH repository and the protected data checkbox is added. You can optionally add an Authorization Domain.
Once you’ve completed this step, your workspace will spin up and you can go to the Data tab of your workspace to see that a set of tables has been successfully imported into the workspace.
Using the Notebook
The data you now see in your workspace is incomplete - to retrieve the complete data, you’ll need to run a Jupyter Notebook created for this purpose called *get_non_findability_subset_data_v7.ipynb*. We’ve published this notebook in a public workspace for your convenience. To complete the import process, all you need to do is copy the notebook from the public workspace into the workspace to which you’ve imported your data, and run the notebook from within that workspace.
Copy the Notebook
Go to the public workspace containing the AnVIL FSS tool notebook, navigate to the Analyses section, and use the three-dot menu to the right of the notebook name to find the option to copy the notebook to another workspace.
Run the Notebook
Once the notebook is in the same workspace as the data, set up a Cloud Environment, open the notebook in edit mode (the default environment is sufficient), and select “Run All” from the Cell menu.
Once you’ve completed this step, you should be able to see a new set of tables in the Data tab of your workspace.