Machine Learning in Galaxy

Text and data mining (TDM) relies mostly on statistical, machine learning, and artificial intelligence methods, algorithms, and technologies. Several projects provide these tools in open access.

LAPPS Grid

  • Vassar College, Poughkeepsie, NY, USA
  • Johns Hopkins University, Baltimore, MD, USA

The LAPPS Grid is an open, interoperable Web service platform, based on Galaxy, for natural language processing (NLP) research and development.

The Grid provides seamless access to popular public tools (such as Stanford NLP, OpenNLP, NLTK, and LingPipe), as well as a variety of tools and modules available in GATE (General Architecture for Text Engineering) and various UIMA (Unstructured Information Management Architecture) platforms, machine learning facilities, and a state-of-the-art Open Advancement (OA) evaluation system developed at Carnegie Mellon University and used in the development of IBM’s Jeopardy-winning Watson. The LAPPS Grid also provides access to several mainstream resources and, through federation with two major EU-CLARIN frameworks, access to hundreds of additional tools and data sources in multiple Western European languages. Most crucially, the LAPPS Grid allows users to use all of the tools and resources it provides interoperably in a seamless “plug-and-play” workflow environment, thereby eliminating the effort required to harmonize input and output formats to use a set of tools together.

OpenMinTeD

  • Athena Research and Innovation Center in Information, Communication and Knowledge Technologies

OpenMinTeD has brought together, on a single platform, Open Access (OA) scholarly content from a wide range of providers and Text and Data Mining (TDM) tools from various Natural Language Processing (NLP) frameworks. Its overarching goal has been to help users who want to mine scientific literature by letting them run workflows built from these TDM tools.

A Machine Learning Tool Suite for Galaxy

  • Department of Biomedical Engineering, Oregon Health and Science University
  • Department of Computer Science, University of Freiburg

To make machine learning available to the Galaxy community, a machine learning tool suite for Galaxy was developed. It includes tools for normalization/standardization, feature selection, model training, and evaluation. Tools can be connected into complete machine learning pipelines, such as a pipeline that (i) normalizes features, (ii) selects features based on importance, and (iii) uses the selected features to create a predictive model. There are 8 standardization approaches, 12 feature selection approaches, more than 40 different types of models (including linear, Bayesian, kernel-based, and ensemble approaches), and 6 evaluation methods, including area under the curve and cross-validation. The tools are also suitable for advanced approaches to machine learning, including learning on imbalanced datasets, stacking/blending, and deep learning.
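The three-step pipeline described above can be sketched directly in scikit-learn, one of the toolkits the suite builds on. This is a minimal illustration, not the Galaxy tools themselves: the synthetic dataset, the step names, and the choice of a random forest for importance-based selection and logistic regression as the final model are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for features quantified by upstream tools (assumption).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

pipeline = Pipeline([
    # (i) normalize features
    ("normalize", StandardScaler()),
    # (ii) keep features whose random-forest importance is above the mean
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0)
    )),
    # (iii) train a predictive model on the selected features
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# Evaluate with area under the ROC curve on held-out data.
auc = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])
n_selected = int(pipeline.named_steps["select"].get_support().sum())
print(f"selected {n_selected} of {X.shape[1]} features; test AUC = {auc:.3f}")
```

Because every step is fitted only on the training split, the scaler and the feature selector never see the held-out data, which is the main reason to chain the steps in one pipeline rather than run them separately.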

Tools and pipelines can be customized in a variety of ways, such as by setting model parameters or selecting a particular scoring function. In summary, the tool suite can be used to create thousands of different machine learning pipelines. There is also a tool for hyperparameter optimization, either for an individual model or for an entire pipeline. The tools are implemented using popular Python toolkits such as scikit-learn, xgboost, and scikit-rebate.
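Hyperparameter optimization over an entire pipeline can be illustrated with scikit-learn as follows. This is a sketch under stated assumptions: the source does not say which search strategy the Galaxy tool uses, so an exhaustive grid search is used here, and the dataset, step names, and parameter grid are invented for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data for illustration only (assumption).
X, y = make_classification(
    n_samples=200, n_features=15, n_informative=4, random_state=1
)

pipeline = Pipeline([
    ("normalize", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("model", SVC()),
])

# Parameters of any step are addressed as "<step name>__<parameter name>",
# so the feature selector and the model are tuned jointly.
param_grid = {
    "select__k": [4, 8, 15],
    "model__C": [0.1, 1.0, 10.0],
}

# The scoring function is configurable; ROC AUC with 5-fold
# cross-validation is chosen here.
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Swapping `scoring="roc_auc"` for another metric name changes which configuration is considered best without touching the pipeline definition.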

These machine learning tools can be combined with other Galaxy tools to create end-to-end workflows in which primary data, such as genomics or imaging datasets, are processed to quantify features, and machine learning is then used to develop a model that predicts phenotypic attributes from those features.

Galaxy-E: OpenRefine, R Shiny

  • French National Museum of Natural History (Muséum national d’Histoire naturelle)

The Galaxy-E project integrated two interactive tools: OpenRefine and R Shiny.

OpenRefine is a powerful open-source tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with Web services and external data.

Shiny is an R package that makes it easy to build interactive Web apps straight from R. You can host standalone apps on a Web page, embed them in R Markdown documents, or build dashboards.

Three years ago, the idea of a Galaxy for ecology project emerged to help members of biodiversity-oriented citizen science projects share data, tools, and analytical processes. Since 2018, Galaxy-E has evolved into a core component of French initiatives (the “65 Millions d’observateurs” national project, http://cesco.mnhn.fr/fr/65-millions-dobservateurs-6094, and the “Pôle national de données de biodiversité” national research e-infrastructure, http://www.patrinat.fr/fr/pole-national-de-donnees-de-biodiversite-pndb-6256) and of international initiatives (H2020 GAPARS, http://gapars.mmos.ch/; H2020 EOSC; the EuroGEOSS “Biodiversity and Ecosystem” Action Group; the GEO BON French Essential Biodiversity Variables operationalization pilot; and the GO FAIR BiodiFAIRse Implementation Network, https://www.go-fair.org/implementation-networks/overview/biodifairse/).

Jupyter Notebook, RStudio in Galaxy (RealTimeTools)

Project Jupyter exists to develop open-source software, open standards, and services for interactive computing across dozens of programming languages. The Jupyter Notebook is an open-source Web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. Its uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

Tool Prediction in Galaxy Workflows using Deep Learning

  • Freiburg Galaxy Team, Bioinformatics lab, University of Freiburg, Freiburg, Germany

A recommendation system was developed to predict which tools a user is likely to need next in a workflow. The system analyses the complete set of workflows available on Galaxy’s European server and uses a deep learning approach to create a tool prediction model.

Workflows are directed acyclic graphs. To create the predictive model, sequences (paths) of tools are extracted from these graphs and learned by a deep learning model (a gated recurrent neural network). The hyperparameters of the deep learning model are optimized using Bayesian optimization. The usage frequency of tools is integrated into the model so that tools that have not been used recently do not appear in the set of suggested tools. This is achieved by learning the usage of each tool over time with a support vector regression model.
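The path-extraction step can be sketched in plain Python. The code below is an illustrative assumption about how such preprocessing might look, not the Freiburg team's implementation: the function names `extract_tool_paths` and `training_pairs`, the edge-list input format, and the example workflow are all hypothetical. It enumerates every source-to-sink tool sequence in a workflow graph and then turns each sequence into (tools so far, next tool) examples of the kind a sequence model could be trained on.

```python
from collections import defaultdict


def extract_tool_paths(edges):
    """Return every source-to-sink tool sequence in a workflow DAG.

    `edges` is a list of (upstream_tool, downstream_tool) pairs.
    """
    children = defaultdict(list)
    has_parent = set()
    nodes = set()
    for src, dst in edges:
        children[src].append(dst)
        has_parent.add(dst)
        nodes.update((src, dst))

    # Tools with no incoming connection start a path.
    roots = sorted(nodes - has_parent)
    paths = []

    def walk(node, prefix):
        path = prefix + [node]
        if not children.get(node):
            # No outgoing connection: the path is complete.
            paths.append(path)
            return
        for child in children[node]:
            walk(child, path)

    for root in roots:
        walk(root, [])
    return paths


def training_pairs(paths):
    """Turn each path into (tools so far, next tool) examples."""
    pairs = []
    for path in paths:
        for i in range(1, len(path)):
            pairs.append((tuple(path[:i]), path[i]))
    return pairs


# Hypothetical workflow: one upload step feeding two branches.
workflow = [
    ("upload", "fastqc"),
    ("upload", "trimmomatic"),
    ("trimmomatic", "bowtie2"),
    ("bowtie2", "featurecounts"),
]
paths = extract_tool_paths(workflow)
pairs = training_pairs(paths)
for context, next_tool in pairs:
    print(" -> ".join(context), "=>", next_tool)
```

Because the graph is acyclic, the recursive walk always terminates. In a real setting the tool names would be mapped to integer indices before being fed to a recurrent network.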

An API was developed to predict tools and to display the predictions in a user interface. It can be used in the Galaxy workflow editor, although it is not yet publicly available. The API can also be integrated into other user interfaces. With the tool recommendation system, a user does not need to search the toolbox for tools when creating a workflow: the candidate tools are listed in the “recommended tools” modal popup.