MaRDA Metadata Extractors Schema

This file describes the Extractor model created by the MaRDA extractors WG.

URI: https://marda-alliance.github.io/metadata_extractors_schema/main/mme_schema/

Name: extractor

Schema Diagram

erDiagram
    Extractor {
        string id
        string name
        string description
        stringList subject
        string source_repository
        string documentation
        string instructions
    }
    Citation {
        string uri
        stringList creators
        stringList contributors
        string title
        string type
    }
    License {
        string uri
        string spdx
    }
    SupportedFileType {
        string id
        string description
    }
    UsageTemplate {
        string input_path
        string input_type
        string output_path
        string output_type
    }
    Usage {
        UsageTypes method
        string setup
        string command
        UsageScope scope
        stringList supported_filetypes
    }
    Installation {
        InstallerTypes method
        string requires_python
        string requirements
        stringList packages
    }
    FileType {
        string id
        string description
        string name
        stringList subject
        stringList associated_vendors
        stringList associated_instruments
        stringList associated_software
        stringList associated_file_extensions
        stringList associated_standards
        stringList registered_extractors
    }

    Extractor ||--}o Citation : "citations"
    Extractor ||--|| License : "license"
    Extractor ||--}| SupportedFileType : "supported_filetypes"
    Extractor ||--}o SupportedFileType : "supported_output_filetypes"
    Extractor ||--}o Usage : "usage"
    Extractor ||--}o Installation : "installation"
    SupportedFileType ||--|o UsageTemplate : "template"
    FileType ||--}o FileType : "base_formats"

Classes

Class

Description

Citation

A container for a citation or another form of attribution for the parent resource.

Extractor

A script, code, or web service that, when executed, can extract information from a supplied “file” with a specific FileType.

FileType

A specific encoding of data for storage purposes. A FileType is defined by a set of common characteristics and expectations that can be assumed for all files of a given file type.

Installation

A machine-actionable specification of a set of installation instructions for the parent Extractor.

License

A container for the licensing information related to the parent resource.

SupportedFileType

A specification of a FileType supported by the parent Extractor.

Usage

A machine-actionable specification of a set of usage instructions for the parent Extractor.

UsageTemplate

A container for string substitution templates used in a usage specification; see the Usage class.
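To make the relationships between these classes concrete, the following is a minimal sketch of an Extractor record as a plain Python dictionary, together with a check of two relationships from the schema diagram. Every value is hypothetical, the enumeration values "pip" and "cli" are assumptions (this page does not list the permissible values), and reading the diagram markers as "exactly one license" and "at least one supported file type" is an interpretation rather than a statement from the schema authors.

```python
# Hypothetical Extractor record; all values are illustrative only.
example_extractor = {
    "id": "example-extractor",
    "name": "Example Extractor",
    "description": "Parses files of a hypothetical 'example-format' FileType.",
    "subject": ["example technique"],
    "license": {"spdx": "MIT"},
    "citations": [
        {"title": "Example Extractor", "creators": ["A. Author"], "type": "software"}
    ],
    "supported_filetypes": [{"id": "example-format"}],
    # "pip" and "cli" are assumed enumeration values, not taken from this page.
    "installation": [{"method": "pip", "packages": ["example-extractor"]}],
    "usage": [{"method": "cli", "command": "example-extract input.dat"}],
}


def check_cardinalities(record: dict) -> list[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    # Interpreted from the diagram: an Extractor has exactly one License.
    if not isinstance(record.get("license"), dict):
        problems.append("exactly one license is expected")
    # Interpreted from the diagram: one or more SupportedFileType entries.
    if not record.get("supported_filetypes"):
        problems.append("at least one supported_filetypes entry is expected")
    return problems
```

A real registry would more likely validate records against the published schema itself; this sketch only illustrates the shape of the data.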

Slots

Slot

Description

associated_file_extensions

A list of any known file extensions that files of this FileType are found
with. These may be used as a hint for FileType detection. Entries should omit
the leading '.', e.g., ‘json’ or ‘txt’.
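As an illustration of using extensions as a detection hint, the sketch below looks up candidate FileType ids for a path. The registry excerpt and the function name are hypothetical, and lower-casing the extension before comparison is an assumption that the slot description does not make.

```python
from pathlib import Path

# Hypothetical registry excerpt: FileType id -> associated_file_extensions.
HYPOTHETICAL_FILETYPES = {
    "example-json": ["json"],
    "example-text": ["txt", "dat"],
}


def candidate_filetypes(path, filetypes=HYPOTHETICAL_FILETYPES):
    """Return the sorted FileType ids whose extensions match the path's suffix."""
    # Strip the leading '.' so the value matches the convention of the slot.
    # Lower-casing is an assumption made for this sketch.
    ext = Path(path).suffix.lstrip(".").lower()
    return sorted(ft for ft, exts in filetypes.items() if ext and ext in exts)
```

Because several FileTypes may share an extension, the result is a list of candidates rather than a single answer.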

associated_instruments

A list of any instruments, or classes of instruments, that typically create the
data encoded into this FileType.

associated_software

A list of any known software (proprietary or otherwise) that produces files of
this FileType.

associated_standards

A list of any well-defined file format standards relevant to this FileType,
e.g., CIF or NeXus.

associated_vendors

A list of software or instrument vendors that can be associated with this
particular FileType.

base_formats

A list of any particular underlying generic formats which this FileType is
based on, e.g., CSV, JSON, HDF5, XML.

citations

A citation or citations for the entry, to be provided should it be used in
academic work (or otherwise).

command

A machine-executable command by which the Extractor functionality can be
accessed.

contributors

A list of the contributors to the resource.

creators

A list of the creators of the resource.

description

A human-readable outline of the entry, its format, data content and uses.

documentation

A URL or URI for any online documentation associated with this extractor.

id

A unique identifier for the entry within the MaRDA registry namespace. This
should be a shorthand label rather than a UUID. Only lower-case alphanumeric and
dash (“-”) characters are permitted.
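The character constraint on id can be checked with a regular expression. The sketch below is one possible reading: it accepts any non-empty string of lower-case ASCII letters, digits and dashes. Whether leading, trailing or repeated dashes should be rejected is not stated here, so the sketch allows them; the function name is hypothetical.

```python
import re

# Lower-case ASCII alphanumerics and dashes only, per the id slot description.
ID_PATTERN = re.compile(r"[a-z0-9-]+")


def is_valid_id(candidate: str) -> bool:
    """Return True if the whole string uses only the permitted characters."""
    return ID_PATTERN.fullmatch(candidate) is not None
```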

input_path

The location on disk of the resource (e.g., a file or directory) to be extracted.
Required.

input_type

The FileType of the input_path. Defaults to the FileType->id for each
supported file type.

installation

A machine-actionable set of installation instructions to obtain a working set-up
of the Extractor.

instructions

Any human-readable usage notes or installation instructions for this
Extractor. This field is intended for human use only and is not intended to be
machine-actionable. Please use the Extractor->installation and
Extractor->usage slots for that purpose.

license

A URL, URI or SPDX license identifier for a legal document giving official
permission to do something with the resource.

method

Usage invocation method, e.g. from a command line or from Python.

name

A recognisable name for the entry.

output_path

The location on disk where the output of the extraction will be written.
Defaults to the Extractor's own default.

output_type

The FileType of the output_path, for Extractors supporting multiple output
FileTypes. Defaults to the FileType->id for each supported output file type.

packages

A list of packages, including versioned packages or git+https:// targets, to
be installed using the Installer.

registered_extractors

A slot for an automatically-generated enumeration of Extractor IDs that
support this file type. This slot should be auto-populated by a registry.

requirements

Contents of a ‘requirements.txt’-like file. Will be installed by the selected
Installer using an appropriate method, e.g., pip install -r requirements.txt
for pip, or conda env create -f requirements.txt for conda.
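The mapping from installer to command described for requirements could be sketched as follows. The commands are the two quoted in the description; the method strings "pip" and "conda" are assumed to be the corresponding InstallerTypes values, and the function name is hypothetical.

```python
def install_command(method: str, requirements_file: str = "requirements.txt") -> list[str]:
    """Build the argument list that installs a requirements file."""
    if method == "pip":
        return ["pip", "install", "-r", requirements_file]
    if method == "conda":
        return ["conda", "env", "create", "-f", requirements_file]
    # Any other installer is outside what the slot description covers.
    raise ValueError(f"unsupported installer method: {method!r}")
```

Returning an argument list rather than a single string means the result can be passed to subprocess.run without shell quoting concerns.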

requires_python

A PEP 440 version string for the version constraints on the Python version
required for this extractor.

scope

Specification of extraction scope.

setup

Any necessary setup step for the ‘command’ to become available.

source_repository

A URL or URI for a source code repository associated with this extractor.

spdx

An SPDX License Identifier entry.

subject

Any keywords, phrases or classification codes that are relevant to the entry,
e.g., particular scientific domains of applicability, or experimental
techniques.

supported_filetypes

An enumeration of the FileTypes that an Extractor supports, matching
FileTypes present in the registry. The FileType->id slot can be passed to
the Extractor; see the Usage class.

supported_output_filetypes

An enumeration of the possible output formats of an Extractor. These should
match FileTypes present in the registry. They can be specified on extractor
execution using the templates described in the Extractor->Usage->command slot,
see the Usage class.

template

A mechanism for overriding the template values for this file type in the usage
instructions.

title

A name given to the resource [from DC].

type

Any bibliographic resource type (e.g., article, dataset, software) enumerated in
the DCMI Type Vocabulary.

uri

An unambiguous reference to the resource within a given context.

usage

Machine-actionable instructions for the usage of the Extractor. The described
usage pattern shall be available after the instructions specified in the
Extractor->installation slot have been followed.
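The interplay between command, the UsageTemplate slots (input_path, input_type, output_path, output_type) and the per-file-type template override can be sketched as a small string substitution routine. The double-brace placeholder syntax used here is an assumption, since this page does not define the template syntax, and both function names are hypothetical.

```python
import re

# Assumed placeholder syntax: {{ name }} with optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def template_values(filetype_id, input_path, overrides=None):
    """Collect substitution values for one supported file type."""
    # input_type defaults to the FileType id, as the input_type slot describes.
    values = {"input_path": input_path, "input_type": filetype_id}
    # A per-file-type template may override any of the defaults.
    values.update(overrides or {})
    return values


def render_command(command, values):
    """Replace every placeholder in the command with its value."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for template field {name!r}")
        return str(values[name])

    return PLACEHOLDER.sub(substitute, command)
```

For example, rendering "example-extract --type {{ input_type }} {{ input_path }}" with the values for a hypothetical "example-format" file type and the path "data/run1.dat" yields "example-extract --type example-format data/run1.dat".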

Enumerations

Enumeration

Description

InstallerTypes

This enumeration allows the Extractor->installation to specify which installer
software is to be used for the installation of the Extractor.

UsageScope

This enumeration specifies the scope of extraction performed by the Extractor.

UsageTypes

This enumeration specifies the execution environment for the Extractor.

Types

Type

Description

Boolean

A binary (true or false) value

Curie

A compact URI

Date

A date (year, month and day) in an idealized calendar

DateOrDatetime

Either a date or a datetime

Datetime

The combination of a date and time

Decimal

A real number with arbitrary precision that conforms to the xsd:decimal
specification

Double

A real number that conforms to the xsd:double specification

Float

A real number that conforms to the xsd:float specification

Integer

An integer

Jsonpath

A string encoding a JSON Path. The value of the string MUST conform to JSON
Path syntax and SHOULD dereference to zero or more valid objects within the
current instance document when encoded in tree form.

Jsonpointer

A string encoding a JSON Pointer. The value of the string MUST conform to JSON
Pointer syntax and SHOULD dereference to a valid object within the current
instance document when encoded in tree form.
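To illustrate what dereferencing a Jsonpointer value means, here is a minimal resolver sketch that follows the JSON Pointer rules of RFC 6901 for dictionaries and lists. It is illustrative only and not part of the schema tooling; the function name is hypothetical, and the special "-" array token from the RFC is not handled.

```python
def resolve_pointer(document, pointer):
    """Follow a JSON Pointer through nested dicts and lists."""
    if pointer == "":
        # The empty pointer refers to the whole document.
        return document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    current = document
    for raw in pointer[1:].split("/"):
        # Unescape '~1' before '~0', in the order RFC 6901 prescribes.
        token = raw.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]
    return current
```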

Ncname

Prefix part of CURIE

Nodeidentifier

A URI, CURIE or BNODE that represents a node in a model.

Objectidentifier

A URI or CURIE that represents an object in the model.

Sparqlpath

A string encoding a SPARQL Property Path. The value of the string MUST conform
to SPARQL syntax and SHOULD dereference to zero or more valid objects within the
current instance document when encoded as RDF.

String

A character string

Time

A time object represents a (local) time of day, independent of any particular
day

Uri

A complete URI

Uriorcurie

A URI or a CURIE