NLP_tools

Usage level: Beginner
Validation level: Experimental
Objective

NLP-tools, a natural language processing library built on top of spaCy, provides components that perform basic lexical, morphological, and syntactic processing of text corpora.

The available components perform:

  • Stemming [engine = stemmer], in French and English: reduces each word to its stem.
       Ex: "dissection"    ->    "dissect"

 

  • Part-of-speech tagging [engine = postagger] with lemmatization, in French and English: assigns a syntactic category to each word and produces its lemma.
      Ex: disséquera -> {"orth": "disséquera", "pos": "V", "lemma": "disséquer"}

 

  • Noun chunking, in English:
    • [engine = npchunker]  classic noun phrase extraction (constituency analysis)
    • [engine = npchunkerdp]  noun phrase extraction based on dependency analysis
      Ex:   "fleur_bleue", "muscle_strié_cardiaque",
             "production_intérieur_de_gaz_de_la_Russie"

 

  • Terminology recognition [engine = termmatcher] against the MX2015 resource (MX being the vocabulary of the Pascal database): recognized terms are marked with the MX_ prefix.
      Ex: " Non-local effects by homogenization or 3D–1D dimension reduction in elastic materials reinforced by stiff fibers.We first consider an elastic thin heterogeneous cylinder of radius of order ε: the interior of the cylinder is occupied by a stiff material (fiber) that is surrounded by a soft material "
         ->
         " Non-MX_local_effects by MX_homogenization or 3D–1D MX_dimension_reduction in MX_elastic_materials reinforced by stiff MX_fibers .We first consider an elastic thin heterogeneous MX_cylinder of MX_radius of MX_order ε: the interior of the MX_cylinder is occupied by a stiff MX_material (MX_fiber ) that is surrounded by a MX_soft_material "

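The annotated text returned by the term matcher can be post-processed to list the recognized terms. The sketch below assumes, from the sample output above only, that each recognized term carries the `MX_` prefix and that multi-word terms are joined with underscores; `extract_mx_terms` is a hypothetical helper, not part of NLP-tools.

```python
import re

# Assumption: recognized terms look like "MX_term" or "MX_multi_word_term",
# as in the sample output above.
MX_TERM = re.compile(r"MX_(\w+)")


def extract_mx_terms(annotated_text):
    """Return the recognized terms in order, with underscores turned back into spaces."""
    return [term.replace("_", " ") for term in MX_TERM.findall(annotated_text)]


sample = (
    "Non-MX_local_effects by MX_homogenization in MX_elastic_materials "
    "reinforced by stiff MX_fibers ."
)
print(extract_mx_terms(sample))
```

The list keeps duplicates and document order, so it can feed a frequency count or an index directly.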
 
Depending on the output parameter and the type of processing, the result is:

  • the text produced by transforming the original text (output=doc)
  • a more complete JSON structure (output=json) containing all the metadata produced by the analysis.

Web service URLs

    https://nlp-tools-2.services.istex.fr/v1/{langue}/{engine}/analyze?output={val} 
    • {langue}                       the language to analyze           [en, fr]
    • {engine}                      name of the processing pipeline to apply:
                     English:           [stemmer, postagger, npchunker, npchunkerdp]
                     French:            [stemmer, postagger]
    • parameters:
      output={val}                   result format           [doc, json]
      doc = the result is reinserted into the document
      json = the analysis result in JSON format
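The URL pattern above can be assembled programmatically. The following is a minimal sketch, not an official client: `build_url` is a hypothetical helper, and the engine sets are copied from this page (`termmatcher` is included for English because it appears in the list of routes).

```python
from urllib.parse import urlencode

BASE_URL = "https://nlp-tools-2.services.istex.fr/v1"

# Engines per language, as listed on this page.
ENGINES = {
    "en": {"stemmer", "postagger", "npchunker", "npchunkerdp", "termmatcher"},
    "fr": {"stemmer", "postagger"},
}
OUTPUTS = {"doc", "json"}


def build_url(language, engine, output="doc"):
    """Build the analyze URL for a language, an engine and an output format."""
    if language not in ENGINES:
        raise ValueError(f"unsupported language: {language!r}")
    if engine not in ENGINES[language]:
        raise ValueError(f"engine {engine!r} is not listed for language {language!r}")
    if output not in OUTPUTS:
        raise ValueError(f"unsupported output format: {output!r}")
    return f"{BASE_URL}/{language}/{engine}/analyze?{urlencode({'output': output})}"


print(build_url("en", "npchunker", "doc"))
```

Validating the combination before sending avoids requesting a route that is not listed, such as a French noun chunker.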

      List of routes

      Task                                     French                      English                        engine
      Stemming                                 /v1/fr/stemmer/analyze      /v1/en/stemmer/analyze         stemmer
      Part-of-speech tagging                   /v1/fr/postagger/analyze    /v1/en/postagger/analyze       postagger
      Controlled-term recognition              -                           /v1/en/termmatcher/analyze     termmatcher
      Noun chunking                            -                           /v1/en/npchunker/analyze       npchunker
      Noun chunking from dependency analysis   -                           /v1/en/npchunkerdp/analyze     npchunkerdp

      Return codes

      • 200 if OK
      • 404 if the service could not be reached
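A client can turn these codes into explicit errors. Only the two codes above are documented; the sketch below treats any other status as unexpected, which is an assumption, and `check_status` is a hypothetical helper name.

```python
def check_status(status_code):
    """Return normally on 200, raise a descriptive error otherwise."""
    if status_code == 200:
        return
    if status_code == 404:
        # Documented meaning on this page: the service could not be reached.
        raise ConnectionError("HTTP 404: the service could not be reached")
    # Assumption: any other code is undocumented and treated as a failure.
    raise RuntimeError(f"unexpected HTTP status: {status_code}")


check_status(200)  # passes silently
```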

      Linguistic analysis with spaCy

      For a better understanding of the analysis formats and mechanisms used in NLP-tools, refer to the spaCy documentation.

       

      Worked example
      Input format: a JSON array of objects, each with an "idt" identifier and a "value" field holding the text to analyze.

      Example query to the English chunker, doc output:

      • route:  /v1/en/npchunker/analyze
      • output format:  output=doc

      which gives the URL:  https://nlp-tools-2.services.istex.fr/v1/en/npchunker/analyze?output=doc

      
      cat <<EOF | curl --proxy "" -X POST --data-binary @- "https://nlp-tools-2.services.istex.fr/v1/en/npchunker/analyze?indent=true&output=doc"
      [{
      "idt":"08-0245642","value":"Random walk of passive tracers among randomly moving obstacles. Background: This study is mainly motivated by the need of understanding how the diffusion behaviour of a biomolecule (or even of a larger object) is affected by other moving macromolecules, organelles, and so on, inside a living cell, whence the possibility of understanding whether or not a randomly walking biomolecule is also subject to a long-range force field driving it to its target. Method: By means of the Continuous Time Random Walk (CTRW) technique the topic of random walk in random environment is here considered in the case of a passively diffusing particle in a crowded environment made of randomly moving and interacting obstacles. Results: The relevant physical quantity which is worked out is the diffusion cofficient of the passive tracer which is computed as a function of the average inter-obstacles distance. Coclusions: The results reported here suggest that if a biomolecule, let us call it a test molecule, moves towards its target in the presence of other independently interacting molecules, its motion can be considerably slowed down. Hence, if such a slowing down could compromise the efficiency of the task to be performed by the test molecule, some accelerating factor would be required. Intermolecular electrodynamic forces are good candidates as accelerating factors because they can act at a long distance in a medium like the cytosol despite its ionic strength."
      },{
      "idt":"08-040289","value":"Planck 2015 results. XIII. Cosmological parameters.We present results based on full-mission Planck observations of temperature and polarization anisotropies of the CMB. These data are consistent with the six-parameter inflationary LCDM cosmology. From the Planck temperature and lensing data, for this cosmology we find a Hubble constant, H0= (67.8 +/- 0.9) km/s/Mpc, a matter density parameter Omega_m = 0.308 +/- 0.012 and a scalar spectral index with n_s = 0.968 +/- 0.006. (We quote 68% errors on measured parameters and 95% limits on other parameters.) Combined with Planck temperature and lensing data, Planck LFI polarization measurements lead to a reionization optical depth of tau = 0.066 +/- 0.016. Combining Planck with other astrophysical data we find N_ eff = 3.15 +/- 0.23 for the effective number of relativistic degrees of freedom and the sum of neutrino masses is constrained to < 0.23 eV. Spatial curvature is found to be |Omega_K| < 0.005. For LCDM we find a limit on the tensor-to-scalar ratio of r <0.11 consistent with the B-mode constraints from an analysis of BICEP2, Keck Array, and Planck (BKP) data. Adding the BKP data leads to a tighter constraint of r < 0.09. We find no evidence for isocurvature perturbations or cosmic defects. The equation of state of dark energy is constrained to w = -1.006 +/- 0.045. Standard big bang nucleosynthesis predictions for the Planck LCDM cosmology are in excellent agreement with observations. We investigate annihilating dark matter and deviations from standard recombination, finding no evidence for new physics. The Planck results for base LCDM are in agreement with BAO data and with the JLA SNe sample. However the amplitude of the fluctuations is found to be higher than inferred from rich cluster counts and weak gravitational lensing. Apart from these tensions, the base LCDM cosmology provides an excellent description of the Planck CMB observations and many other astrophysical data sets."
      }]
      EOF
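The same request can be prepared from Python with the standard library. This is a sketch under assumptions: `build_request` and `analyze` are hypothetical helper names, the body is sent without an explicit JSON content type (as the curl call does), and the `--proxy ""` bypass is not reproduced.

```python
import json
import urllib.request

URL = "https://nlp-tools-2.services.istex.fr/v1/en/npchunker/analyze?indent=true&output=doc"


def build_request(records, url=URL):
    """Build a POST request whose body is the JSON array of idt/value records."""
    body = json.dumps(records, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def analyze(records, url=URL):
    """Send the records and return the decoded JSON response (needs network access)."""
    with urllib.request.urlopen(build_request(records, url)) as response:
        return json.load(response)


records = [
    {"idt": "08-0245642",
     "value": "Random walk of passive tracers among randomly moving obstacles."},
]
request = build_request(records)
print(request.get_method(), request.full_url)
```

Calling `analyze(records)` performs the actual request; it is kept separate from `build_request` so the payload can be inspected without contacting the service.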
      
      Result:
      [{
      "idt": "08-0245642",
      "value": "random_walk passive_tracer move_obstacle diffusion_behaviour live_cell walking_biomolecule range_force_field continuous_time_random time_random_walk random_walk random_environment diffuse_particle crowded_environment interact_obstacle relevant_physical_quantity diffusion_cofficient passive_tracer test_molecule interact_molecule test_molecule accelerate_factor intermolecular_electrodynamic_force good_candidate accelerate_factor long_distance ionic_strength"
      },
      {
      "idt": "08-040289",
      "value": "cosmological_parameter mission_planck_observation observation_of_temperature polarization_anisotropy inflationary_lcdm_cosmology planck_temperature lense_datum matter_density_parameter scalar_spectral_index measured_parameter planck_temperature lense_datum planck_lfi_polarization lfi_polarization_measurement optical_depth depth_of_tau combine_planck effective_number relativistic_degree degree_of_freedom neutrino_masse spatial_curvature scalar_ratio ratio_of_r mode_constraint keck_array bkp_datum constraint_of_r evidence_for_isocurvature isocurvature_perturbation cosmic_defect equation_of_state dark_energy standard_big_bang big_bang_nucleosynthesis bang_nucleosynthesis_prediction planck_lcdm_cosmology excellent_agreement agreement_with_observation annihilate_dark_matter standard_recombination new_physics planck_result result_for_base base_lcdm agreement_with_bao bao_datum rich_cluster_count base_lcdm_cosmology excellent_description planck_cmb_observation astrophysical_datum_set"
      }]
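In `output=doc` mode each record's "value" is a flat string of noun chunks, so it can be split and counted directly. The sketch below assumes, from the sample result above only, that chunks are separated by spaces and that the words inside a chunk are joined with underscores; `chunk_counts` is a hypothetical helper.

```python
from collections import Counter


def chunk_counts(record):
    """Count how often each noun chunk occurs in one output=doc record."""
    return Counter(record["value"].split())


# Shortened excerpt in the same shape as the first record of the result above.
record = {
    "idt": "08-0245642",
    "value": "random_walk passive_tracer move_obstacle random_walk "
             "passive_tracer test_molecule",
}
counts = chunk_counts(record)
print(counts.most_common(2))
```

Repeated chunks such as `random_walk` are kept in the service output, which is why a frequency count is possible without further analysis.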
      