Vitok-TEXT

"Vitok-TEXT" - special search system in the unstructured text data. The system is intended for accumulating, analysis and search in the unstructured text data, supporting a large number of source file formats and different means of data flow for processing. The system saves the accumulated information, including different attributes of source files, in the database optimized for quick search.

Area of application

The system can be applied for the search in the unstructured text data, accumulated by different organizations.

Data sources

The following systems can be used as the data sources:

  • File system (the folder in the disk);
  • DBMS MS SQL Server.

It is possible to create refillable sources that monitor processed and unprocessed files.

sheme_20_2_eng_CMYK_fl

File processing

The system extracts a text from a large amount of file formats:

  • MS Office: doc, docx, xls, xlsx, xlsm, ppt, pptx, pptm;
  • OpenOffice: odt, ods, sxw;
  • The rest: txt, rtf, pdf, html, mht, xml, eml, wpd.

The system processes password unprotected archives (also self-extracting ones) of rar, zip, gzip, tar, tgz, bz2 formats. It is possible to develop specialized parsers of the structured files for extracting and saving the satellite information.

Text encoding

Automatic recognition of the processed text coding. A wide range of encodings is supported:

  • Windows-125x family;
  • ISO-8859-x family;
  • UTF-x family;
  • KOI-8-x family;
  • IBM866.

Language of the text

Automatic recognition of the text language:

  • Slavic languages: Russian, Ukrainian, Belarusian, Polish, Bulgarian and others;
  • Languages of the European Union: English, French, German, Italian, Spanish and others;
  • Languages with specific scrip: Arabian, Persian, Hebrew and others.

Text-processing

The texts are analyzed morphologically in order to recognize the word's initial form. The morphological support is realized for all recognizable languages. The correction of spelling errors in the source text and search query is possible for the Russian language.

Object selection 

The objects are selected during text processing. The types of selected objects form three basic classes:

  • Template objects: telephone numbers, document numbers, vehicle identification numbers, etc. Conversion of different means of record of the same object is performed. (17.04.2014<->2014.04.17<->April 17, 2014).
  • Word objects: surnames, names, patronymics, address objects, traffic centers, makes and models of cars, etc.
  • Dates. Different types of methods of recording, full and not full (without year indication) dates.

Object dictionaries are refillable, including batch load from text files.

Categorization

Text categorization is performed to define the thematic focus of the text on the basis of directory of words and word combinations. The directories are refillable with the possibility of creation of new headings.

Classification

Classification of texts is performed: definition of the thematic focus area of the text with the help of the classifier based on the use of training texts. Statistical analysis of the texts selected for training is performed to define the criteria significant for the specified subject. It is possible to integrate the function of training text selection to the operator’s workplace.

Definition of the type of the document

The analysis of the text formatting is carried out (for file formats providing this option) to define the type of the document in accordance with the set of templates. Document template editor allows to specify the arrangement of text blocks on the page, occurrence of certain words, some features of formatting.

Search

  • The query language is developed to form the search queries. It supports logical operators "and", "or", "not", the operator of the distance between words and the operator of morphology deactivation. Common words as well the objects can be used as query elements.
  • The filters can be used for searching the attribute values. Examples of filters:
  1. time range;
  2. headings, subjects, type of the document;
  3. additional attributes of the source text.
  • The query result includes the fragment of the found text including the occurrence of matching words and also saved text attributes. The view of the full text of the document with multi-colored highlight of the words and objects with the possibility of navigation between them is available.

 The user interface

The system presents different means of realization of the user interface: web-interface, different variants of applications for installation at operators' workplaces, program interface.

Possibilities of integration 

Possibility to integrate the systems of textual analysis with the other products of the company on the basis of:

  • Data input for processing;
  • Using the results of search queries to solve the analytical tasks.

Brochure "Vitok-TEXT" (pdf)