Intensity - A fast and simple data wrangling tool

Intensity solves a problem faced by any enterprise trying to make unstructured and semi-structured data usable for analytics, ML, and RAG workflows.

Intensity is a pick-and-shovel infrastructure layer that:

- turns documents and raw feeds into clean, structured text and tokens
- is faster than typical open-source, Python, or JVM stacks
- keeps sensitive data local

The value propositions are:

- higher throughput per core
- lower latency to insight
- lower total compute and storage cost

Intensity provides reliable pre-processing that feeds vector stores, feature stores, and warehouses without adding brittle pipeline code.

It quickly wrangles data, providing reliability and privacy as enablers of AI adoption.

The Intensity API

The Intensity API provides five functions, which can be called from several languages, including C, C++, Python, R, Julia, Java, and Swift:

Alpha(Source File, Target File, Processing Script, Size of the Source File)
writes the processed source file to the target file.
Bravo(Source Buffer, Target File, Processing Script, Size of the Source File)
writes the processed source buffer to the target file.
Charlie(Source Buffer, Processing Script, Size of the Source File, Maximum Size of the Source Buffer)
provides back the processed source buffer.
Delta(Source Buffer, Processing Script Buffer, Size of the Source File, Maximum Size of the Source Buffer)
provides back the processed source buffer.
Echo(Source File, Target File, Processing Script Buffer, Size of the Source File)
writes the processed source file to the target file.

Each API call requires knowledge of the source document, the processing script, and, if output is written, the target file.
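
As a minimal sketch of gathering those inputs in Python before an Alpha-style call, the helper below checks that the source and script files exist and measures the source file. The names AlphaInputs and prepare_alpha_inputs are hypothetical, and the size is assumed to be in bytes. The call into Intensity itself is left out because the exact binding signature is not given here.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class AlphaInputs:
    """Hypothetical container for the four Alpha arguments listed above."""
    source_file: str
    target_file: str
    script_file: str
    source_size: int  # assumption: size of the source file in bytes


def prepare_alpha_inputs(source_file: str, target_file: str, script_file: str) -> AlphaInputs:
    """Check that the inputs exist and measure the source file."""
    for path in (source_file, script_file):
        if not os.path.isfile(path):
            raise FileNotFoundError(path)
    return AlphaInputs(source_file, target_file, script_file,
                       os.path.getsize(source_file))


# Demonstration with throwaway files; no Intensity call is made here.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "source.xml")
    scr = os.path.join(tmp, "script.txt")
    with open(src, "wb") as f:
        f.write(b"<a>hello</a>")
    with open(scr, "w") as f:
        # Assumption: a script holds one command per line.
        f.write("CleanXML\n")
    inputs = prepare_alpha_inputs(src, os.path.join(tmp, "target.xml"), scr)
    print(inputs.source_size)
```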

At various stages of its evolution, Intensity has been used to automate the transformation of:

- Extremely messy pseudo-XML data into clean, validated XML data.
- Email real estate leads into delimited field data imported into a relational database.
- Microsoft Office documents into text, then into words imported indirectly into a relational database through a microservice.
- Linux server logs into PCI DSS audit logs.
- Large exported HTML documents into delimited files, which were later imported into a spreadsheet.

Intensity Scripting Commands

The Intensity scripting language contains many commands that can be separated into groups:

Beautify
- BeautifyXML [1 of 2]
  - Format code, indented with tabs.
  - Syntax: BeautifyXML
- BeautifyXML [2 of 2]
  - Format code, indented with tabs and then remove the space between <End> and </End> and <A*> and </A> tags.
  - Syntax: BeautifyXML|FIX_END|
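
For readers who want to see what tab-indented output looks like, here is a rough Python equivalent built on the standard library's xml.dom.minidom. It is only an illustration of the idea, not Intensity's implementation: minidom requires well-formed XML and adds an XML declaration, and the function name beautify_xml is hypothetical.

```python
from xml.dom import minidom


def beautify_xml(xml_text: str) -> str:
    """Re-indent well-formed XML with one tab per nesting level."""
    return minidom.parseString(xml_text).toprettyxml(indent="\t")


pretty = beautify_xml("<a><b>x</b></a>")
print(pretty)
```
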
Change
- ChangeTag
  - After locating tag1 and then tag2 in sequence, replace the chosen field (1 selects tag1, any other value selects tag2) with replacement.
  - Syntax: ChangeTag|1 for tag1|tag1|tag2|replacement
- ChangeWrappedString
  - Find signature, scan for the opening and closing double quotes, then replace the text between the quotes with replacement.
  - Syntax: ChangeWrappedString|signature|replacement
Clean
- CleanXML
  - Remove tabs, carriage returns, line feeds, and hidden characters when writing final output. Note that CleanXML has a higher priority than BeautifyXML: CleanXML will be used even if both BeautifyXML and CleanXML are declared in the script file.
  - Syntax: CleanXML
Conceal
- ConcealBlankTags
  - Hide passed tag sequence if there is nothing between the tags.
  - Syntax: ConcealBlankTags|tag_open|tag_close|
- ConcealSpecialTags
  - Hide tags and the data between them, matching tags even when carriage returns or line feeds appear between their elements.
  - Syntax: ConcealSpecialTags|tag_open|tag_contains|tag_close|insert_to_start_of_buffer
Confirm
- ConfirmField
  - Validate that a field exists and is set as a name => value pair in PHP array format, verifying the leading and trailing characters so that it is in the correct format to be loaded and processed by another PHP script.
  - Syntax: ConfirmField|String|
Correct
- CorrectQP
  - Repair quoted-printable 7-bit email encoding.
  - Syntax: CorrectQP
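
To illustrate what quoted-printable text looks like before and after decoding, the sketch below uses Python's standard quopri module. It assumes that "repair" here amounts to decoding the =XX escapes and joining soft line breaks. Intensity's own handling may differ, and decode_qp is a hypothetical name.

```python
import quopri


def decode_qp(raw: bytes) -> bytes:
    """Decode =XX escapes and join lines split by a trailing '='."""
    return quopri.decodestring(raw)


# "=C3=A9" is the UTF-8 encoding of "é"; "=\n" is a soft line break.
sample = b"caf=C3=A9 =\nau lait"
decoded = decode_qp(sample)
print(decoded.decode("utf-8"))
```
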
Eliminate
- EliminateBinary
  - Force deletion of the passed binary data where the second string of three is one, two or three 8-bit binary values represented in hexadecimal format. Each hex value must take the form of ‘0xFF’ where ‘0x’ is the hex prefix and ‘FF’ can be any hex value from ‘00’ to ‘FF’. Up to three hex values can be passed in the format of ‘0xFF0xFF0xFF’.
  - Syntax: EliminateBinary|tag_open|hex_binary|tag_close|
- EliminateBytes
  - Delete all occurrences of 8-bit binary values represented in hexadecimal format. Each hex value must take the form of ‘XX’ where ‘XX’ is any hex value from ‘00’ to ‘FF’.
  - Syntax: EliminateBytes|pairs_of_hex_values|
- EliminateContent
  - Delete data that starts with from and ends with to, where to is retained if 0 is passed.
  - Syntax: EliminateContent|from|to|0|
- EliminateContentAll
  - Delete all data that starts with from and ends with to, where to is retained if 0 is passed.
  - Syntax: EliminateContentAll|from|to|0
- EliminateField
  - Delete FieldNumber (range 0-many) after Begin and before End by counting occurrences of FieldDelimiter.
  - Syntax: EliminateField|Begin|End|FieldNumber|FieldDelimiter|
- EliminateFirstLine
  - Eliminate the first line if it matches ToFind.
  - Syntax: EliminateFirstLine|ToFind|
- EliminateFirstToEnd
  - Locate Open, scan forward to First, then eliminate until the first occurrence of End.
  - Syntax: EliminateFirstToEnd|Open|First|End|
- EliminateForward
  - Eliminate all data that starts with Begin, is followed by Next, and ends with a line feed (0x0A).
  - Syntax: EliminateForward|Begin|Next|
- EliminateFromTo
  - Eliminate all occurrences of Begin to End where End must occur after Begin.
  - Syntax: EliminateFromTo|Begin|End|
- EliminateLFs
  - Removes line terminators.
  - Syntax: EliminateLFs
- EliminateLines
  - Eliminate every Line-Feed (0x0A) terminated line that contains Begin.
  - Syntax: EliminateLines|Begin|
- EliminateOnLine
  - Eliminate all data on the same line that starts with Begin and ends with End.
  - Syntax: EliminateOnLine|Begin|End|
- EliminatePattern
  - Eliminate all occurrences of Pattern with a trailing Signature.
  - Syntax: EliminatePattern|Pattern|Signature|
- EliminateSpan
  - Eliminate all data that starts with Begin and ends with End.
  - Syntax: EliminateSpan|Begin|End|
- EliminateString
  - Delete all occurrences of passed string from buffer.
  - Syntax: EliminateString|string_to_delete|
- EliminateTag
  - Delete data that starts with tag_open and, if not passed, ends with '>'.
  - Syntax: EliminateTag|tag_open|
- EliminateTag2
  - Locate an exact match to tag_open, scan for an exact match to tag_next_to_delete, and then delete tag_next_to_delete. Note that this is potentially dangerous: tag_open and tag_next_to_delete could be separated in context, resulting in invalid data deletion. It is also of limited use because it can leave behind a mess of deleted opening tags with leftover closing tags.
  - Syntax: EliminateTag2|tag_open|tag_next_to_delete|
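
The behaviour of a few of the simpler Eliminate commands can be approximated in Python as follows. This is a sketch based only on the descriptions above, not Intensity's code, and the function names are hypothetical stand-ins for EliminateBytes, EliminateString, EliminateLines, and EliminateSpan.

```python
import re


def eliminate_bytes(buf: bytes, hex_pairs: str) -> bytes:
    """Drop every byte whose value appears in hex_pairs, e.g. '000D'."""
    unwanted = bytes.fromhex(hex_pairs)
    return bytes(b for b in buf if b not in unwanted)


def eliminate_string(buf: bytes, needle: bytes) -> bytes:
    """Delete all occurrences of needle from the buffer."""
    return buf.replace(needle, b"")


def eliminate_lines(buf: bytes, begin: bytes) -> bytes:
    """Delete every LF-terminated line that contains begin."""
    pattern = rb"[^\n]*" + re.escape(begin) + rb"[^\n]*\n"
    return re.sub(pattern, b"", buf)


def eliminate_span(buf: bytes, begin: bytes, end: bytes) -> bytes:
    """Delete every span running from begin to the nearest end, inclusive."""
    pattern = re.escape(begin) + rb".*?" + re.escape(end)
    return re.sub(pattern, b"", buf, flags=re.DOTALL)


print(eliminate_bytes(b"a\x00b\rc", "000D"))
print(eliminate_lines(b"keep\nDEBUG x\nkeep2\n", b"DEBUG"))
print(eliminate_span(b"x<!-- a -->y<!-- b -->z", b"<!--", b"-->"))
```
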
Preserve
- PreserveMemory
  - Preserve file memory buffer to a file. This command is useful when creating a new processing script because it can write the file buffer at any stage of processing.
  - Syntax: PreserveMemory
Provisional
- ProvisionalUpdate
  - If tag_find is not found, insert it before tag_after.
  - Syntax: ProvisionalUpdate|tag_find|tag_after|
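
A minimal Python sketch of this conditional insert is shown below. The function name is hypothetical. Leaving the buffer untouched when tag_after is also missing, and inserting before only the first tag_after, are assumptions the description does not settle.

```python
def provisional_update(text: str, tag_find: str, tag_after: str) -> str:
    """Insert tag_find before tag_after, but only if tag_find is absent."""
    if tag_find in text:
        return text
    position = text.find(tag_after)
    if position == -1:
        # Assumption: nothing to do when tag_after is absent too.
        return text
    return text[:position] + tag_find + text[position:]


print(provisional_update("<doc><body/></doc>", "<head/>", "<body/>"))
```
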
Put
- PutBetweenTags
  - Locate an exact match to tag_open, scan for an exact match to tag_next, and then insert tag_to_insert_between_open_and_next between tag_open and tag_next.
  - Syntax: PutBetweenTags|tag_open|tag_next|tag_to_insert_between_open_and_next|
- PutBinaryPostfix
  - Append passed string with Adobe InDesign-specific binary line feed data. See the notes in PutBinaryPrefix.
  - Syntax: PutBinaryPostfix|tag|
- PutBinaryPrefix
  - Prepend passed string with Adobe InDesign-specific binary line feed data. Note that the binary data is embedded to force InDesign to drop line feeds after tag closure. This is only needed when the InDesign tag formatting does not specifically call for a line feed to be dropped after a tag closes. It would be best to avoid using PutBinaryPrefix and PutBinaryPostfix by handling all line feeds through tag formatting within InDesign.
  - Syntax: PutBinaryPrefix|tag|
- PutField
  - Put Insert at FieldNumber (range 0-n) after Begin and before End counting occurrences of FieldDelimiter.
  - Syntax: PutField|Begin|End|Insert|FieldNumber|FieldDelimiter|
- PutPostfix
  - Insert a string at the end of the file memory buffer.
  - Syntax: PutPostfix|String|
- PutPostfixLine
  - Put Add before each Line Feed (0x0A).
  - Syntax: PutPostfixLine|Add|
- PutPrefix
  - Insert a string at the start of the file memory buffer.
  - Syntax: PutPrefix|String|
- PutPrefixField
  - Prefix FieldNumber (range 0-n) with Prefix that starts with DelimiterBegin and ends with DelimiterEnd, replacing delimiters with ReplaceBegin and ReplaceEnd on each line.
  - Syntax: PutPrefixField|Prefix|DelimiterBegin|DelimiterEnd|ReplaceBegin|ReplaceEnd|LineMax|
- PutPrefixLine
  - Prefix each line in the buffer with a concatenation of Prefix + Delimiter.
  - Syntax: PutPrefixLine|Prefix|Delimiter|
- PutString
  - Put a concatenation of Field + Delimiter + Filename + Delimiter before each instance of Tag.
  - Syntax: PutString|Tag|Field|Delimiter|Filename|
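
The four buffer-level and line-level Put commands are simple enough to approximate in a few lines of Python. The sketch below follows the descriptions of PutPrefix, PutPostfix, PutPostfixLine, and PutPrefixLine; the function names are hypothetical and the code is not taken from Intensity.

```python
def put_prefix(text: str, string: str) -> str:
    """Insert string at the start of the buffer."""
    return string + text


def put_postfix(text: str, string: str) -> str:
    """Insert string at the end of the buffer."""
    return text + string


def put_postfix_line(text: str, add: str) -> str:
    """Put add before each line feed (0x0A)."""
    return text.replace("\n", add + "\n")


def put_prefix_line(text: str, prefix: str, delimiter: str) -> str:
    """Prefix each line with prefix + delimiter."""
    lines = text.splitlines(keepends=True)
    return "".join(prefix + delimiter + line for line in lines)


rows = "a,1\nb,2\n"
print(put_prefix_line(rows, "file1", ","))
print(put_postfix_line(rows, ";"))
```
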
Reduce
- ReduceLineTerminators
  - Reduce all extraneous line feeds.
  - Syntax: ReduceLineTerminators
- ReduceSpaces
  - Reduce all extraneous spaces.
  - Syntax: ReduceSpaces
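
The descriptions do not define "extraneous", so the Python sketch below assumes it means any run of two or more spaces or line feeds, which is collapsed to a single one. The function names are hypothetical.

```python
import re


def reduce_spaces(text: str) -> str:
    """Collapse each run of spaces to a single space (assumed meaning)."""
    return re.sub(r" {2,}", " ", text)


def reduce_line_terminators(text: str) -> str:
    """Collapse each run of line feeds to a single line feed (assumed meaning)."""
    return re.sub(r"\n{2,}", "\n", text)


print(reduce_spaces("a   b  c"))
print(repr(reduce_line_terminators("a\n\n\nb\n")))
```
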
Remove
- RemoveBetween
  - Remove data between Start and End.
  - Syntax: RemoveBetween|Start|End|
- RemoveWithout
  - If Find is not found in the memory buffer then replace all memory buffer content with Replace.
  - Syntax: RemoveWithout|Find|Replace|
- RemoveWrapper
  - Purge <tag_open> and </tag_open> if located on tag_level and followed by tag_after at tag_level+1.
  - Syntax: RemoveWrapper|tag_level|tag_open|tag_after|
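
As an illustration of RemoveBetween and RemoveWithout, here is a short Python sketch. It assumes that RemoveBetween keeps the Start and End markers themselves and applies to every occurrence; neither point is stated above. The function names are hypothetical.

```python
import re


def remove_between(text: str, start: str, end: str) -> str:
    """Remove the data between start and end, keeping both markers (assumption)."""
    pattern = re.escape(start) + r".*?" + re.escape(end)
    return re.sub(pattern, lambda m: start + end, text, flags=re.DOTALL)


def remove_without(text: str, find: str, replace: str) -> str:
    """Replace the whole buffer with replace when find is absent."""
    return text if find in text else replace


print(remove_between("<p>secret</p>", "<p>", "</p>"))
print(remove_without("no marker here", "<lead>", ""))
```
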
Set
- SetClosingTag
  - Locates tag_open and tag_close when they are both positioned at the same tag level, and then replaces tag_close with new_tag_close.
  - Syntax: SetClosingTag|tag_open|tag_close|new_tag_close|
- SetFieldDelimiter
  - Set the field delimiter within curly brackets.
  - Syntax: SetFieldDelimiter{delchar}
Swap
- SwapAtNestedLevel
  - Substitute tag_open located at tag_level with new_tag_open and then replace the matching closing tag with new_tag_close. Note that if before is passed then it has to exist before tag_open for the changes to be made.
  - Syntax: SwapAtNestedLevel|tag_level|before|tag_open|new_tag_open|new_tag_close|
- SwapNested
  - Complex search and substitution for data nested from two to three tag levels.
  - Syntax: SwapNested|sig_tag_root|sig_nested_1_tag|sig_nested_2_tag|sig_tag_close|replace_open|replace_close|
- SwapNext
  - Change the first occurrence of from to to, scanning from the start of the file buffer.
  - Syntax: SwapNext|from|to|
- SwapOutward
  - Search for primary opening and closing tags. If found, search backward and forward for secondary tags. If found, perform substitution. Why? Because some tags are so generic that SwapNested fails.
  - Syntax: SwapOutward|tag_open|tag_close|previous_tag_open|previous_tag_close|replace_open|replace_close|0=Do not extract text, 1=Extract Text|
- SwapStrings
  - Change all occurrences of from to to.
  - Syntax: SwapStrings|from|to|
- SwapTags
  - Swap sig_tag_open and sig_tag_close with replace_open and replace_close, keeping the data between.
  - Syntax: SwapTags|sig_tag_open|sig_tag_close|replace_open|replace_close|
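
The three non-nested Swap commands map closely onto ordinary string operations. The Python sketch below approximates SwapStrings, SwapNext, and SwapTags from their descriptions. The function names are hypothetical, and the level-aware commands (SwapAtNestedLevel, SwapNested, SwapOutward) are not attempted.

```python
import re


def swap_strings(text: str, old: str, new: str) -> str:
    """Change all occurrences of old to new."""
    return text.replace(old, new)


def swap_next(text: str, old: str, new: str) -> str:
    """Change only the first occurrence of old, scanning from the start."""
    return text.replace(old, new, 1)


def swap_tags(text: str, sig_open: str, sig_close: str,
              replace_open: str, replace_close: str) -> str:
    """Swap an opening and closing pair, keeping the data between them."""
    pattern = re.escape(sig_open) + r"(.*?)" + re.escape(sig_close)
    return re.sub(
        pattern,
        lambda m: replace_open + m.group(1) + replace_close,
        text,
        flags=re.DOTALL,
    )


print(swap_tags("<b>hi</b> <b>yo</b>", "<b>", "</b>", "<strong>", "</strong>"))
```
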
Transfer
- TransferBlock
  - Locate <tag_open> and </tag_open> at tag_level_from.
    - If before is populated, determine whether it precedes <tag_open> one level before.
    - Do not make any changes if it does not.
    - Extract the data between <tag_open> and </tag_open>,
      hide <tag_open>data</tag_open>,
      move down until the tag level equals tag_level_to,
      then start a new block using the passed parameters,
      making sure to include the extracted data.
    - Note: tag_open does not have to have a leading '<'.
  - Syntax: TransferBlock|tag_level_from|tag_level_to|before|tag_open|replace_open|replace_close|
Transform
- TransformLFs
  - Change Line Feed (0x0A) and Carriage Return (0x0D) ASCII values to "^LF^" and "^CR^".
  - Syntax: TransformLFs
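
In Python terms, the substitution described for TransformLFs amounts to two replacements, as in this small sketch. The function name is hypothetical.

```python
def transform_lfs(text: str) -> str:
    """Make line terminators visible as ^LF^ and ^CR^ markers."""
    return text.replace("\n", "^LF^").replace("\r", "^CR^")


print(transform_lfs("a\r\nb"))
```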

Creation of Intensity Processing Scripts

Processing scripts can be created and tested through the Intensity API or by running the Intensity application from the command line, as shown below:

Intensity Script Creation Flags

./ServiceIntensity -test_source path/source_file \
    -test_target path/target_file \
    -test_script path/script_file

These flags are used to test a processing script while creating or refining it.

Load the source, script, and target files into a text editor.

Change the script, then run the command again to update the target for review.

Most editors detect that the target file has changed and offer to reload it for review.
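
When this edit-and-rerun loop is driven from Python, the command line can be assembled as an argument list. The sketch below uses only the three flags shown above. The helper name build_test_command is hypothetical, and the paths are the same placeholders as in the example command.

```python
import shlex


def build_test_command(source_file: str, script_file: str, target_file: str,
                       executable: str = "./ServiceIntensity") -> list:
    """Assemble the test invocation as an argument list."""
    return [
        executable,
        "-test_source", source_file,
        "-test_target", target_file,
        "-test_script", script_file,
    ]


cmd = build_test_command("path/source_file", "path/script_file", "path/target_file")
print(shlex.join(cmd))
# To run it for real: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when paths contain spaces.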

Reach out for further information

Richard Evers, CEO/Founder, revers@midnightblue.ca
Waterloo, ON, Canada