Sidra's integration runtimes using OpenJDK¶
This page clarifies some concepts about integration runtimes and gives a quick guide to which cost-free Java runtime environment can be used, and how.
Java Runtime is needed¶
Sidra stores ingested data as Parquet files, the format that works best with Databricks. The IR agent cannot natively serialize or deserialize Parquet; it relies on Java libraries for that. The IR agent installation kit does not include a Java Runtime, so one must be installed manually on the node after installing the IR agent.
This is easily done with the Java installer, which sets up everything necessary.
Mind the licensing¶
Recently, Oracle changed the licensing terms for the Java Runtime Environment (JRE): fees may apply in production environments, so using the JRE may not be free of charge.
Fortunately, Oracle contributes to an open-source version, OpenJDK, that can be used instead. See the install instructions.
Microsoft, committed to protecting its customers from licensing claims, also publishes its own build of OpenJDK, with an .MSI installer: the Microsoft Build of OpenJDK.
PATH and Registry keys¶
Unfortunately, the OpenJDK builds above, including Microsoft's OpenJDK installer with its commitment to respecting licenses, do not set the Registry keys that the Integration Runtime agent needs in order to call the JRE when processing Parquet files.
According to the troubleshooting article here, the IR agent:
- Needs to be the same bitness (64-bit) as the JRE; hence, both will be installed under C:\Program Files\ (not C:\Program Files (x86)\).
- Checks for the installed version of the JRE under HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Runtime Environment, value CurrentVersion.
- Retrieves the location of the JRE from HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Runtime Environment\<versionNumber>, value JavaHome.
- Locates the bin\server folder in the path retrieved.
- Loads the Java Virtual Machine library, jvm.dll; if it is not present, the bin\client folder is probed for the same DLL.
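The lookup order above can be sketched in a few lines of Python. This is only an illustration of the steps as listed, not the IR agent's actual code: the registry dictionary and the file_exists callable are hypothetical stand-ins for the Windows Registry and the file system, so that the logic can be followed (and run) on any machine.

```python
from pathlib import PureWindowsPath

JRE_KEY = r"HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Runtime Environment"


def resolve_jvm_dll(registry, file_exists):
    """Mimic the lookup order described in the list above.

    registry    -- dict mapping a key path to a dict of its values
                   (hypothetical stand-in for the Windows Registry)
    file_exists -- callable taking a path string and returning True/False
                   (hypothetical stand-in for a file system check)
    Returns the jvm.dll path that would be loaded, or None.
    """
    # Step 1: read CurrentVersion under the JRE key.
    version = registry.get(JRE_KEY, {}).get("CurrentVersion")
    if version is None:
        return None

    # Step 2: read JavaHome under the version-specific subkey.
    java_home = registry.get(JRE_KEY + "\\" + version, {}).get("JavaHome")
    if java_home is None:
        return None

    # Steps 3 and 4: probe bin\server first, then bin\client.
    for folder in ("server", "client"):
        candidate = str(PureWindowsPath(java_home) / "bin" / folder / "jvm.dll")
        if file_exists(candidate):
            return candidate
    return None
```

On a real node, the two stand-ins could be replaced by reads through Python's Windows-only winreg module and by os.path.exists; a result of None would point at the step that is missing.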
In addition, it is advisable to set the environment variables for the JRE:
- The variable JAVA_HOME should point to the installation folder; for example: C:\Program Files\Microsoft\jdk-17.0.3.7-hotspot\
- The variable PATH should include the path to the JRE binaries; for example: C:\Program Files\Microsoft\jdk-17.0.3.7-hotspot\bin
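A quick way to verify these two variables is a small check such as the one below. It is a minimal sketch: check_java_env is a name made up for this example, and it only confirms that PATH contains the bin folder under JAVA_HOME, nothing more.

```python
from pathlib import PureWindowsPath


def check_java_env(env):
    """Return a list of problems found in JAVA_HOME and PATH (empty if none).

    env -- any mapping of environment variables, for example os.environ
    """
    java_home = env.get("JAVA_HOME", "").strip()
    if not java_home:
        return ["JAVA_HOME is not set"]

    # PureWindowsPath compares case-insensitively and ignores a trailing
    # backslash, which matches how Windows treats these paths.
    expected_bin = PureWindowsPath(java_home) / "bin"
    entries = [
        PureWindowsPath(entry.strip())
        for entry in env.get("PATH", "").split(";")
        if entry.strip()
    ]

    problems = []
    if expected_bin not in entries:
        problems.append(f"PATH does not include {expected_bin}")
    return problems
```

On the IR node it could be called as check_java_env(os.environ) after importing os; an empty list means both variables look consistent.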
Example¶
When Oracle's Java 8 is installed, the installer adds Registry keys under:
HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft
Exporting that key produces a text file with the .REG extension, similar to the one below. Oracle's Java 8 may then be uninstalled.
On Windows, the default installation location for Microsoft's build of OpenJDK is:
C:\Program Files\Microsoft\jdk-17.0.3.7-hotspot\
After adapting the corresponding keys in the Registry file, that is, updating the version and location of the JRE and disregarding the browser plugin entries, the file looks like this:
Windows Registry Editor Version 5.00
[HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft]
[HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Plug-in]
[HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Plug-in\11.333.2]
"JavaHome"="C:\\Program Files\\Microsoft\\jdk-17.0.3.7-hotspot"
"UseJava2IExplorer"=dword:00000001
"UseNewJavaPlugin"=dword:00000001
[HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Runtime Environment]
"CurrentVersion"="1.17"
"BrowserJavaVersion"="11.333.2"
[HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Runtime Environment\1.17]
"RuntimeLib"="C:\\Program Files\\Microsoft\\jdk-17.0.3.7-hotspot\\bin\\server\\jvm.dll"
"JavaHome"="C:\\Program Files\\Microsoft\\jdk-17.0.3.7-hotspot"
"MicroVersion"="0"
[HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Runtime Environment\1.17.0.3.7-hotspot]
"JavaHome"="C:\\Program Files\\Microsoft\\jdk-17.0.3.7-hotspot"
"MicroVersion"="0"
"RuntimeLib"="C:\\Program Files\\Microsoft\\jdk-17.0.3.7-hotspot\\bin\\server\\jvm.dll"
[HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Runtime Environment\1.17.0.3.7-hotspot\MSI]
"INSTALLDIR"="C:\\Program Files\\Microsoft\\jdk-17.0.3.7-hotspot\\"
"JU"=""
"OEMUPDATE"=""
"FROMVERSION"="NA"
"FROMVERSIONFULL"=""
"PRODUCTVERSION"="17.0.3.7-hotspot"
"EULA"=""
"JAVAUPDATE"="1"
"AUTOUPDATECHECK"="1"
"AUTOUPDATEDELAY"=""
"FullVersion"="17.0.3.7-hotspot"
[HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Update]
[HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Update\Policy]
"Country"="ES"
"PostStatusUrl"="https://sjremetrics.java.com/b/ss//6"
[HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Web Start]
"CurrentVersion"="11.333.2"
[HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Web Start\1.0.1]
"Home"="C:\\Program Files\\Microsoft\\jdk-17.0.3.7-hotspot\\bin"
[HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Web Start\1.0.1_02]
"Home"="C:\\Program Files\\Microsoft\\jdk-17.0.3.7-hotspot\\bin"
[HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Web Start\1.0.1_03]
"Home"="C:\\Program Files\\Microsoft\\jdk-17.0.3.7-hotspot\\bin"
[HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Web Start\1.0.1_04]
"Home"="C:\\Program Files\\Microsoft\\jdk-17.0.3.7-hotspot\\bin"
[HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Web Start\1.2]
"Home"="C:\\Program Files\\Microsoft\\jdk-17.0.3.7-hotspot\\bin"
[HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Web Start\1.2.0_01]
"Home"="C:\\Program Files\\Microsoft\\jdk-17.0.3.7-hotspot\\bin"
[HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Web Start\11.333.2]
"Home"="C:\\Program Files\\Microsoft\\jdk-17.0.3.7-hotspot\\bin"
[HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Web Start Caps]
[HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Web Start Caps\11.333.2]
"JNLPProtocol"=dword:00000001
"JNLPAssociation2"=dword:00000001
Visual C++ runtime¶
The Java Virtual Machine of the JRE, <jre-path>\bin\server\jvm.dll, depends on the Visual C++ runtime libraries, as illustrated below:
# Using Visual Studio Build Tools, location:
# C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\bin\Hostx64\x64\
.\dumpbin.exe /DEPENDENTS "C:\Program Files\Microsoft\jdk-17.0.3.7-hotspot\bin\server\jvm.dll"
...
Image has the following dependencies:
...
VCRUNTIME140.dll
VCRUNTIME140_1.dll
...
Process Monitor traces show that C:\Windows\System32\vcruntime140_1.dll is being loaded (among others); if the Visual C++ runtime libraries are missing, then the jvm.dll of the JRE cannot be loaded either.
For the specific version of Microsoft's build of OpenJDK above, installing version 17 from the Microsoft Visual C++ Redistributable downloads page and then applying the Registry keys above resulted in a working Data Factory Integration Runtime node, able to process Parquet files.
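The Registry entries can also be generated instead of edited by hand. The sketch below builds only the values that the lookup described earlier reads (CurrentVersion and JavaHome), plus RuntimeLib as it appears in the example file. Whether such a reduced file is sufficient for the IR agent is an assumption we have not verified; the full file shown in the example is what was actually tested. The function name build_reg_file is made up for this illustration.

```python
def build_reg_file(java_home, version):
    """Build the text of a reduced .REG file for the given JRE location.

    java_home -- installation folder of the JRE/JDK
    version   -- version label to store in CurrentVersion, e.g. "1.17"
    """
    key = r"HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Runtime Environment"
    home = java_home.rstrip("\\")
    # String values in a .REG file need every backslash doubled.
    escaped = home.replace("\\", "\\\\")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        f'"CurrentVersion"="{version}"',
        "",
        f"[{key}\\{version}]",
        f'"JavaHome"="{escaped}"',
        f'"RuntimeLib"="{escaped}\\\\bin\\\\server\\\\jvm.dll"',
        "",
    ]
    return "\r\n".join(lines)
```

For example, build_reg_file("C:\\Program Files\\Microsoft\\jdk-17.0.3.7-hotspot\\", "1.17") returns text that can be saved with the .REG extension and reviewed before importing it.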
Avoid OutOfMemoryError¶
When handling large amounts of data in Parquet files, the Java-based libraries may leave the free space in the Java heap highly fragmented. Ultimately, this causes Data Factory activities involving Parquet files to fail, with an error message containing something like:
java.lang.OutOfMemoryError:Java heap space
To help avoid such failures, we can tell the Java runtime to use a larger memory space for its heap. Do so by adding a system-wide environment variable with the minimum and maximum heap sizes, using one of these names:
_JAVA_OPTIONS = -Xms512m -Xmx16g
JAVA_TOOL_OPTIONS = -Xms512m -Xmx16g
According to the Microsoft documentation, the _JAVA_OPTIONS variable should be added, and it worked in the tests we performed. However, according to comments in this post, the supported environment variable name is JAVA_TOOL_OPTIONS.
The flag Xms specifies the initial memory allocation pool for a Java Virtual Machine (JVM), while Xmx specifies the maximum memory allocation pool. This means that the JVM will be started with Xms amount of memory and will be able to use a maximum of Xmx amount of memory. By default, the JRE uses a minimum of 64 MB and a maximum of 1 GB.
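To make the units concrete, the following sketch converts the Xms and Xmx values of an options string into bytes. It is an illustration only: parse_heap_options is a hypothetical helper, and it handles just the k, m and g suffixes (in either case), with a bare number taken as bytes.

```python
import re

# Multipliers for the size suffixes handled by this sketch.
UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_heap_options(options):
    """Return the -Xms / -Xmx values found in an options string, in bytes."""
    sizes = {}
    pattern = r"-(Xms|Xmx)(\d+)([kKmMgG]?)(?!\S)"
    for flag, number, unit in re.findall(pattern, options):
        sizes[flag] = int(number) * UNITS[unit.lower()]
    return sizes
```

With the value used above, parse_heap_options("-Xms512m -Xmx16g") gives 536870912 bytes for Xms and 17179869184 bytes for Xmx, which are the numbers to look for when checking the settings afterwards.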
After setting the Java heap memory limits, reboot the machine. The memory settings can then be checked by looking for the values MinHeapSize/InitialHeapSize and MaxHeapSize/SoftMaxHeapSize in the output of:
java -XX:+PrintFlagsFinal -version
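Since that command prints several hundred flags, a small filter helps. The sketch below picks the four heap-size values out of the output. It assumes the usual layout of one flag per line (type, name, = or :=, value); both parse_heap_flags and read_heap_flags are names made up for this example.

```python
import re
import subprocess

HEAP_FLAGS = ("MinHeapSize", "InitialHeapSize", "MaxHeapSize", "SoftMaxHeapSize")


def parse_heap_flags(output):
    """Extract the heap-size flags (in bytes) from -XX:+PrintFlagsFinal output."""
    found = {}
    for line in output.splitlines():
        # Expected shape of a line: "<type> <FlagName> = <value> {...}"
        # (some JVM versions print ":=" for values that were changed).
        match = re.match(r"\s*\S+\s+(\w+)\s*:?=\s*(\d+)\b", line)
        if match and match.group(1) in HEAP_FLAGS:
            found[match.group(1)] = int(match.group(2))
    return found


def read_heap_flags(java="java"):
    """Run the command shown above and return the parsed heap flags.

    Requires the java executable to be reachable through PATH.
    """
    result = subprocess.run(
        [java, "-XX:+PrintFlagsFinal", "-version"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_heap_flags(result.stdout)
```

Calling read_heap_flags() on the node after the reboot returns a dictionary of the flags that the installed JVM reports; not every JVM version prints all four names, so missing keys are not necessarily an error.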