Topcoder has been working with a client in the oil and gas exploration industry to develop a tool to assign a particular name (and some other data elements) to a log file. In previous challenges, Quartz Energy - LNAM Prediction Data Science Challenge and LNAM Prediction by Curve Attribute and Vendor Name, our community developed an application that parses and analyzes the log files, which are in LAS text format.
As you can see from the challenge names, we have made some previous attempts to set up a working classification engine. The currently implemented random forest model, which looks broadly across all the data elements in the files, classifies unseen data correctly about 70% of the time. Unfortunately, this doesn’t quite meet the threshold for deployment/usability. This classification problem is trickier than it first appears.
The LAS files contain many attributes, but human experts refer only to the curve information attributes and the SRVC element of the log files to make their LNAM assignments. The curve information attributes each refer to a specific instrument that collects data as it travels into a wellbore. The actual data for each tool is recorded in the bottom section of each LAS file, but this numeric data doesn’t concern us for classification purposes -- only the presence (or absence) of certain tools.
There’s no definitive list of tools associated with each LNAM. This list needs to be derived from the training data.
The tool mnemonics (the curve information attributes) are (in most cases) not exclusive to a particular LNAM.
There are thousands of unique tools and/or tool names with small variations. This leads to a classic curse of dimensionality problem -- as the number of log files increases, the number of potential features expands as well.
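One way to keep the feature count in check is to collapse near-duplicate mnemonics and drop very rare ones. The sketch below is only an illustration under stated assumptions: the normalization rules (uppercase, trim, strip a trailing numeric index) and the `min_files` cutoff are hypothetical and are not taken from the existing app.

```python
import re
from collections import Counter


def normalize_mnemonic(raw):
    """Collapse a raw mnemonic to a canonical key (assumed rules, not the app's)."""
    key = raw.strip().upper()
    # Assumption: a trailing run of digits is a sector/sensor index (AMPS1 -> AMPS).
    stripped = re.sub(r"\d+$", "", key)
    # Fall back to the untrimmed key if the mnemonic was digits only.
    return stripped or key


def build_vocabulary(files, min_files=2):
    """Keep only normalized mnemonics that occur in at least `min_files` files."""
    counts = Counter()
    for mnemonics in files:
        # Count each normalized mnemonic once per file.
        counts.update({normalize_mnemonic(m) for m in mnemonics})
    return sorted(key for key, n in counts.items() if n >= min_files)


# Illustrative input: one list of mnemonics per LAS file.
files = [
    ["DEPT", "AMPS1", "AMPS2", "GR"],
    ["dept", "AMPS3", "GR ", "TT3FT"],
    ["DEPT", "LSPD"],
]
vocab = build_vocabulary(files)
print(vocab)
```

Whether merging indexed curves such as AMPS1 through AMPS8 helps or hurts classification would need to be checked against the training data.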
For this challenge, we’re going to take a slightly different approach:
We’re going to provide a set of LAS files with only the curve attributes and SRVC element in place.
Our app can generate reports which display the probability of each of the potential tools associated with each LNAM in a training data set. You can also see the breakdown of the LNAMs themselves across a data set.
We’re going to use a larger data set which is more representative of the general population of LAS files to be processed.
Here are the requirements for the current challenge:
1. Improve the current classification engine for LNAM prediction based on curve information attributes and the SRVC element. The current application has a “curve-predict” command which makes a prediction based on those elements, but it can be improved. The data for this challenge contains only those elements in any case. More information about the app's existing curve analysis functionality is given in the Background section below. Using that curve analysis to improve the classification functionality is optional.
2. Update the output of the curve-predict command to the following format: File Name, UWI, LNAM, Service Company, Log Type, Cased Hole Flag, Generic Toolstring. The logic for cleaning the Service Company (SRVC) name and for generating Log Type, Cased Hole Flag, and Generic Toolstring is already available elsewhere in the application. We just need to incorporate this functionality into the curve-predict output. This will allow users of the app to run the prediction-analysis tasks on the final output.
3. Update the process-curves command so that it only includes information about curve information attributes. Right now the reports it produces include ALL the data elements in a LAS file.
4. Update the prediction-accuracy command to produce precision, recall, and F1 scores as well as accuracy estimates for a given data set.
5. Update the prediction-analysis command to generate precision, recall, and F1 scores by class (LNAM).
6. Please document the updates you’ve made to improve #1 and provide a detailed description of your implemented algorithm.
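As a rough illustration of the requested curve-predict output, the sketch below writes rows with the seven columns listed above using Python's standard csv module. The function name `write_predictions`, the header spelling "Cased Hole Flag", and the sample row values are assumptions made for this example; they are not the app's actual code or data.

```python
import csv
import io

# Column order taken from the requirement; exact header spelling is an assumption.
COLUMNS = [
    "File Name", "UWI", "LNAM", "Service Company",
    "Log Type", "Cased Hole Flag", "Generic Toolstring",
]


def write_predictions(rows, handle):
    """Write prediction rows as CSV; missing fields are left blank."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow({col: row.get(col, "") for col in COLUMNS})


# Placeholder row; none of these values come from real data.
rows = [{
    "File Name": "example_001.las",
    "UWI": "0000000000",
    "LNAM": "CBL",
    "Service Company": "Example Logging Co",
    "Log Type": "Wireline",
}]

buffer = io.StringIO()
write_predictions(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

In the real command, the last four columns would be filled by the cleaning and generation logic that already exists elsewhere in the application.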
75% of the scoring will be based on the rank of your submission by F1 score. Your primary goal in this challenge is to maximize your F1 score on our testing data.
25% will be on satisfying the functional requirements outlined above.
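The precision, recall, and F1 reporting requested for prediction-accuracy and prediction-analysis can be sketched in plain Python as follows. This is a minimal example, not the scoring code used for the challenge: the function names are hypothetical, and macro averaging of F1 is an assumption because the statement does not say how per-class scores are combined.

```python
from collections import Counter


def per_class_scores(truth, predicted):
    """Return precision, recall, and F1 for every LNAM seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for actual, guess in zip(truth, predicted):
        if actual == guess:
            tp[actual] += 1
        else:
            fp[guess] += 1   # predicted this class, but it was wrong
            fn[actual] += 1  # missed the true class
    scores = {}
    for label in sorted(set(truth) | set(predicted)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def accuracy(truth, predicted):
    """Fraction of files whose predicted LNAM matches the ground truth."""
    return sum(a == g for a, g in zip(truth, predicted)) / len(truth)


def macro_f1(scores):
    """Unweighted mean of per-class F1 (averaging method is an assumption)."""
    return sum(s["f1"] for s in scores.values()) / len(scores)


# Illustrative labels only.
truth = ["CBL", "CBL", "CCL", "MudLog"]
predicted = ["CBL", "CCL", "CCL", "MudLog"]
scores = per_class_scores(truth, predicted)
print(accuracy(truth, predicted), macro_f1(scores))
```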
Well Logs are divided into the following basic types:
Wireline - Wellbore evaluation that happens asynchronously with drilling activities.
Logging While Drilling (LWD) - provides real-time feedback during wellbore and casing evaluations.
Surface Logging - MudLogs - examination of drilling output such as natural gas discharge, drilling speed and drilling mud contents to understand underlying geology.
Logging Operators (designated by the SRVC attribute in the files) often specialize in one or two of the different categories. Unfortunately, there is some overlap between tools used for Wireline and LWD operations. Tools originally available only for Wireline operations have been added to LWD toolkits.
The Log Names (LNAM) convey the functional purpose of the logs or a set of tools contained in the log file. For example:
MudLog - MudLog
CBL - Cement Bond Log
CCL - Casing Collar Locator
The curve information section of the logging files looks like the following:
~CURVE INFORMATION SECTION
DEPT .F 00 001 00 00 : 1 DEPTH
AMP3FT.MV 00 01 : 2 AMP3FT AMPLITUDE
AMPAVG. 00 01 : 3 AVERAGE SECTOR AMPLITUDE
AMPMAX. 00 01 : 4 MAXIMUM SECTOR AMPLITUDE
AMPMIN. 00 01 : 5 MINIMUM SECTOR AMPLITUDE
AMPS1 . 00 01 : 6 AMPS1 AMPLITUDE
AMPS2 . 00 01 : 7 AMPS2 AMPLITUDE
AMPS3 . 00 01 : 8 AMPS3 AMPLITUDE
AMPS4 . 00 01 : 9 AMPS4 AMPLITUDE
AMPS5 . 00 01 : 10 AMPS5 AMPLITUDE
AMPS6 . 00 01 : 11 AMPS6 AMPLITUDE
AMPS7 . 00 01 : 12 AMPS7 AMPLITUDE
AMPS8 . 00 01 : 13 AMPS8 AMPLITUDE
GR .GAPI 00 01 : 14 GAMMA RAY
LSPD .F/MN 00 01 : 15 LINE SPEED
LTEN .LB 00 01 : 16 SURFACE LINE TENSION
TEMP . 00 01 : 17 DOWNHOLE TEMPERATURE
TT3FT .US 00 01 : 18 3FT TRAVEL TIME
Of course, the actual list of curve information attributes will vary from file to file.
A little information about the format of this section of the files:
DEPT = short version/code for the instrument name. In industry terms, this is the mnemonic for the tool.
.F = unit of measurement, in this case feet. Other units in this example: MV = millivolts, GAPI = gamma ray API units, F/MN = feet per minute, LB = pounds, US = microseconds (μs).
The next four columns (00 001 00 00 :) are not important.
1 = the column index of the instrument data
DEPTH = a description of the field in English text. DEPTH is of course not a tool but simply the header for the depth field. All instrument values are recorded at a certain depth to reveal details about the underlying geology or about the health of the wellbore itself at a certain distance from the mouth of the wellbore.
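The field layout described above can be parsed with a small regular expression. The sketch below is an assumption-laden illustration rather than the app's parser: the names `CURVE_LINE`, `parse_curve_line`, and `tool_mnemonics` are hypothetical, and it assumes every curve line carries a column index right after the colon.

```python
import re

# mnemonic, dot, optional unit, ignored codes, colon, column index, description
CURVE_LINE = re.compile(
    r"^\s*(?P<mnemonic>[^.\s]+)\s*"
    r"\.(?P<unit>\S*)"
    r"\s+(?P<codes>[^:]*):"
    r"\s*(?P<index>\d+)?\s*(?P<description>.*)$"
)


def parse_curve_line(line):
    """Split one curve information line into its fields, or return None."""
    match = CURVE_LINE.match(line)
    if match is None:
        return None  # e.g. the "~CURVE INFORMATION SECTION" header
    index = match.group("index")
    return {
        "mnemonic": match.group("mnemonic"),
        "unit": match.group("unit") or None,
        "column": int(index) if index else None,
        "description": match.group("description").strip(),
    }


def tool_mnemonics(lines, skip=("DEPT",)):
    """Collect mnemonics, skipping DEPT because depth is not a tool."""
    found = []
    for line in lines:
        parsed = parse_curve_line(line)
        if parsed and parsed["mnemonic"] not in skip:
            found.append(parsed["mnemonic"])
    return found


section = [
    "~CURVE INFORMATION SECTION",
    "DEPT .F 00 001 00 00 : 1 DEPTH",
    "AMP3FT.MV 00 01 : 2 AMP3FT AMPLITUDE",
    "AMPAVG. 00 01 : 3 AVERAGE SECTOR AMPLITUDE",
    "GR .GAPI 00 01 : 14 GAMMA RAY",
]
tools = tool_mnemonics(section)
print(tools)
```

Only the mnemonic matters for presence/absence features; the unit and description are kept here because they may help when reconciling name variants.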
The Schlumberger Oilfield Glossary is extremely helpful for defining industry terms.
Our current app has several features that are useful:
1. It parses and loads a set of LAS file attributes into a pandas dataframe.
2. It cleans and normalizes the LAS files in preparation for transformation to pandas. The app generates a clean copy of the data into a new folder.
3. It produces a testing.csv which extracts the LNAM attribute from a directory of LAS files.
4. It can compare the output of the prediction process (e.g., prediction.csv) to the testing.csv (the ground truth) and generate an accuracy score. The process can also generate more detailed classification reports by LNAM.
5. It one-hot encodes the curve mnemonics in the dataframe.
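For readers unfamiliar with one-hot encoding, the sketch below shows the idea on curve mnemonics using plain Python. It is not the app's implementation; the function name `one_hot_rows`, the `file_name` key, and the sample files are assumptions for illustration. The resulting list of dictionaries is in a shape that `pandas.DataFrame` accepts directly.

```python
def one_hot_rows(files):
    """Turn {file name: mnemonics} into rows of 0/1 presence flags."""
    # The vocabulary is every mnemonic seen in any file, in a stable order.
    vocabulary = sorted({m for mnemonics in files.values() for m in mnemonics})
    rows = []
    for name, mnemonics in files.items():
        present = set(mnemonics)
        row = {"file_name": name}
        # 1 if the tool appears in this file, 0 otherwise.
        row.update({m: int(m in present) for m in vocabulary})
        rows.append(row)
    return vocabulary, rows


# Illustrative input only.
files = {
    "a.las": ["GR", "TT3FT"],
    "b.las": ["GR", "LSPD", "LTEN"],
}
vocabulary, rows = one_hot_rows(files)
print(vocabulary)
print(rows[0])
```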
The app also has the following Curve Information Analysis features which were implemented in the LNAM Prediction by Curve Attribute and Vendor Name challenge:
1. aggregate-curves: the app analyzes a data set of LAS files and groups the files by LNAM for further analysis
2. process-curves: this command generates two kinds of report. The first breaks down the data set by LNAM, showing the absolute number and relative frequency of each LNAM in the data set. The second, generated once per LNAM, shows the probability of each data element in the LAS files being included with that LNAM. There is a probability threshold for inclusion in the output reports. The recommended setting is .01 so you can see all the mnemonics that might be associated with a particular LNAM.
3. process-srvc: this command generates reports which break down LNAMs by logging operator. We’re also providing some client-provided information about the Logging Operators in separate data files.
4. curve-predict: generates a csv file which predicts LNAM based on the inputs from the process-curves and process-srvc output. Current output is only UWI and LNAM.
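The two process-curves reports described above can be approximated as follows. This is a sketch under assumptions, not the app's code: it reads "probability" as the fraction of files with a given LNAM that contain the mnemonic, the function names are hypothetical, and the LNAM-to-mnemonic pairings in the sample records are invented for illustration.

```python
from collections import Counter, defaultdict


def lnam_breakdown(records):
    """Absolute count and relative frequency of each LNAM in the data set."""
    counts = Counter(lnam for lnam, _ in records)
    total = sum(counts.values())
    return {lnam: (n, n / total) for lnam, n in counts.items()}


def mnemonic_probabilities(records, threshold=0.01):
    """Per LNAM, the share of its files containing each mnemonic."""
    files_per_lnam = Counter()
    hits = defaultdict(Counter)
    for lnam, mnemonics in records:
        files_per_lnam[lnam] += 1
        hits[lnam].update(set(mnemonics))  # count each mnemonic once per file
    report = {}
    for lnam, counter in hits.items():
        n = files_per_lnam[lnam]
        # Drop mnemonics below the inclusion threshold.
        report[lnam] = {m: c / n for m, c in counter.items() if c / n >= threshold}
    return report


# Each record is (LNAM, mnemonics found in one file); values are illustrative.
records = [
    ("CBL", ["GR", "TT3FT", "AMPAVG"]),
    ("CBL", ["GR", "TT3FT"]),
    ("CCL", ["GR", "LSPD"]),
]
breakdown = lnam_breakdown(records)
report = mnemonic_probabilities(records)
strict = mnemonic_probabilities(records, threshold=0.6)
print(breakdown)
print(report)
```

With the recommended threshold of .01 almost every observed mnemonic is retained; raising it, as in `strict`, keeps only the tools that appear in most files for an LNAM.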
The following directory structure is useful for hosting all the reporting output that the app generates. We’ll be posting this analytical output for the training data, but if you want to generate some of it yourself on a subset of the data, use this structure: