
Challenge Overview

Topcoder has been working with a client in the oil and gas exploration industry to develop a tool that assigns a particular name (and some other data elements) to a log file. In two previous challenges, Quartz Energy - LNAM Prediction Data Science Challenge and LNAM Prediction by Curve Attribute and Vendor Name, our community developed an application that parses and analyzes log files in LAS text format.

As you can see from the challenge names, we have made some previous attempts to set up a working classification engine. The currently implemented random forest model, which looks broadly across all the data elements in the files, is correct about 70% of the time on unseen data. Unfortunately, this doesn’t quite meet the threshold for deployment/usability. This classification problem is trickier than it first appears.

The LAS files contain many attributes, but human experts refer only to the curve information attributes and the SRVC element of the log files to make their LNAM assignments. The curve information attributes each refer to a specific instrument that collects data as it travels into a wellbore. The actual data for each tool is recorded in the bottom section of each LAS file, but this numeric data doesn’t concern us for classification purposes -- only the presence (or absence) of certain tools.
There’s no definitive list of the tools associated with each LNAM. This list needs to be derived from the training data.
The tool mnemonics (the curve information attributes) are (in most cases) not exclusive to a particular LNAM.
There are thousands of unique tools and/or tool names with small variations. This leads to a classic curse of dimensionality problem -- as the number of log files increases, the number of potential features expands as well.

For this challenge, we’re going to take a slightly different approach:

We’re going to provide a set of LAS files with only the curve attributes and SRVC element in place.
Our app can generate reports which display, for each LNAM in a training data set, the probability of each potential tool being associated with it. You can also see the breakdown of the LNAMs themselves across a data set.
We’re going to use a larger data set which is more representative of the general population of LAS files to be processed.

Here are the requirements for the current challenge:

1. Improve the current classification engine for LNAM prediction based on curve information attributes and the SRVC element. The current application has a “curve-predict” command which makes a prediction based on those elements, but it can be improved. The data provided for this challenge contains only those elements in any case. More information about the curve analysis functionality currently provided in the app is in the Background section below. Using the provided curve analysis to improve the classification functionality is optional.
2. Update the output of the curve-predict command to the following format: File Name, UWI, LNAM, Service Company, Log Type, Cased Holed Flag, Generic Toolstring. The logic for cleaning the Service Company (SRVC) name and generating Log Type, Cased Hole Flags and Generic Toolstring is already available elsewhere in the application. We just need to incorporate this functionality into the curve-predict output. This will allow users of the app to run the prediction-analysis tasks on the final output.
3. Update the process-curves command so that it only includes information about curve information attributes. Right now the reports it produces include ALL the data elements in a LAS file.
4. Update the prediction-accuracy command to produce precision, recall, and F1 scores as well as accuracy estimates for a given data set.
5. Update the prediction-analysis command to generate precision, recall, and F1 scores by class (LNAM).
6. Please document the updates you’ve made to improve #1 and provide a detailed description of your implemented algorithm.
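As a rough illustration of the per-class metrics requested for prediction-accuracy and prediction-analysis, here is a minimal sketch using only the Python standard library. The function names and the toy labels are hypothetical and not taken from the existing codebase. The challenge text does not say whether the ranking F1 is macro-, micro-, or weighted-averaged, so the macro average shown here is only an assumption.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # the true class was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        f_den = precision + recall
        f1 = 2 * precision * recall / f_den if f_den else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 values (an assumed averaging)."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Toy labels, invented for illustration only.
truth = ["CBL", "CBL", "MudLog", "CCL"]
pred = ["CBL", "CCL", "MudLog", "CCL"]
scores = per_class_scores(truth, pred)
for label, (p, r, f1) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"macro F1 = {macro_f1(scores):.2f}")
```

In practice the two input lists would come from the LNAM columns of the ground-truth and prediction CSV files, matched by file name or UWI.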

Scoring

75% of the scoring will be based on the rank of your submission by F1 score. Your primary goal in this challenge is to maximize your F1 score on our testing data.

25% will be on satisfying the functional requirements outlined above.

Background

Well Logs are divided into the following basic types:

Wireline - Wellbore evaluation that happens asynchronously with drilling activities.
Logging While Drilling (LWD) - Provides real-time feedback during wellbore and casing evaluations.
Surface Logging - MudLogs - examination of drilling output such as natural gas discharge, drilling speed and drilling mud contents to understand underlying geology.

Logging Operators (designated by the SRVC attribute in the files) often specialize in one or two of the different categories. Unfortunately, there is some overlap between tools used for Wireline and LWD operations. Tools originally available only for Wireline operations have been added to LWD toolkits.

The Log Names (LNAM) convey the functional purpose of the logs or a set of tools contained in the log file. For example:

MudLog - MudLog

CBL - Cement Bond Log

CCL - Casing Collar Locator

The curve information section of the logging files looks like the following:

~CURVE INFORMATION SECTION

DEPT .F 00 001 00 00 : 1 DEPTH

AMP3FT.MV 00 01 : 2 AMP3FT AMPLITUDE

AMPAVG. 00 01 : 3 AVERAGE SECTOR AMPLITUDE

AMPMAX. 00 01 : 4 MAXIMUM SECTOR AMPLITUDE

AMPMIN. 00 01 : 5 MINIMUM SECTOR AMPLITUDE

AMPS1 . 00 01 : 6 AMPS1 AMPLITUDE

AMPS2 . 00 01 : 7 AMPS2 AMPLITUDE

AMPS3 . 00 01 : 8 AMPS3 AMPLITUDE

AMPS4 . 00 01 : 9 AMPS4 AMPLITUDE

AMPS5 . 00 01 : 10 AMPS5 AMPLITUDE

AMPS6 . 00 01 : 11 AMPS6 AMPLITUDE

AMPS7 . 00 01 : 12 AMPS7 AMPLITUDE

AMPS8 . 00 01 : 13 AMPS8 AMPLITUDE

GR .GAPI 00 01 : 14 GAMMA RAY

LSPD .F/MN 00 01 : 15 LINE SPEED

LTEN .LB 00 01 : 16 SURFACE LINE TENSION

TEMP . 00 01 : 17 DOWNHOLE TEMPERATURE

TT3FT .US 00 01 : 18 3FT TRAVEL TIME

Of course, the actual list of curve information attributes will vary from file to file.

A little information about the format of this section of the files:

DEPT = short version/code for the instrument name. In industry terms, this is the mnemonic for the tool.

.F = unit of measurement, in this case feet. Other units in this example: MV = millivolts, GAPI = Gamma Ray API units, F/MN = feet per minute, LB = pounds, US = microseconds.

Next 4 columns are not important (00 001 00 00 :)

1 = the column index of the instrument data

DEPTH = a description of the field in English text. DEPTH is of course not a tool but simply the header for the depth field. All instrument values are recorded at a certain depth to reveal details about the underlying geology or about the health of the wellbore itself at a certain distance from the mouth of the wellbore.
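The field layout described above can be sketched as a small parser. This is a minimal illustration, not the app's actual parsing code: the regular expression and the `parse_curve_line` name are hypothetical, and it assumes every curve line follows the layout of the sample (mnemonic, period, optional unit, ignored code columns, colon, column index, description).

```python
import re

# Assumed layout: MNEMONIC .UNIT <codes> : <index> <description>
CURVE_LINE = re.compile(
    r"^\s*(?P<mnemonic>[^.\s]+)\s*\."   # mnemonic, up to the first period
    r"(?P<unit>\S*)"                     # unit right after the period (may be empty)
    r"\s*(?P<codes>[^:]*):"              # code columns, not used for classification
    r"\s*(?P<index>\d+)?\s*(?P<description>.*)$"
)


def parse_curve_line(line):
    """Parse one curve information line; return None for headers and blanks."""
    line = line.strip()
    if not line or line.startswith(("~", "#")):
        return None
    match = CURVE_LINE.match(line)
    if match is None:
        return None
    index = match.group("index")
    return {
        "mnemonic": match.group("mnemonic"),
        "unit": match.group("unit") or None,
        "index": int(index) if index else None,
        "description": match.group("description").strip(),
    }


SAMPLE = """~CURVE INFORMATION SECTION
DEPT .F 00 001 00 00 : 1 DEPTH
AMPAVG. 00 01 : 3 AVERAGE SECTOR AMPLITUDE
GR .GAPI 00 01 : 14 GAMMA RAY
TT3FT .US 00 01 : 18 3FT TRAVEL TIME
"""

curves = [c for c in map(parse_curve_line, SAMPLE.splitlines()) if c]
# Only the presence of each mnemonic matters for classification.
mnemonics = {c["mnemonic"] for c in curves}
print(sorted(mnemonics))
```

Real LAS files vary more than this sample does, so a production parser would need to handle lines that deviate from the assumed layout.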

The Schlumberger Oilfield Glossary is extremely helpful for defining industry terms.

Our current app has several features that are useful:

1. It parses and loads a set of LAS file attributes into a pandas dataframe.

2. It cleans and normalizes the LAS files in preparation for transformation to pandas. The app generates a clean copy of the data into a new folder.

3. It produces a testing.csv which extracts the LNAM attribute from a directory of LAS files.

4. It can compare the output of the prediction process (e.g., prediction.csv) to the testing.csv (the ground truth) and generate an accuracy score. The process can also generate more detailed classification reports by LNAM.

5. Curve mnemonics are One Hot Encoded in the dataframe.
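To make the one-hot encoding in item 5 concrete, here is a minimal sketch. The app does this in a pandas dataframe; the version below uses only the standard library, and the `one_hot` function, the `ignore` parameter, and the file names are hypothetical. Dropping DEPT is an assumption based on the note that DEPTH is a header rather than a tool.

```python
def one_hot(files, ignore=("DEPT",)):
    """Encode {file_name: set_of_mnemonics} as presence/absence rows.

    Returns (columns, rows): columns is the sorted list of mnemonics and
    rows maps each file name to a list of 0/1 flags in column order.
    """
    ignored = set(ignore)
    columns = sorted(set().union(*files.values()) - ignored)
    rows = {
        name: [1 if column in mnemonics else 0 for column in columns]
        for name, mnemonics in files.items()
    }
    return columns, rows


# Toy input, invented for illustration only.
files = {
    "a.las": {"DEPT", "GR", "AMPAVG"},
    "b.las": {"DEPT", "GR", "TT3FT"},
}
columns, rows = one_hot(files)
print(columns)
for name, flags in rows.items():
    print(name, flags)
```

Each distinct mnemonic becomes one column, which is why the feature count grows as more files with new or slightly different tool names are added.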

The app also has the following Curve Information Analysis features which were implemented in the LNAM Prediction by Curve Attribute and Vendor Name challenge:

1. aggregate-curves: the app analyzes a data set of LAS files and groups the files by LNAM for further analysis

2. process-curves: this command generates two kinds of report. The first is a breakdown of the data set by LNAM, showing the absolute number and relative frequency of each LNAM in the data set. The second is a report for each LNAM showing the probability of each data element in the LAS files being included with that LNAM. There is a probability threshold for inclusion in the output reports. The recommended setting is .01 so you can see all the mnemonics that might be associated with a particular LNAM.

3. process-srvc: this command generates reports which break down LNAMs by logging operator. We’re also providing some client-provided information about the Logging Operators in separate data files.

4. curve-predict: generates a csv file which predicts LNAM based on the inputs from the process-curves and process-srvc output. Current output is only UWI and LNAM.
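The kind of report process-curves produces can be approximated as follows. This is a sketch of one plausible reading, in which the "probability" is the fraction of files with a given LNAM that contain a given mnemonic. The function names and toy records are hypothetical, and the app's actual computation may differ.

```python
from collections import Counter, defaultdict


def lnam_breakdown(records):
    """Return {lnam: (count, relative_frequency)} for the data set."""
    counts = Counter(lnam for lnam, _ in records)
    total = sum(counts.values())
    return {lnam: (n, n / total) for lnam, n in counts.items()}


def mnemonic_probabilities(records, threshold=0.01):
    """Return {lnam: {mnemonic: share of that LNAM's files containing it}}.

    Mnemonics whose share falls below `threshold` are left out of the report.
    """
    files_per_lnam = Counter()
    hits = defaultdict(Counter)
    for lnam, mnemonics in records:
        files_per_lnam[lnam] += 1
        hits[lnam].update(set(mnemonics))  # count each mnemonic once per file
    report = {}
    for lnam, total in files_per_lnam.items():
        probabilities = {m: n / total for m, n in hits[lnam].items()}
        report[lnam] = {m: p for m, p in probabilities.items() if p >= threshold}
    return report


# Toy records of (LNAM, mnemonics in the file), invented for illustration.
records = [
    ("CBL", {"GR", "AMPAVG", "TT3FT"}),
    ("CBL", {"GR", "AMPAVG"}),
    ("CCL", {"GR", "LTEN"}),
]
report = mnemonic_probabilities(records)
for lnam, probabilities in report.items():
    print(lnam, dict(sorted(probabilities.items())))
print(lnam_breakdown(records))
```

A low threshold such as the recommended .01 keeps rare mnemonics in the report, while a higher one leaves only the tools that appear in most files of a given LNAM.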

The following directory structure is helpful to host all the reporting output that the app generates. We’ll be posting this analytical output for the training data. But if you want to generate some of it yourself on a subset of the data this directory structure is helpful:

Final Submission Guidelines

1. A submission.csv file which contains the following columns for the testing data: File Name, UWI, LNAM, Service Company, Log Type, Cased Holed Flag, Generic Toolstring. The full requirements for this file are described above.
2. Your complete solution code. Please use the existing codebase provided in the forums as a starting point. You’re welcome to make any modifications necessary to improve "curve-predict" classification results.
3. Please update documentation accordingly to explain in detail the changes you made to improve the algorithm. Analytical insights are appreciated.
4. Update the requirements.txt file with any new dependencies that you introduce.

LNAM Prediction by Curve Attribute and Vendor Name - Part 2

ID: 30091925