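Machine Learning Makes Light Work of Tedious Fracturing Data - 石油圈

Abstract: Machine learning can accurately and efficiently identify the flags on fracturing treatment plots, helping engineers process tedious fracturing data.

Compiled by TOM 影子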

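During hydraulic fracturing operations, pumping data is recorded once per second, transmitted in the field, and saved in CSV format. The raw pumping data contains multiple channels, including pressure, pump rate, clean volume, and proppant concentration. After the data is collected, a field engineer manually picks the event start and end times, breakdown pressure, instantaneous shut-in pressure (ISIP), and diverter drop on the plots. The manual process is time-consuming and error-prone, and because interpretation methods differ across the industry, the results are less reliable. To address this challenge, a Denver-based oilfield technology company is using machine learning to identify the flags in large numbers of fracturing plots accurately and efficiently.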

Method

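The goal is to automate event picking by training an algorithm on a large and diverse data set. The data set is labeled with the most common interpretations, and the algorithm is not explicitly programmed with picking rules. To this end, cloud-based software collects the raw data (.csv files) and standardizes naming conventions and units, providing a clean data set and an efficient way to visualize the treating plots for optimal stage selection.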

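To solve this problem, the team uses binary classification to build a model that distinguishes points, each labeled 0 or 1. The accompanying figure shows how the method is applied: pumping data that belongs to the stage pumping time is labeled 1, and data that does not is labeled 0. The data consist of features (independent variables) that the model uses to predict the labels (dependent variable), and they are split into three sets: training, validation, and test. The training algorithm builds a model that uses the patterns and trends found in the training data to classify (label) points accurately.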

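The validation data set has the same format as the training data set but fewer stages, and it provides an unbiased evaluation of the various models built on the training data. The 0/1 predictions generated on the validation set are used to build a confusion matrix and to compute accuracy, precision, and recall. High accuracy, precision, and recall are the basis of accurate predictions. The test data set is then used to evaluate the final model (and procedure) for predicting the final flag locations. It has the same features as the training and validation sets, but it is labeled with the correct flag locations rather than with the binary classification column.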

Start/End Flags

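Marking the start and end flags of a stage is critical because these boundaries govern the summary calculations for that stage. Besides defining cumulative volumes and pumping time, the flagged interval selects the appropriate data when computing average pressure, maximum pressure, rates, and concentrations. For this process, the team uses treating pressure, slurry rate, and clean volume as the initial features to train a logistic regression classifier. The pumping data set consists of 179 stages, for a total of 1,530,445 rows per variable. Of the data, 66% is used to train the model, 8% to validate it, and the remaining 26% to test it. All three sets include a variety of stages: some clean “textbook” stages with obvious start and end times, and other “messy” stages that begin or end with unusual behavior, which makes the start and end times harder to locate.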

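Start and end times are determined by taking the difference between consecutive binary labels. A positive difference (the label switches from 0 to 1) marks a start time; a negative difference (the label switches from 1 to 0) marks an end time. This combination of prediction and post-processing (differencing) generates a list of matched start and end times. Each matched pair is tagged with the position of the change, the job time, the well name, and the stage number. Filtering then selects the most appropriate predicted times for each stage. Although the first group of trials produced models that predict the 0/1 labels very accurately, the combined procedure generated a long list of predictions. The procedure treated any minor change in slurry rate and pressure as a start time and assigned a corresponding end time, as the accompanying figure shows. Additional features were therefore added to reduce the number of false matches.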

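In the second round of model training, the features are preprocessed with a simple moving average to smooth the curves. The moving average reduces minor fluctuations (noise) so that changes in trend stand out. The team adds rates of change as additional features and applies L1 regularization to model training. In logistic regression, L1 regularization is often used to produce models that ignore some features. The resulting models tend to be more robust and easier to interpret. The final logistic regression model has a training and validation accuracy of about 90%, and its flags are on average within 10 seconds of the manually picked ones.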

ISIP Flags

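ISIP is defined as the difference between the final injection pressure and the pressure drop caused by friction in the wellbore and the perforations or slotted liner. The ISIP flag sits at the end of the pumping stage, after shut-in and before the pressure starts to decline. It is estimated by placing a straight line on the early pressure decline and locating the point in time where the pressure rate is zero. ISIP picks vary across the industry as data collection becomes more and more constrained. Data collection often stops once the pumps are shut down, to save operating time (and therefore cost). As a result, only a few seconds of data are recorded after the rate drops to zero, and with no pressure decline it is hard to place the straight line. Many field engineers therefore pick ISIP at the first or second pressure spike of the water-hammer effect observed at shut-down, which produces inconsistent results.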

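For ISIP picking, cloud-based software was used to collect 870 stages of standardized data from basins across North America. Using the binary classification scheme described above, a neural network was trained to identify and isolate the region of the treating pressure plot needed to calculate ISIP. Points that belong to the target region are labeled 1; all others are labeled 0. Once the neural network identifies the target region, the time series is further truncated by removing outliers. In this procedure, a point is treated as an outlier if it lies more than one standard deviation from the average pressure of the region. Finally, linear regression is applied to the reduced time series to predict the ISIP value at a slurry rate of zero.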

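When isolating the target region, the neural network reached a classification accuracy of about 98% (on the training and validation sets). The ISIP predictions from the subsequent linear regression on the test set differed from the manually picked values by ±72 psi on average, with a median of ±35 psi, as the accompanying figure shows.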

Conclusions

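Automatically labeling the relevant regions of large numbers of hydraulic fracturing plots with classification techniques offers a simple and effective way to identify key events. Accurate flag picking makes it feasible to process large volumes of fracturing data and reduces the time spent reviewing field data for quality control. A limitation of the method is that it needs periodic retraining on new field data to maintain accuracy and improve robustness. Simple, accurate models are easy to deploy and debug, and they predict and retrain (update) extremely quickly.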


Machine Learning Helps Pinpoint Events From Fracturing Data

During hydraulic fracturing, pumping data is recorded and mapped in the field at one-second intervals and saved in comma-separated files (.CSV). The raw pumping data includes multiple data channels, including treating pressure, slurry rate, clean volume, and proppant concentration. Once the data is gathered, a field engineer manually picks events such as start and end times, breakdown pressure, instantaneous shut-in pressure (ISIP), and diverter drop from the treatment plots (Fig. 1). This manual process is time consuming, prone to error, and less reliable because of differing interpretation methods across the industry. To address this challenge, a Denver-based oilfield technology company is using machine learning to identify flags in high-frequency treating plots more accurately and consistently.

Methodology

The goal is to automate the event-picking process by training an algorithm against a large and diverse data set that has been labeled with the most common interpretations, rather than by explicitly programming the picking rules. For this, the raw data (.CSV files) are collected through cloud-based software that standardizes naming conventions and units, providing a clean data set and an efficient and effective way to visualize treating plots for optimal stage selection.

In approaching this problem, the team uses binary classification to build a model that distinguishes points, each labeled as 0 or 1. Fig. 2 shows an example of how this method is used to label the data that belongs to the stage pumping time (1s) and that which does not belong (0s). The data, which consist of features (independent variables) that are used by the model to predict the labels (dependent variable), are split into three data sets: training, validation, and test. The training algorithm builds a model that takes advantage of patterns and trends identified in the training data to classify (label) points accurately.

The validation data set, which has the same format as the training data but with fewer stages, provides an unbiased evaluation of various models constructed on the training data. The 0/1 predictions generated on the validation data set are used to build a confusion matrix and derive accuracy, precision, and recall. High accuracy, precision, and recall values are fundamental to accurate predictions. The test data set is used to evaluate the final model (and procedure) for predicting the final location of the flags. It has the same features as the training and validation sets but contains different stages; it is labeled with the correct location of the flags rather than with the binary classification column. The complete work flow is shown in Fig. 3.
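As a minimal sketch of the metrics named above, the following Python counts the four confusion-matrix cells for 0/1 labels and derives accuracy, precision, and recall from them. The function name `binary_metrics` and the sample labels are illustrative assumptions, not the team's code or data.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts plus accuracy, precision, recall for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # in-stage, predicted in-stage
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # out-of-stage, predicted out
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # out-of-stage, predicted in
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # in-stage, predicted out
    total = tp + tn + fp + fn
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical validation labels (1 = row belongs to the stage pumping time).
metrics = binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```

Precision penalizes rows wrongly marked as in-stage, while recall penalizes in-stage rows that were missed, which is why the article asks for all three values to be high.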

Start/End Flags

The designation of the stage start and end flags is critical because these boundaries govern summary calculations for that stage. The interval between the flags identifies the appropriate data to include when calculating the average and maximum pressures, rates, and concentrations, in addition to defining the cumulative volumes and pumping time, among others. For this process, the team uses treating pressure, slurry rate, and clean volume as the initial features to train a logistic regression classifier. The pumping data set consists of 179 stages, for a total of 1,530,445 rows of data per variable. Sixty-six percent of the data trains the model, 8% validates it, and the remaining 26% tests it. All three data sets include a variety of stages: some clean “textbook” stages with obvious start and end times and other “messy” stages beginning or ending with unusual behavior, which makes locating the start and end times much less clear.
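The article gives the 66/8/26 proportions but not how the split was drawn. One plausible approach, assumed here, is to split by whole stage so that rows from one stage never appear in two sets. The helper `split_stages`, its seed, and the integer stage IDs are hypothetical.

```python
import random


def split_stages(stage_ids, train_frac=0.66, val_frac=0.08, seed=0):
    """Shuffle stage IDs and cut them into train / validation / test lists.

    Assumption: the split is made per stage, not per row.
    """
    ids = list(stage_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # whatever remains (about 26%)
    return train, val, test


# 179 stages, as in the article; the IDs 0..178 are placeholders.
train_ids, val_ids, test_ids = split_stages(range(179))
```

Because stages differ in length, a per-stage split only approximates the quoted row percentages.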

Identifying the start and end times is achieved by taking the difference between consecutive rows of the predicted binary labels. Start times are identified when the difference is positive (labels switch from 0 to 1) while end times are identified when the difference is negative (labels switch from 1 to 0). This combined prediction and post-processing procedure (taking differences) generates a list of matched start and end times. Each matched pair of events is tagged with the position of the changes, the associated job time, the well name, and stage number. Then, some final filtering is applied to select the most appropriate predictions for each stage. Even though the first group of trials produced models that predict the 0/1 labels very accurately, the combined procedure generated a long list of predictions. The procedure recognized any minor change of slurry rate and pressure as a start time and assigned a corresponding end time (Fig. 4). Thus, additional features and applied techniques are added to reduce the number of false matched pairs generated.
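The differencing step can be sketched in a few lines of Python. A +1 difference between consecutive labels marks a start and a -1 difference marks an end. The article does not describe its final filter, so `longest_pair`, which keeps the longest interval, is only a hypothetical stand-in; both function names are assumptions.

```python
def matched_start_end(labels):
    """Pair each 0->1 switch (start) with the next 1->0 switch (end).

    Returns (start_index, end_index) tuples, where start_index is the first
    row labeled 1 and end_index is the first row labeled 0 after it.
    """
    pairs = []
    start = None
    for i in range(1, len(labels)):
        diff = labels[i] - labels[i - 1]
        if diff > 0:                          # label switched from 0 to 1
            start = i
        elif diff < 0 and start is not None:  # label switched from 1 to 0
            pairs.append((start, i))
            start = None
    return pairs


def longest_pair(pairs):
    """Hypothetical filter: keep the longest matched interval, if any."""
    return max(pairs, key=lambda p: p[1] - p[0]) if pairs else None


# Illustrative predicted labels with one spurious short interval.
predicted = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
pairs = matched_start_end(predicted)
best = longest_pair(pairs)
```

With one-second sampling, the row index is also the job time in seconds, so a pair translates directly into start and end times.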

In the second round of model training, the features are preprocessed using a simple moving average to smooth the data. The moving average transforms the time series to better highlight changes in trend by reducing minor fluctuations (noise). The team adds rates of change as additional features and applies L1-regularization to the model training. L1-regularization is often used in logistic regression to produce models that ignore less-relevant features. The resulting models tend to be more robust and easier to interpret. The final logistic regression model has a training and validation accuracy of approximately 90%, while the placement of the flags on the test set is within 10 seconds of the manually selected flags on average (Fig. 5).
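A minimal sketch of the two preprocessing steps, assuming a trailing window and one-second sampling: the window length, the function names, and the sample pressures are illustrative, since the article does not state the values the team used.

```python
def moving_average(values, window):
    """Trailing simple moving average.

    The first window-1 points average only the samples seen so far,
    so the output has the same length as the input.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the sample leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


def rate_of_change(values, dt=1.0):
    """First difference per time step; the first row has no predecessor, so 0.0."""
    return [0.0] + [(values[i] - values[i - 1]) / dt for i in range(1, len(values))]


# Hypothetical pressure samples recorded one second apart.
pressure = [1, 2, 3, 4, 5]
smooth = moving_average(pressure, window=3)
slope = rate_of_change(smooth)
```

Computing the rate of change on the smoothed series rather than the raw one keeps single-sample noise from showing up as large derivative spikes.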

ISIP Flags

ISIP is defined as the difference between the final injection pressure and the drop in pressure caused by friction in the wellbore and perforations or slotted liner. The ISIP flag is placed at the end of the stage pumping time, immediately after shut-in and before the pressure starts to drop. This is estimated by placing a straight line on the early pressure decline and locating the point in time where the pressure rate is zero. ISIP picks have varied in the industry as data collection becomes more and more constrained. Quite often, data collection ceases after the pumps are shut down to save operational time (and, therefore, money). Thus, only a few seconds are recorded after the rate is zero, making it very difficult to place a straight line because there is no pressure decline. For this reason, many field engineers pick ISIP at the first or second spike of the water-hammer effect observed at shut-down, making this process inconsistent.

For ISIP, 870 stages of standardized data were collected from basins all over North America through a cloud-based software application. Using a binary classification scheme similar to the one previously described (Fig. 6a), a neural network was trained to identify and isolate the area of the treating pressure plot needed to calculate ISIP. Points are labeled with 1s if they belong to the target area of the treating pressure plot and 0s otherwise. When the neural network identifies the target region, the time series is further truncated by removing outliers. For this procedure, outliers are identified as points that are more than one standard deviation from the average pressure value of the isolated region. Finally, linear regression is applied to the reduced time series to predict the ISIP value when the slurry rate is zero (Fig. 6b).
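The outlier filter and the regression can be sketched as follows, under two assumptions the article leaves open: pressure is regressed on time and evaluated at the moment the slurry rate reaches zero, and the population standard deviation is used. The function names and the sample data are hypothetical.

```python
from statistics import mean, pstdev


def drop_outliers(times, pressures):
    """Keep points within one standard deviation of the mean pressure."""
    avg = mean(pressures)
    sd = pstdev(pressures)  # assumption: population standard deviation
    return [(t, p) for t, p in zip(times, pressures) if abs(p - avg) <= sd]


def fit_line(points):
    """Ordinary least squares fit of pressure against time: (slope, intercept)."""
    ts = [t for t, _ in points]
    ps = [p for _, p in points]
    t_bar, p_bar = mean(ts), mean(ps)
    sxx = sum((t - t_bar) ** 2 for t in ts)
    sxy = sum((t - t_bar) * (p - p_bar) for t, p in points)
    slope = sxy / sxx
    return slope, p_bar - slope * t_bar


def estimate_isip(times, pressures, shut_in_time):
    """Evaluate the fitted line at the time the rate reaches zero (assumed)."""
    slope, intercept = fit_line(drop_outliers(times, pressures))
    return slope * shut_in_time + intercept


# Hypothetical isolated region: a steady decline with one water-hammer spike.
times = [0, 1, 2, 3, 4, 5]
pressures = [5000, 4990, 4980, 6000, 4960, 4950]
isip = estimate_isip(times, pressures, shut_in_time=0)
```

In this example the 6000 psi spike is the only point more than one standard deviation from the mean, so the fit uses the remaining five points.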

The neural network achieved a classification accuracy (on the training and validation sets) of approximately 98% when isolating the target region. The subsequent ISIP predictions from the linear regression on the test data set had an average accuracy of ±72 psi and a median accuracy of ±35 psi when compared with the manually picked values (Fig. 7).

Takeaways

Automatically labeling relevant regions of high-frequency hydraulic fracturing treatment plots using classification techniques can lead to simple and effective procedures for identifying events of interest. Accurate flag selection makes processing large volumes of fracture treatment data viable and reduces the time spent reviewing field data for quality control. A limitation of this method is that it requires periodic retraining with new field data to maintain accuracy and improve the robustness of the prediction. The benefits of using simple (and accurate) models include ease of deployment, ease of debugging, and extremely fast prediction and retraining (updating the model).
