When using a new library such as numpy, you can improve you developer effectiveness by being familiar with the most common questions that come up when using the python numpy library. Using the Stack Overflow Data Explorer tool, we’ve determined the top 10 most popular numpy questions & answers by daily views on Stack Overflow to to be familiar with. Check out the top 10 numpy questions & answers below:
Looking to get a head start on your next software interview? Pickup a copy of the best book to prepare: Cracking The Coding Interview!
1. Unable to allocate array with shape and data type?
This is likely due to your system’s overcommit handling mode.
In the default mode, 0
,
Heuristic overcommit handling. Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to reduce swap usage. The root is allowed to allocate slightly more memory in this mode. This is the default.
The exact heuristic used is not well explained here, but this is discussed more on Linux over commit heuristic and on this page.
You can check your current overcommit mode by running
$ cat /proc/sys/vm/overcommit_memory
0
In this case, you’re allocating
>>> 156816 * 36 * 53806 / 1024.0**3
282.8939827680588
~282 GB and the kernel is saying well obviously there’s no way I’m going to be able to commit that many physical pages to this, and it refuses the allocation.
If (as root) you run:
$ echo 1 > /proc/sys/vm/overcommit_memory
This will enable the "always overcommit" mode, and you’ll find that indeed the system will allow you to make the allocation no matter how large it is (within 64-bit memory addressing at least).
I tested this myself on a machine with 32 GB of RAM. With overcommit mode 0
I also got a MemoryError
, but after changing it back to 1
it works:
>>> import numpy as np
>>> a = np.zeros((156816, 36, 53806), dtype='uint8')
>>> a.nbytes
303755101056
You can then go ahead and write to any location within the array, and the system will only allocate physical pages when you explicitly write to that page. So you can use this, with care, for sparse arrays.
2. Valueerror: numpy.ndarray size changed, may indicate binary incompatibility. expected 88 from c header, got 80 from pyobject?
I’m in Python 3.8.5. It sounds too simple to be real, but I had this same issue and all I did was reinstall numpy. Gone.
pip install --upgrade numpy
or
pip uninstall numpy
pip install numpy
3. Convert pandas dataframe to numpy array?
df.to_numpy()
is better than df.values
, here’s why.*
It’s time to deprecate your usage of values
and as_matrix()
.
pandas v0.24.0
introduced two new methods for obtaining NumPy arrays from pandas objects:
to_numpy()
, which is defined onIndex
,Series
, andDataFrame
objects, andarray
, which is defined onIndex
andSeries
objects only.
If you visit the v0.24 docs for .values
, you will see a big red warning that says:
Warning: We recommend using
DataFrame.to_numpy()
instead.
See this section of the v0.24.0 release notes, and this answer for more information.
* – to_numpy()
is my recommended method for any production code that needs to run reliably for many versions into the future. However if you’re just making a scratchpad in jupyter or the terminal, using .values
to save a few milliseconds of typing is a permissable exception. You can always add the fit n finish later.
Towards Better Consistency: to_numpy()
In the spirit of better consistency throughout the API, a new method to_numpy
has been introduced to extract the underlying NumPy array from DataFrames.
# Setup
df = pd.DataFrame(data={'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
index=['a', 'b', 'c'])
# Convert the entire DataFrame
df.to_numpy()
# array([[1, 4, 7],
# [2, 5, 8],
# [3, 6, 9]])
# Convert specific columns
df[['A', 'C']].to_numpy()
# array([[1, 7],
# [2, 8],
# [3, 9]])
As mentioned above, this method is also defined on Index
and Series
objects (see here).
df.index.to_numpy()
# array(['a', 'b', 'c'], dtype=object)
df['A'].to_numpy()
# array([1, 2, 3])
By default, a view is returned, so any modifications made will affect the original.
v = df.to_numpy()
v[0, 0] = -1
df
A B C
a -1 4 7
b 2 5 8
c 3 6 9
If you need a copy instead, use to_numpy(copy=True)
.
pandas >= 1.0 update for ExtensionTypes
If you’re using pandas 1.x, chances are you’ll be dealing with extension types a lot more. You’ll have to be a little more careful that these extension types are correctly converted.
a = pd.array([1, 2, None], dtype="Int64")
a
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64
# Wrong
a.to_numpy()
# array([1, 2, <NA>], dtype=object) # yuck, objects
# Correct
a.to_numpy(dtype='float', na_value=np.nan)
# array([ 1., 2., nan])
# Also correct
a.to_numpy(dtype='int', na_value=-1)
# array([ 1, 2, -1])
This is called out in the docs.
If you need the dtypes
in the result…
As shown in another answer, DataFrame.to_records
is a good way to do this.
df.to_records()
# rec.array([('a', 1, 4, 7), ('b', 2, 5, 8), ('c', 3, 6, 9)],
# dtype=[('index', 'O'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
This cannot be done with to_numpy
, unfortunately. However, as an alternative, you can use np.rec.fromrecords
:
v = df.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())
# rec.array([('a', 1, 4, 7), ('b', 2, 5, 8), ('c', 3, 6, 9)],
# dtype=[('index', '<U1'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
Performance wise, it’s nearly the same (actually, using rec.fromrecords
is a bit faster).
df2 = pd.concat([df] * 10000)
%timeit df2.to_records()
%%timeit
v = df2.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())
12.9 ms ± 511 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.56 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Rationale for Adding a New Method
to_numpy()
(in addition to array
) was added as a result of discussions under two GitHub issues GH19954 and GH23623.
Specifically, the docs mention the rationale:
[…] with
.values
it was unclear whether the returned value would be the
actual array, some transformation of it, or one of pandas custom
arrays (likeCategorical
). For example, withPeriodIndex
,.values
generates a newndarray
of period objects each time. […]
to_numpy
aims to improve the consistency of the API, which is a major step in the right direction. .values
will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate towards the newer API, as soon as you can.
Critique of Other Solutions
DataFrame.values
has inconsistent behaviour, as already noted.
DataFrame.get_values()
is simply a wrapper around DataFrame.values
, so everything said above applies.
DataFrame.as_matrix()
is deprecated now, do NOT use!
4. Creating a pandas dataframe from a numpy array: how do i specify the index column and column headers?
You need to specify data
, index
and columns
to DataFrame
constructor, as in:
>>> pd.DataFrame(data=data[1:,1:], # values
... index=data[1:,0], # 1st column as index
... columns=data[0,1:]) # 1st row as the column names
edit: as in the @joris comment, you may need to change above to np.int_(data[1:,1:])
to have correct data type.
5. Most efficient way to map function over numpy array?
I’ve tested all suggested methods plus np.array(map(f, x))
with perfplot
(a small project of mine).
Message #1: If you can use numpy’s native functions, do that.
If the function you’re trying to vectorize already is vectorized (like the x**2
example in the original post), using that is much faster than anything else (note the log scale):
If you actually need vectorization, it doesn’t really matter much which variant you use.
Code to reproduce the plots:
import numpy as np
import perfplot
import math
def f(x):
# return math.sqrt(x)
return np.sqrt(x)
vf = np.vectorize(f)
def array_for(x):
return np.array([f(xi) for xi in x])
def array_map(x):
return np.array(list(map(f, x)))
def fromiter(x):
return np.fromiter((f(xi) for xi in x), x.dtype)
def vectorize(x):
return np.vectorize(f)(x)
def vectorize_without_init(x):
return vf(x)
perfplot.show(
setup=np.random.rand,
n_range=[2 ** k for k in range(20)],
kernels=[f, array_for, array_map, fromiter,
vectorize, vectorize_without_init],
xlabel="len(x)",
)
6. Pandas create new column based on values from other columns / apply a function of multiple columns, row-wise?
OK, two steps to this – first is to write a function that does the translation you want – I’ve put an example together based on your pseudo-code:
def label_race (row):
if row['eri_hispanic'] == 1 :
return 'Hispanic'
if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
return 'Two Or More'
if row['eri_nat_amer'] == 1 :
return 'A/I AK Native'
if row['eri_asian'] == 1:
return 'Asian'
if row['eri_afr_amer'] == 1:
return 'Black/AA'
if row['eri_hawaiian'] == 1:
return 'Haw/Pac Isl.'
if row['eri_white'] == 1:
return 'White'
return 'Other'
You may want to go over this, but it seems to do the trick – notice that the parameter going into the function is considered to be a Series object labelled “row”.
Next, use the apply function in pandas to apply the function – e.g.
df.apply (lambda row: label_race(row), axis=1)
Note the axis=1 specifier, that means that the application is done at a row, rather than a column level. The results are here:
0 White
1 Hispanic
2 White
3 White
4 Other
5 White
6 Two Or More
7 White
8 Haw/Pac Isl.
9 White
If you’re happy with those results, then run it again, saving the results into a new column in your original dataframe.
df['race_label'] = df.apply (lambda row: label_race(row), axis=1)
The resultant dataframe looks like this (scroll to the right to see the new column):
lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian eri_hispanic eri_nat_amer eri_white rno_defined race_label
0 MOST JEFF E 0 0 0 0 0 1 White White
1 CRUISE TOM E 0 0 0 1 0 0 White Hispanic
2 DEPP JOHNNY NaN 0 0 0 0 0 1 Unknown White
3 DICAP LEO NaN 0 0 0 0 0 1 Unknown White
4 BRANDO MARLON E 0 0 0 0 0 0 White Other
5 HANKS TOM NaN 0 0 0 0 0 1 Unknown White
6 DENIRO ROBERT E 0 1 0 0 0 1 White Two Or More
7 PACINO AL E 0 0 0 0 0 1 White White
8 WILLIAMS ROBIN E 0 0 1 0 0 0 White Haw/Pac Isl.
9 EASTWOOD CLINT E 0 0 0 0 0 1 White White
7. How to count the occurrence of certain item in an ndarray?
a = numpy.array([0, 3, 0, 1, 0, 1, 2, 1, 0, 0, 0, 0, 1, 3, 4])
unique, counts = numpy.unique(a, return_counts=True)
dict(zip(unique, counts))
# {0: 7, 1: 4, 2: 1, 3: 2, 4: 1}
Non-numpy way:
Use collections.Counter
;
import collections, numpy
a = numpy.array([0, 3, 0, 1, 0, 1, 2, 1, 0, 0, 0, 0, 1, 3, 4])
collections.Counter(a)
# Counter({0: 7, 1: 4, 3: 2, 2: 1, 4: 1})
8. Valueerror: setting an array element with a sequence?
From the code you showed us, the only thing we can tell is that you are trying to create an array from a list that isn’t shaped like a multi-dimensional array. For example
numpy.array([[1,2], [2, 3, 4]])
or
numpy.array([[1,2], [2, [3, 4]]])
will yield this error message, because the shape of the input list isn’t a (generalised) “box” that can be turned into a multidimensional array. So probably UnFilteredDuringExSummaryOfMeansArray
contains sequences of different lengths.
Edit: Another possible cause for this error message is trying to use a string as an element in an array of type float
:
numpy.array([1.2, "abc"], dtype=float)
That is what you are trying according to your edit. If you really want to have a NumPy array containing both strings and floats, you could use the dtype object
, which enables the array to hold arbitrary Python objects:
numpy.array([1.2, "abc"], dtype=object)
Without knowing what your code shall accomplish, I can’t judge if this is what you want.
9. Typeerror: only integer scalar arrays can be converted to a scalar index with 1d numpy indices array?
Perhaps the error message is somewhat misleading, but the gist is that X_train
is a list, not a numpy array. You cannot use array indexing on it. Make it an array first:
out_images = np.array(X_train)[indices.astype(int)]
10. Valueerror: the truth value of an array with more than one element is ambiguous. use a.any() or a.all()?
r
is a numpy (rec)array. So r["dt"] >= startdate
is also a (boolean)
array. For numpy arrays the &
operation returns the elementwise-and of the two
boolean arrays.
The NumPy developers felt there was no one commonly understood way to evaluate
an array in boolean context: it could mean True
if any element is
True
, or it could mean True
if all elements are True
, or True
if the array has non-zero length, just to name three possibilities.
Since different users might have different needs and different assumptions, the
NumPy developers refused to guess and instead decided to raise a ValueError
whenever one tries to evaluate an array in boolean context. Applying and
to
two numpy arrays causes the two arrays to be evaluated in boolean context (by
calling __bool__
in Python3 or __nonzero__
in Python2).
Your original code
mask = ((r["dt"] >= startdate) & (r["dt"] <= enddate))
selected = r[mask]
looks correct. However, if you do want and
, then instead of a and b
use (a-b).any()
or (a-b).all()
.
Elevate your software skills
Ergonomic Mouse |
Custom Keyboard |
SW Architecture |
Clean Code |